Regression analysis with a random forest in Python scikit-learn

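　This article describes regression analysis using sklearn's RandomForestRegressor. A random forest builds many decision trees and averages their predictions. The algorithm was devised because a single decision tree tends to overfit as it gets deeper. The data used is the Boston-area housing price dataset, which was also used in the earlier post "Python scikit-learn(サイキットラーン)による重回帰分析 - HK29’s blog".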
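To make the "average of many trees" idea concrete, here is a minimal sketch. It uses synthetic data from make_regression (an assumption for illustration, not the Boston dataset) and checks that the forest's prediction equals the mean of the predictions of its individual trees.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration; not the Boston dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

forest = RandomForestRegressor(n_estimators=6, max_depth=4, random_state=0)
forest.fit(X, y)

# Predict with each fitted tree separately, then average across the trees
per_tree = np.array([tree.predict(X) for tree in forest.estimators_])
averaged = per_tree.mean(axis=0)

# Check whether the forest's own prediction matches the average of its trees
print(np.allclose(averaged, forest.predict(X)))
```

The fitted trees are available through the estimators_ attribute, so each one can also be inspected individually.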

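First, the result of testing the random forest is shown in the figure below. The vertical axis is the coefficient of determination R^2 and the horizontal axis is the run number of the random forest, where the tree depth range is set to 1-9 and the number of trees to 5-10, both increased step by step. The coefficient of determination (goodness of fit) can be seen approaching 1.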

f:id:HK29:20180522001628p:plain
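For reference, the R^2 on the vertical axis can be reproduced by hand. The sketch below uses synthetic data (an assumption, not the Boston dataset) and compares a manual calculation, one minus the residual sum of squares divided by the total sum of squares, with the value returned by the score method.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration; not the Boston dataset)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
regr = RandomForestRegressor(n_estimators=6, max_depth=4, random_state=0).fit(X, y)
pred = regr.predict(X)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

# The two values should agree up to floating-point error
print(r2_manual, regr.score(X, y))
```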

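The change in R^2 in the figure above is compared at three points: runs 5, 15, and 35. First, run 5, with an R^2 of 73%, is shown in the figure below. The vertical axis is the house price and the horizontal axis is the number of rooms. Blue points are the raw data, and red points are obtained by feeding the raw X values into the regression model produced by the fit. This is the case with a tree depth of 2 and 6 trees.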

f:id:HK29:20180522001922p:plain

Next, the figure below shows run 15, with an R^2 of 89%. This is the case with a tree depth of 4 and 6 trees.

f:id:HK29:20180522002102p:plain

Finally, the figure below shows run 35, with an R^2 of 95%. This is the case with a tree depth of 8 and 6 trees.

f:id:HK29:20180522011616p:plain

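These three figures show that the fit to the data improves gradually as the trees get deeper. Overfitting is a concern, but that cannot be concluded outright, because it is only in the third figure that the fit in the central region of the data (as opposed to the outlier points) finally becomes good.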
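One common way to probe the overfitting question is to score the model on data it was not fitted on. The R^2 values in this article are computed on the same data used for fitting, so the following is a separate sketch, not the method used here. It assumes synthetic data and a 70/30 split with train_test_split, and compares training and held-out R^2 at a few depths.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration; not the Boston dataset)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = []
for depth in (2, 4, 8):
    regr = RandomForestRegressor(n_estimators=6, max_depth=depth, random_state=0)
    regr.fit(X_train, y_train)
    train_r2 = regr.score(X_train, y_train)  # R^2 on the fitting data
    test_r2 = regr.score(X_test, y_test)     # R^2 on held-out data
    results.append((depth, train_r2, test_r2))
    print(f"max_depth={depth}: train R^2={train_r2:.3f}, test R^2={test_r2:.3f}")
```

A training R^2 that keeps rising while the held-out R^2 stalls or drops would point toward overfitting.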

As a rough guide, a value of about 5 for both the tree depth and the number of trees appears to be a reasonable setting.

▼ The program used in this article is shown below.

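The main block at the bottom specifies the ranges for the tree depth and the number of trees. The values are passed to range(), so the upper bound of each range is not itself tried.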

#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.ensemble import RandomForestRegressor

def load_dataset():
    # Note: load_boston() was removed in scikit-learn 1.2,
    # so this function needs an older scikit-learn version.
    boston = datasets.load_boston()
    boston_df = pd.DataFrame(boston.data, columns=boston.feature_names)
    boston_df['PRICE'] = boston.target
    # Split into explanatory variables (X_df) and the target (PRICE)
    X_df = boston_df.drop("PRICE", axis=1)
    Y_df = boston_df.loc[:, ['PRICE']]
    # Plot the features and the target against the row index as a quick check
    X_df.plot()
    Y_df.plot()
    plt.grid()
    plt.show()
    Y = boston_df.PRICE
    print("X_df ->" + str(X_df))
    print("Y ->" + str(Y))
    return X_df, Y

def execute(data, num_of_trees, max_depth):
    X_df, Y = data
    scores = []
    cnt = 0  # run counter, shown in each plot title
    # range() excludes the stop value, so the upper bounds are not tried
    for j in range(max_depth[0], max_depth[1]):
        for i in range(num_of_trees[0], num_of_trees[1]):
            regr = RandomForestRegressor(n_estimators=i, max_depth=j, random_state=0)
            regr.fit(X_df, Y)
            # R^2 on the same data used for fitting (not a held-out score)
            score = regr.score(X_df, Y)
            scores.append(score)
            pred_Y = regr.predict(X_df)

            print("regr ->" + str(regr))  # show the model settings without refitting
            print(score) # R^2

            # Raw data (blue) versus model predictions (red) against RM
            plt.figure(1)
            plt.title('Comparison' + str(cnt))
            plt.xlabel('RM (number of rooms)', fontsize=14)
            plt.ylabel('PRICE (target)', fontsize=14)
            plt.scatter(X_df["RM"], Y, c='blue', label='Raw data')
            plt.scatter(X_df["RM"], pred_Y, c='red', label='RandomForest')
            plt.legend(loc='lower right', fontsize=12)
            plt.show()
            cnt += 1

    # R^2 for every run, in the order the runs were executed
    plt.plot(scores, color="red")
    plt.xlabel('Number of Cycle', fontsize=14)
    plt.ylabel('R^2', fontsize=14)
    plt.grid()
    plt.show()
    
if __name__ == "__main__":
    num_of_trees = [5, 10] # [start, stop): passed to range(), stop is excluded
    max_depth = [1, 9]     # [start, stop): passed to range(), stop is excluded
    
    data=load_dataset()
    execute(data, num_of_trees, max_depth)
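
The order in which the nested loops in execute visit the settings can be listed without fitting any model. This is a small sketch using itertools.product with the same range values as the main block. The variable name runs is introduced here for illustration only.

```python
from itertools import product

num_of_trees = [5, 10]  # same values as in the main block
max_depth = [1, 9]

# Outer loop over depth, inner loop over the number of trees, as in execute()
runs = list(product(range(max_depth[0], max_depth[1]),
                    range(num_of_trees[0], num_of_trees[1])))

print(len(runs))  # 40 runs: depths 1..8 times 5..9 trees
print(runs[0])    # (1, 5): first run as (depth, number of trees)
print(runs[-1])   # (8, 9): last run
```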