RandomForest Hyper Parameters and Attributes

RandomForest

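- Builds many decision trees (a forest), trains each of them, and combines their outputs by averaging (regression) or voting (classification) to produce the final prediction

- Mitigates the overfitting problem of a single decision tree

- Bootstrap sampling: each tree is trained on N samples drawn at random, with replacement, from the N training samples

- At each split, only a randomly chosen subset of features is considered instead of all features

- Because each tree is trained on different data and different features, the resulting set of trees is highly diverse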
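
The two sources of randomness described above can be sketched with NumPy alone. This is only an illustration of the sampling idea, not scikit-learn's internal code; the sizes (1000 rows, 16 features) and the sqrt rule for the number of candidate features are assumptions chosen for the example. A bootstrap sample of size N is expected to contain about 63.2% (1 - 1/e) of the distinct rows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1000, 16  # hypothetical dataset size

# Bootstrap sample: draw n_samples row indices with replacement.
boot_idx = rng.integers(0, n_samples, size=n_samples)
unique_fraction = np.unique(boot_idx).size / n_samples

# Rows that were never drawn are "out-of-bag" (OOB) for this tree.
oob_idx = np.setdiff1d(np.arange(n_samples), boot_idx)

# Random feature subset for one split: sqrt(n_features) candidates, no repeats.
n_candidates = int(np.sqrt(n_features))
candidate_features = rng.choice(n_features, size=n_candidates, replace=False)

print(f"distinct rows in the bootstrap sample: {unique_fraction:.1%}")
print(f"out-of-bag rows: {oob_idx.size}")
print(f"candidate features at this split: {sorted(candidate_features.tolist())}")
```

Each tree would repeat this with its own random draws, which is where the diversity of the forest comes from.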

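Advantages

- Combining many trees generalizes better than a single decision tree

- Random sampling and random feature selection make overfitting less likely than with a single tree

- The algorithm can handle both numerical and categorical features (in scikit-learn, categorical features must be encoded as numbers first)

- Reasonably robust to missing values (how they are handled depends on the implementation)

- You can inspect which features the model considers important

- The trees are independent of each other, so they can be trained in parallel (a possible speed-up)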

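Disadvantages

- Slow prediction

- Every tree must be evaluated, so prediction can take a long time when the forest is large

- With many trees the model becomes large and hard to interpret (a black box)

- Every tree is stored in memory, and each tree is trained on its own sample of the data, so memory use can be a burden on large datasets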

Hyperparameters

# Default values below follow recent scikit-learn 1.x releases.
# Older releases showed n_estimators='warn', max_features='auto',
# criterion='mse' and min_impurity_split=None; these are no longer accepted.
# Parameters added in later releases (e.g. ccp_alpha, max_samples) are omitted.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

RandomForestClassifier(
    n_estimators=100,
    criterion='gini',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features='sqrt',
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
    warm_start=False,
    class_weight=None,
)

RandomForestRegressor(
    n_estimators=100,
    criterion='squared_error',
    max_depth=None,
    min_samples_split=2,
    min_samples_leaf=1,
    min_weight_fraction_leaf=0.0,
    max_features=1.0,
    max_leaf_nodes=None,
    min_impurity_decrease=0.0,
    bootstrap=True,
    oob_score=False,
    n_jobs=None,
    random_state=None,
    verbose=0,
    warm_start=False,
)

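Hyperparameters shared with decision trees

- criterion : the function used to measure the quality of a split

- splitter : the node splitting strategy; this is a DecisionTree parameter that RandomForest does not expose (note that it is absent from the signatures above)

- max_depth : maximum depth of each tree

- min_samples_split : minimum number of samples required to split a node

- min_samples_leaf : minimum number of samples required at a leaf node

- min_weight_fraction_leaf : minimum fraction of the total sample weight required at a leaf node

- max_features : maximum number of features considered when looking for a split

- max_leaf_nodes : maximum number of leaf nodes

- min_impurity_decrease : a node is split only if the split reduces impurity by at least this amount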
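
As a minimal sketch of how these tree-level settings are passed through to every tree in the forest, the example below fits a small classifier on the iris dataset and inspects the fitted trees. The dataset and the specific values (`max_depth=3`, `min_samples_leaf=5`, and so on) are assumptions picked for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Tree-level hyperparameters are applied to each tree in the ensemble.
model = RandomForestClassifier(
    n_estimators=50,
    max_depth=3,
    min_samples_split=4,
    min_samples_leaf=5,
    max_features="sqrt",
    random_state=0,
)
model.fit(X, y)

# Each fitted tree is a DecisionTreeClassifier, so its depth and
# leaf count can be queried directly.
depths = [tree.get_depth() for tree in model.estimators_]
leaves = [tree.get_n_leaves() for tree in model.estimators_]

print("deepest tree:", max(depths))
print("largest number of leaves:", max(leaves))
```

No tree should exceed the requested `max_depth`, which is a quick way to confirm that the limit was applied.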

RandomForest-specific hyperparameters

n_estimators

- Number of trees in the ensemble
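
One rough way to see the effect of `n_estimators` is to compare cross-validated accuracy for a few forest sizes. The sketch below assumes the iris dataset and the candidate values 5, 50 and 200; on such a small dataset the differences may be slight, and the scores are printed rather than claimed in advance.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

mean_scores = {}
for n in (5, 50, 200):
    model = RandomForestClassifier(n_estimators=n, random_state=0)
    # 5-fold cross-validated accuracy for this forest size.
    mean_scores[n] = cross_val_score(model, X, y, cv=5).mean()
    print(f"n_estimators={n:>3}  mean accuracy={mean_scores[n]:.3f}")

# After fitting, the forest holds exactly n_estimators trees.
model.fit(X, y)
n_trees = len(model.estimators_)
print("trees in the last model:", n_trees)
```

More trees generally make the prediction more stable at the cost of longer training and prediction time.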

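bootstrap

- Whether to use bootstrap samples when building the trees

- True : each tree sees a different resampled dataset, which strengthens the ensemble's protection against overfitting

- False : every tree is trained on the entire dataset, so the trees differ only through random feature selection, become more alike, and overfit more easily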

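oob_score

- Whether to use OOB (Out-of-Bag) samples to estimate the model's generalization accuracy

- When set to True, the samples not selected by the bootstrap are used for validation, so a separate validation set is not required (only available with bootstrap=True)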
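
The interaction between `bootstrap` and `oob_score` can be checked with a short sketch. It assumes the iris dataset and 200 trees; the second part relies on scikit-learn raising a `ValueError` at fit time when OOB scoring is requested without bootstrapping.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# With bootstrap=True, each tree leaves some rows out; those rows are
# used to score the forest without a separate validation set.
model = RandomForestClassifier(
    n_estimators=200, bootstrap=True, oob_score=True, random_state=0
)
model.fit(X, y)
print("OOB accuracy estimate:", model.oob_score_)

# Without bootstrapping there are no out-of-bag rows, so fit() refuses.
try:
    RandomForestClassifier(bootstrap=False, oob_score=True).fit(X, y)
    raised = False
except ValueError as exc:
    raised = True
    print("error:", exc)
```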

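warm_start

- Whether to reuse the result of the previous fit

- True → keeps the trees from the previous fit and trains only the additional trees

- False → always trains a brand-new random forest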
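
A minimal sketch of `warm_start`, assuming the iris dataset: the forest is first fit with 10 trees, then `n_estimators` is raised to 30 and `fit` is called again. With `warm_start=True` the first 10 trees are expected to be kept as they are and only 20 new trees are trained.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=10, warm_start=True, random_state=0)
model.fit(X, y)
first_trees = list(model.estimators_)  # keep references to the first 10 trees

# Increase the number of trees and fit again; existing trees are reused.
model.set_params(n_estimators=30)
model.fit(X, y)

reused = all(old is new for old, new in zip(first_trees, model.estimators_))
print("trees after second fit:", len(model.estimators_))
print("first 10 trees reused:", reused)
```

This can be handy when growing a forest step by step while monitoring a score, since earlier trees are not retrained.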

Attributes

    estimators_ : list of DecisionTreeClassifier
        The collection of fitted sub-estimators.

    classes_ : array of shape = [n_classes] or a list of such arrays
        The classes labels (single output problem), or a list of arrays of
        class labels (multi-output problem).

    n_classes_ : int or list
        The number of classes (single output problem), or a list containing the
        number of classes for each output (multi-output problem).

    n_features_ : int
        The number of features when ``fit`` is performed.

    n_outputs_ : int
        The number of outputs when ``fit`` is performed.

    feature_importances_ : array of shape = [n_features]
        The feature importances (the higher, the more important the feature).

    oob_score_ : float
        Score of the training dataset obtained using an out-of-bag estimate.

    oob_decision_function_ : array of shape = [n_samples, n_classes]
        Decision function computed with out-of-bag estimate on the training
        set. If n_estimators is small it might be possible that a data point
        was never left out during the bootstrap. In this case,
        `oob_decision_function_` might contain NaN.

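estimators_

- List of the fitted sub-trees (decision trees)

- Use it to inspect each tree object that makes up the random forest

classes_

- Class label information

n_classes_

- Number of classes

n_features_

- Number of features seen during fit (scikit-learn 1.0 and later expose this as n_features_in_; n_features_ has been removed from recent releases)

n_outputs_

- Number of model outputs when fit is performed

feature_importances_

- Importance of each feature

- The larger the value, the more the feature contributes to the model's decisions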
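
The attributes above can be read directly from a fitted model. The sketch below assumes the iris dataset and a recent scikit-learn release, which is why it reads `n_features_in_` rather than the older `n_features_`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(iris.data, iris.target)

print("number of trees:", len(model.estimators_))
print("classes_:", model.classes_)
print("n_classes_:", model.n_classes_)
print("n_features_in_:", model.n_features_in_)
print("n_outputs_:", model.n_outputs_)

# Impurity-based importances, one value per feature, listed from
# most to least important.
importances = model.feature_importances_
for idx in np.argsort(importances)[::-1]:
    print(f"{iris.feature_names[idx]:<20} {importances[idx]:.3f}")
```

The importances are normalized, so they sum to 1 across all features.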

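oob_score_

- Score estimated on the training data using OOB (Out-of-Bag) samples (accuracy for the classifier, R^2 for the regressor)

oob_decision_function_

- Decision function values computed from the OOB samples

- If n_estimators is small, some data points may never be left out during bootstrapping

- In that case oob_decision_function_ may contain NaN values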
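
A small sketch of reading `oob_decision_function_`, assuming the iris dataset and 100 trees. With that many trees every row is almost certain to be out-of-bag for at least one tree, so the check for NaN rows is expected to find none; with very few trees the result may differ, and scikit-learn warns in that situation.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

model = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
model.fit(X, y)

# One row per training sample, one column per class: class probabilities
# averaged over the trees for which that sample was out-of-bag.
oob_proba = model.oob_decision_function_
n_nan_rows = int(np.isnan(oob_proba).any(axis=1).sum())

print("shape:", oob_proba.shape)
print("rows containing NaN:", n_nan_rows)
print("first row:", oob_proba[0])
```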
