Model scores - accuracy, f1, rmse, roc_auc: which one is the default?

Since last week I have been studying tree models and ensemble-based models,

and I got confused because each model uses a different score.

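First, for the DecisionTree and RandomForest classifiers, the default score() is the mean accuracy of the predictions on the given test set against its target labels. (For the regressor versions, score() returns R^2 instead.)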
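
To check that the default score really is plain accuracy, here is a minimal sketch. The iris dataset, the split, and the n_estimators value are my own assumptions for illustration and are not part of the original workflow.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Iris is only a stand-in dataset (assumption for this sketch).
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = []
for model in (DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(n_estimators=50, random_state=0)):
    model.fit(X_train, y_train)
    # score() on a classifier: mean accuracy on the data passed in
    default_score = model.score(X_test, y_test)
    # the same number computed explicitly from the predictions
    manual_accuracy = accuracy_score(y_test, model.predict(X_test))
    results.append((default_score, manual_accuracy))
    print(type(model).__name__, default_score, manual_accuracy)
```

Both columns should print the same value for each model, which is why no metric has to be specified when calling score() on these classifiers.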

For a classification model,

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

classifier = DecisionTreeClassifier()
classifier.fit(X_train, y_train)

y_pred = classifier.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[142   2]
 [  2 129]]
              precision   recall   f1-score   support

           0       0.99     0.99       0.99       144
           1       0.98     0.98       0.98       131

 avg / total       0.99     0.99       0.99       275

With the confusion matrix and classification report above, you can read off the f1-score, precision, and recall for each class.
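
As a sanity check, the numbers in the report can be recomputed by hand from the confusion matrix. This is a small sketch in plain Python, assuming scikit-learn's convention that rows are true labels and columns are predicted labels; the helper name per_class_metrics is hypothetical.

```python
# Confusion matrix from the output above (rows: true label, columns: predicted label).
cm = [[142, 2],
      [2, 129]]

def per_class_metrics(cm, k):
    """Precision, recall and f1 for class k of a square confusion matrix."""
    tp = cm[k][k]
    fp = sum(cm[i][k] for i in range(len(cm)) if i != k)  # predicted k, actually another class
    fn = sum(cm[k][j] for j in range(len(cm)) if j != k)  # actually k, predicted another class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Accuracy is the diagonal divided by the total number of samples.
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
metrics_0 = per_class_metrics(cm, 0)
metrics_1 = per_class_metrics(cm, 1)

print("accuracy:", round(accuracy, 3))
print("class 0:", [round(m, 2) for m in metrics_0])
print("class 1:", [round(m, 2) for m in metrics_1])
```

Rounded to two decimals, the per-class values match the 0.99 and 0.98 rows of the report.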

If you want an overall score from cross-validation rather than from a single test set,

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier(random_state=0)
cross_val_score(clf, data, target, cv=5)

cross_val_score returns one score per fold (accuracy by default for a classifier), so averaging the returned array gives the cross-validated mean accuracy.
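
A runnable version of the same idea might look like the sketch below. The iris dataset is my assumption, standing in for data and target, and the second call shows that the scoring argument switches the metric away from the default.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption); replace with your own data and target.
data, target = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(random_state=0)

# Default scoring: the estimator's own score(), i.e. accuracy for a classifier.
acc_scores = cross_val_score(clf, data, target, cv=5)
# Same folds, but scored with macro-averaged f1 instead.
f1_scores = cross_val_score(clf, data, target, cv=5, scoring="f1_macro")

print(acc_scores)          # one accuracy per fold
print(acc_scores.mean())   # cross-validated mean accuracy
print(f1_scores.mean())    # cross-validated mean macro f1
```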

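For the xgboost library, the default evaluation metric is

rmse for regression models,
logloss for classification models,
and mean average precision for ranking problems (with the rank:map objective).
However, you can switch to many other metrics, including auc.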
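
To make the two most common defaults concrete, here is a plain-Python sketch of what rmse and logloss measure. These are the textbook definitions written out by hand, not xgboost's internal code, and the sample values are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the average squared difference."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def logloss(y_true, p_pred, eps=1e-15):
    """Binary log loss: average negative log-likelihood of the true labels."""
    total = 0.0
    for t, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Regression: targets 3 and 5, predictions 2 and 7 (made-up numbers)
rmse_value = rmse([3.0, 5.0], [2.0, 7.0])
# Binary classification: true labels 1 and 0, predicted P(class 1) of 0.9 and 0.2
logloss_value = logloss([1, 0], [0.9, 0.2])

print(rmse_value)
print(logloss_value)
```

Both are "lower is better" metrics, unlike accuracy or auc, which is worth keeping in mind when comparing numbers across libraries.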

from category_encoders import OrdinalEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from xgboost import XGBClassifier

processor = make_pipeline(
    OrdinalEncoder(), 
    SimpleImputer()
)

X_train_processed = processor.fit_transform(X_train)
X_val_processed = processor.transform(X_val)

eval_set = [(X_train_processed, y_train), 
            (X_val_processed, y_val)]

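# Recent xgboost versions take eval_metric and early_stopping_rounds
# in the constructor rather than in fit().
model = XGBClassifier(eval_metric='auc', early_stopping_rounds=10)
model.fit(X_train_processed, y_train, eval_set=eval_set)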

Just pass the metric you want as eval_metric.

In RandomizedSearchCV, the scoring can be changed to accuracy, f1, f1_macro, f1_micro, f1_samples, f1_weighted, roc_auc, and many others:

from sklearn.model_selection import RandomizedSearchCV
pipe_xg = make_pipeline(
    OrdinalEncoder(),
    XGBClassifier(n_estimators=100)
)

# Keys follow the pipeline convention: <lowercased step name>__<parameter>
dists = {
    'xgbclassifier__max_depth' : [5,6,7,8,9,11],
    'xgbclassifier__learning_rate' : [0.1,0.2,0.3],
}

clf = RandomizedSearchCV(
    pipe_xg,
    param_distributions = dists,
    n_iter = 5,
    cv = 3,
    scoring = 'roc_auc'
    )
    
clf.fit(X_train, y_train)

Just pass the metric you want as scoring.
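
For a version that runs without xgboost or category_encoders, the sketch below swaps in a DecisionTreeClassifier and a synthetic dataset. Both substitutions, and the parameter ranges, are my assumptions for illustration. It also shows where the chosen metric ends up after the search.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, stands in for X_train / y_train).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# DecisionTreeClassifier is swapped in for XGBClassifier to keep the sketch dependency-free.
pipe = make_pipeline(StandardScaler(), DecisionTreeClassifier(random_state=0))
dists = {
    "decisiontreeclassifier__max_depth": [2, 3, 4, 5, 6],
    "decisiontreeclassifier__min_samples_leaf": [1, 5, 10],
}

search = RandomizedSearchCV(
    pipe,
    param_distributions=dists,
    n_iter=5,
    cv=3,
    scoring="roc_auc",   # candidates are ranked by cross-validated ROC AUC
    random_state=0,
)
search.fit(X, y)

print(search.best_params_)  # the sampled combination with the best mean score
print(search.best_score_)   # that mean score, in the metric given by scoring
```

Note that best_score_ is reported in whatever metric was passed as scoring, so it is a ROC AUC here and not an accuracy.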

What is roc_auc?

Receiver operating characteristic - Wikipedia (en.wikipedia.org)

It is the area under the ROC curve. The larger the value, the better the model predicts: 0.5 corresponds to random guessing and 1.0 to a perfect ranking.

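When the class labels of the target are imbalanced, roc_auc makes up for the weakness of accuracy. Because it is computed from the predicted scores across all thresholds rather than at one fixed decision threshold, it gives a more stable measure of how well the model separates the labels.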
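
The difference from accuracy is easy to see on a toy imbalanced example. The sketch below is plain Python with made-up scores; it uses the fact that ROC AUC equals the probability that a randomly chosen positive is scored higher than a randomly chosen negative, with ties counting as one half. The helper names are hypothetical.

```python
def accuracy(y_true, scores, threshold=0.5):
    """Accuracy after turning scores into labels at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == t for p, t in zip(preds, y_true)) / len(y_true)

def roc_auc(y_true, scores):
    """ROC AUC via pairwise comparison of positive and negative scores."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 95 negatives and 5 positives: a heavily imbalanced target (made-up data).
y_true = [0] * 95 + [1] * 5

# Model A gives every sample the same score, so it has learned nothing.
constant_scores = [0.1] * 100
# Model B ranks every positive above every negative, but all scores stay below 0.5.
ranked_scores = [0.1] * 95 + [0.4] * 5

print(accuracy(y_true, constant_scores), roc_auc(y_true, constant_scores))
print(accuracy(y_true, ranked_scores), roc_auc(y_true, ranked_scores))
```

At the 0.5 threshold both models predict only the majority class and reach the same accuracy, while roc_auc separates the uninformative model from the one that ranks the positives correctly.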

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
model = LogisticRegression().fit(X, y)
roc_auc_score(y, model.predict_proba(X)[:, 1])  # predicted probability of the target's second (positive) class

It can also be computed this way.

License: Attribution, NonCommercial, NoDerivatives
