1. So far

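So far, we have evaluated models with hold-out validation, splitting the data into a training set and a validation set. The problem with this approach is that the held-out validation data is never used for training, so the model loses part of the dataset it could have learned from.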
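As a minimal sketch of the hold-out split described above, the snippet below uses toy arrays (`X_toy` and `y_toy` are made up for illustration, not the house price data) to show how many rows are set aside and never seen during training.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data for illustration only: 20 rows, 1 feature
X_toy = np.arange(20).reshape(-1, 1)
y_toy = np.arange(20)

# Hold out 20% of the rows for validation; the model never trains on them
X_train, X_valid, y_train, y_valid = train_test_split(
    X_toy, y_toy, train_size=0.8, test_size=0.2, random_state=0
)

print("training rows  :", len(X_train))
print("validation rows:", len(X_valid))
```

With an 80/20 split, one fifth of the rows contribute nothing to fitting the model, which is the loss that cross-validation tries to avoid.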

2. What is cross-validation ?

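Cross-validity (cross-validation) is a term used in mathematics, statistics, and the sciences for the degree to which the relationship between predictor and criterion variables stays consistent across two independent samples drawn from the same population. In machine learning, K-Fold cross-validation splits the dataset into folds: if the dataset is divided into 5 parts, 1 part serves as validation data and the other 4 as training data for fitting and evaluating the model. Whereas hold-out validation trains once and stops, K-Fold moves the validation fold around, trains 5 times, and averages the scores. As a result, every data point takes part in training, and each one is used for validation exactly once.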

Experiment	Datasets1	Datasets2	Datasets3
Ex1	Validation	Training	Training
Ex2	Training	Validation	Training
Ex3	Training	Training	Validation

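In the first run, Datasets1 is set as the validation data and the remaining data is used as training data.
In the second run, Datasets2 is set as the validation data and the remaining data is used as training data.
This process is repeated once per fold (the table above illustrates the rotation with three folds; with 5 folds it runs 5 times), and the prediction score of each run is returned as the result.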
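The fold rotation described above can be sketched with scikit-learn's `KFold`. This is an illustration on a toy array of 10 samples (`X_toy` is a made-up name, not the house price data), printing which indices are used for training and validation in each run.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data for illustration only: 10 samples, so each of the 5 folds holds 2
X_toy = np.arange(10).reshape(-1, 1)

kf = KFold(n_splits=5, shuffle=False)
validation_folds = []
for i, (train_idx, val_idx) in enumerate(kf.split(X_toy), start=1):
    validation_folds.append(val_idx.tolist())
    print(f"Ex{i}: train={train_idx.tolist()} validation={val_idx.tolist()}")

# Collect all validation indices to check that every sample is validated once
all_val = sorted(idx for fold in validation_folds for idx in fold)
print("validated samples:", all_val)
```

Each sample lands in exactly one validation fold and in the training set of the other four runs.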

3. When should we use cross-validation ?

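Cross-validation gives a more accurate measure of model quality than a single hold-out split. The trade-off is that K-Fold CV takes longer to run, because it fits one model per fold.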

For small datasets, where the extra computational burden isn't a big deal, you should run cross-validation.
For larger datasets, a single validation set is sufficient. Your code will run faster, and you may have enough data that there's little need to re-use some of it for holdout.

# Machine learning workflow

# Preprocessing 

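import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder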

# Read the data

data = pd.read_csv('../../KAGGLE/Kaggle_House_Price/train.csv')

# Separate target from predictors

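y = data.SalePrice
X = data.drop(['SalePrice'], axis = 1)

# Select numerical and categorical columns
# (assumed selection by dtype; the original post does not show how these were defined)
numerical_cols = X.select_dtypes(include = 'number').columns.tolist()
categorical_cols = X.select_dtypes(exclude = 'number').columns.tolist()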

# Preprocessing for numerical data

numerical_transformer = SimpleImputer(strategy = 'constant')

# Preprocessing for categorical data

categorical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])

# Bundle preprocessing for numerical and categorical data

preprocessor = ColumnTransformer(
    transformers = [
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Build the pipeline

my_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                                ('model', RandomForestRegressor(n_estimators = 100, random_state = 0))])

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates negative MAE

scores = -1 * cross_val_score(my_pipeline, X, y, cv = 5, scoring = 'neg_mean_absolute_error')

print("MAE scores : \n", scores)

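Note that the scoring passed to cross_val_score here is the negative mean absolute error, not MAE itself. scikit-learn follows the convention that a higher score is always better, so error metrics are reported with a negative sign, which is why the result is multiplied by -1 to recover MAE.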
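To make the sign convention concrete, here is a small sketch on synthetic data. The arrays and the `LinearRegression` model are stand-ins chosen for illustration, not the pipeline from this post. It shows that the raw scores come back non-positive and that multiplying by -1 recovers MAE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic regression data for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Raw scores use the "higher is better" convention, so MAE is negated
raw_scores = cross_val_score(
    LinearRegression(), X_toy, y_toy, cv=5, scoring='neg_mean_absolute_error'
)

# Flip the sign to get ordinary (non-negative) MAE values, one per fold
mae_scores = -1 * raw_scores

print("raw scores :", raw_scores)
print("MAE scores :", mae_scores)
print("average MAE:", mae_scores.mean())
```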

4. Exercise : Cross-Validation

Step 1 : Write a useful function

def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of random forest model.

    Keyword argument:
    n_estimators -- the number of trees in the forest
    """
    # Assumes X holds only numeric columns (as in the Kaggle exercise),
    # since SimpleImputer() defaults to mean imputation
    my_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
    ])

    scores = -1 * cross_val_score(my_pipeline, X, y, cv = 3, scoring = 'neg_mean_absolute_error')
    return scores.mean()

Step 2 : Test different parameter values

# Evaluate n_estimators = 50, 100, ..., 400
results = {n_estimators : get_score(n_estimators) for n_estimators in range(50, 450, 50)}

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(list(results.keys()), list(results.values()))
plt.show()

Step 3 : Find the best parameter value

n_estimators_best = 200
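Rather than reading the best value off the plot, it can also be picked programmatically. The sketch below uses a made-up dictionary (`example_results` holds illustrative numbers, not scores from this post) with the same shape as `results`, mapping n_estimators to average MAE.

```python
# Made-up scores for illustration only; real values come from get_score
example_results = {50: 3.0, 100: 2.6, 150: 2.3, 200: 2.1, 250: 2.2}

# Lower MAE is better, so take the key with the smallest value
n_estimators_best_example = min(example_results, key=example_results.get)
print(n_estimators_best_example)  # prints 200
```

Applying the same `min(..., key=...)` call to `results` would return the n_estimators value with the lowest average MAE.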

Source of the course : Kaggle Course _ Cross-Validation
