1. What is an ensemble?
Machine learning ensemble techniques
Ensemble learning is a technique that builds multiple classifiers and combines their predictions to arrive at a more accurate prediction. Instead of using a single strong model, it combines several weaker models to help produce more accurate predictions. On the open data platform Kaggle, ensemble algorithms such as XGBoost and LightGBM are popular as the leading machine learning algorithms.
Types of ensemble learning
- Voting
  - Multiple classifiers together decide the final prediction
  - Combines several different algorithms
- Bagging
  - Trains models on bootstrap samples of the data and aggregates the results
  - All classifiers are based on the same type of algorithm
  - Sampling is done with replacement, so duplicates are allowed when splitting the data
  - Effective at preventing overfitting
  - A representative example is the random forest algorithm (a rough scikit-learn sketch of voting and bagging follows this list)
- Boosting
  - Multiple classifiers are trained sequentially
  - Training and prediction proceed by giving the next classifier higher weights on the data the previous classifier predicted incorrectly, so that it can get them right
  - Training continues by repeatedly boosting these weights from classifier to classifier
  - Excellent predictive performance
  - Slower to train, and overfitting is possible
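As a rough illustration of voting and bagging (not from the course; the dataset, estimators, and parameters below are illustrative assumptions):
# Minimal voting/bagging sketch with scikit-learn (illustrative assumptions, not course code)
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Voting: different algorithms, predictions combined by a (soft) vote
voting = VotingClassifier(estimators=[("lr", LogisticRegression(max_iter=1000)),
                                      ("knn", KNeighborsClassifier())],
                          voting="soft")
print("voting accuracy :", voting.fit(X_tr, y_tr).score(X_te, y_te))

# Bagging: many classifiers of the same type, each trained on a bootstrap sample
forest = RandomForestClassifier(n_estimators=100, random_state=0)
print("bagging accuracy:", forest.fit(X_tr, y_tr).score(X_te, y_te))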
2. Gradient Boosting
Gradient Boosting is a method that goes through cycles to iteratively add models into an ensemble. It begins by initializing the ensemble with a single model, whose predictions can be pretty naive. (Even if its predictions are wildly inaccurate, subsequent additions to the ensemble will address those errors.)
- First, we use the current ensemble to generate predictions for each observation in the dataset. To make a prediction, we add the predictions from all models in the ensemble.
- These predictions are used to calculate a loss function (such as mean squared error).
- Then, we use the loss function to fit a new model that will be added to the ensemble. Specifically, we determine model parameters so that adding this new model to the ensemble will reduce the loss.
- Finally, we add the new model to the ensemble, and ...
- Repeat the cycle (a minimal hand-rolled sketch of these steps follows this list).
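A minimal hand-rolled sketch of this cycle for squared-error loss, where each new tree is fit to the current residuals; the function names and tree settings here are illustrative assumptions, not the library's implementation:
# Hand-rolled gradient boosting for squared-error loss (illustrative sketch only)
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_rounds=100):
    base = y.mean()                      # naive initial model: predict the mean everywhere
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred             # for squared error, the residuals are the (negative) gradient of the loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
        pred += tree.predict(X)          # add the new model's predictions to the ensemble
        trees.append(tree)
    return base, trees

def predict_gbm(base, trees, X):
    return base + sum(tree.predict(X) for tree in trees)
In practice, each tree's contribution is usually scaled by a small number before being added; that is the learning_rate parameter discussed in the Parameter Tuning section below.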
3. Example
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Define and fit the model with default parameters
my_model = XGBRegressor()
my_model.fit(X_train, y_train)

# Generate predictions on the validation set and evaluate them with MAE
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(predictions, y_valid)))
4. Parameter Tuning
XGBoost has a few parameters that can dramatically affect accuracy and training speed.
- n_estimators
  - n_estimators specifies how many times to go through the modeling cycle described above.
  - Too low a value causes underfitting
  - Too high a value causes overfitting
- early_stopping_rounds
  - early_stopping_rounds offers a way to automatically find the ideal value for n_estimators. Early stopping causes the model to stop iterating when the validation score stops improving, even if we aren't at the hard stop for n_estimators.
  - Since random chance sometimes causes a single round where validation scores don't improve, we need to specify a number for how many rounds of straight deterioration to allow before stopping (a short sketch of reading back the chosen round follows this list).
- learning_rate
  - Instead of getting predictions by simply adding up the predictions from each component model, we can multiply the predictions from each model by a small number before adding them in.
  - This means each tree we add to the ensemble helps us less. So, we can set a higher value for n_estimators without overfitting.
  - In general, a small learning rate and a large number of estimators will yield more accurate XGBoost models, though it will also take the model longer to train since it does more iterations through the cycle.
- n_jobs
  - It's common to set the parameter n_jobs equal to the number of cores on your machine; on larger datasets this parallelism can speed up fitting.
  - The resulting model won't be any better, so micro-optimizing for fitting time is typically nothing but a distraction.
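To see what early stopping actually picked, the chosen round can be read back from the fitted model. A small sketch; the best_iteration attribute comes from the XGBoost scikit-learn wrapper, and both its exact location and the fit() call signature can differ between XGBoost versions:
# Let early stopping choose where to stop, then inspect the chosen round
model = XGBRegressor(n_estimators=500, learning_rate=0.05)
model.fit(X_train, y_train,
          early_stopping_rounds=5,
          eval_set=[(X_valid, y_valid)],
          verbose=False)
print(model.best_iteration)  # round with the best validation score (attribute name is version-dependent)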
5. Example
# Tuned model: more trees, a smaller learning rate, and parallel fitting
my_model = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4)
my_model.fit(X_train, y_train,
             early_stopping_rounds=5,        # stop after 5 straight rounds with no validation improvement
             eval_set=[(X_valid, y_valid)],  # validation data monitored for early stopping
             verbose=False)
# Note: newer XGBoost releases may expect early_stopping_rounds in the XGBRegressor constructor rather than in fit()
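As a quick follow-up, the tuned model can be evaluated the same way as the first example, assuming the same validation split:
predictions = my_model.predict(X_valid)
print("Mean Absolute Error: " + str(mean_absolute_error(y_valid, predictions)))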
6. Exercise
Step 1: Build model
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Define the model
my_model_1 = XGBRegressor(random_state=0) # Your code here

# Fit the model
my_model_1.fit(X_train, y_train) # Your code here

# Get predictions
predictions_1 = my_model_1.predict(X_valid)

# Calculate MAE
mae_1 = mean_absolute_error(predictions_1, y_valid)
Step 2: Improve the model
# Define the model
my_model_2 = XGBRegressor(n_estimators=1000, learning_rate=0.05, n_jobs=4) # Your code here

# Fit the model
my_model_2.fit(X_train, y_train,
               early_stopping_rounds=5,
               eval_set=[(X_valid, y_valid)],
               verbose=False) # Your code here

# Get predictions
predictions_2 = my_model_2.predict(X_valid) # Your code here

# Calculate MAE
mae_2 = mean_absolute_error(predictions_2, y_valid) # Your code here

# Print MAE
print("Mean Absolute Error:", mae_2)
Step 3: Break the model
# Define the model (deliberately worse: fewer trees and a much larger learning rate)
my_model_3 = XGBRegressor(n_estimators=100, learning_rate=1, n_jobs=2)

# Fit the model
my_model_3.fit(X_train, y_train,
               early_stopping_rounds=5,
               eval_set=[(X_valid, y_valid)],
               verbose=False) # Your code here

# Get predictions
predictions_3 = my_model_3.predict(X_valid)

# Calculate MAE
mae_3 = mean_absolute_error(predictions_3, y_valid)

# Print MAE
print("Mean Absolute Error:", mae_3)
Source of the course: Kaggle Course, XGBoost