1. What are Pipelines?
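A pipeline is a simple way to keep data preprocessing and modeling code organized. Specifically, a pipeline bundles the preprocessing and modeling steps so that the whole bundle can be used as if it were a single step. Many data scientists work without pipelines, but pipelines offer some important benefits:
- Cleaner code
- Fewer bugs
- Improved productivity
- Help with model evaluation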
2. Steps of Pipelines
Step 1 : Define the preprocessing steps
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
# numerical_cols and categorical_cols are lists of column names defined beforehand
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
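With Pipeline, you can plug in whichever Scaler, OneHotEncoder, SimpleImputer, and so on you want for each type of variable. Pipelines can also be nested: apply a Pipeline to each group of variables, bundle the results into preprocessor, and then put preprocessor and model into another Pipeline to make the code even simpler.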
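The snippet above assumes that numerical_cols and categorical_cols already exist. As a minimal sketch of how the nested preprocessing could be run end to end, the example below builds a tiny hypothetical DataFrame (the column names and values are made up for illustration) and derives the two column lists from the dtypes. Selecting columns by dtype is just one possible choice, not necessarily what the course does.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with missing values in both column types.
X_toy = pd.DataFrame({
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
    "YearBuilt": [2003.0, 1976.0, np.nan, 1915.0],
    "Street": pd.Series(["Pave", "Grvl", np.nan, "Pave"], dtype=object),
})

# Assumption: numeric dtypes are "numerical", everything else is "categorical".
numerical_cols = X_toy.select_dtypes(include="number").columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude="number").columns.tolist()

# Inner pipeline: impute first, then one-hot encode.
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore")),
])

# Outer bundle: route each column group to its own transformer.
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), numerical_cols),
    ("cat", categorical_transformer, categorical_cols),
])

X_prepared = preprocessor.fit_transform(X_toy)
print(X_prepared.shape)
```

The two numeric columns pass through imputation unchanged in count, and the single categorical column expands into one column per category seen during fitting.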
Pipeline
- sklearn.pipeline.Pipeline(steps, *, memory = None, verbose = False)
- Parameters
- steps : list of tuple
- Methods
- fit : Fit the model
- fit_predict : Transform the data, and apply fit_predict with the final estimator
- fit_transform : Fit the model and transform with the final estimator
- get_params : Get parameters for this estimator
- predict : Transform the data, and apply predict with the final estimator
- score : Transform the data, and apply score with the final estimator
>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
... random_state=0)
>>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
>>> # The pipeline can be used as any other estimator
>>> # and avoids leaking the test set into the train set
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
>>> pipe.score(X_test, y_test)
0.88
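Beyond fit, predict, and score, the step names given in steps are what make a pipeline's internals addressable. The following is a small sketch, reusing the same StandardScaler and SVC steps as the example above, of how individual steps and their parameters can be reached. The value C=10.0 is an arbitrary choice for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(random_state=0)
pipe = Pipeline(steps=[("scaler", StandardScaler()), ("svc", SVC())])

# Each step can be looked up by the name it was given in `steps`.
scaler = pipe.named_steps["scaler"]

# Nested parameters follow the "<step name>__<parameter>" convention.
pipe.set_params(svc__C=10.0)
params = pipe.get_params()
print(params["svc__C"])

# The pipeline still behaves like a single estimator.
pipe.fit(X, y)
labels = pipe.predict(X[:5])
```

The same double-underscore names are what tools such as GridSearchCV expect when tuning a step inside a pipeline.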
Step 2 : Define the model
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor(n_estimators = 100, random_state = 0)
Step 3 : Create and evaluate the pipeline
- Once the pipeline has been built, it can be applied quickly to both the validation data and the data to be predicted
import pandas as pd
from sklearn.metrics import mean_absolute_error
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print("MAE : ", score)
# Final process for prediction of test data
preds_test = my_pipeline.predict(X_test)
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
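Because preprocessing and model live in one object, the pipeline can also be handed directly to scikit-learn's model-selection helpers. The sketch below is an illustration, not part of the course code: it runs cross_val_score on synthetic regression data with artificially injected missing values, so the imputer is refit on each training fold. The data, the median strategy, and the fold count are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data standing in for a real dataset (assumption for illustration).
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan  # knock out roughly 10% of the values

cv_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negated MAE, so flip the sign back.
scores = -cross_val_score(cv_pipeline, X, y, cv=5,
                          scoring="neg_mean_absolute_error")
print("MAE per fold:", scores)
print("Mean MAE:", scores.mean())
```

Doing the same thing without a pipeline would require repeating the imputation by hand inside every fold.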
3. Exercise : Pipelines
Step 1 : Improve the performance
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='most_frequent')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                              ])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
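One way to approach "improve the performance" is to treat the imputation strategy as something to compare rather than guess. The sketch below is only an illustration on synthetic data (the dataset, the split, and the candidate list are assumptions): it fits one pipeline per SimpleImputer strategy and records the validation MAE of each.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic numeric data with injected missing values (assumption for illustration).
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, train_size=0.8, random_state=0)

results = {}
for strategy in ["mean", "median", "most_frequent", "constant"]:
    # A fresh pipeline per strategy keeps the comparisons independent.
    candidate = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy=strategy)),
        ("model", RandomForestRegressor(n_estimators=100, random_state=0)),
    ])
    candidate.fit(X_train, y_train)
    results[strategy] = mean_absolute_error(y_valid, candidate.predict(X_valid))

best_strategy = min(results, key=results.get)
print(results)
print("Lowest validation MAE:", best_strategy)
```

Which strategy wins depends on the data, so the ranking on this toy set says nothing about the course dataset.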
Step 2 : Generate test predictions
# Preprocessing of test data, get predictions (the pipeline is already fitted)
preds_test = my_pipeline.predict(X_test)
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
Source of the course : Kaggle Course _ Pipelines