1. What are Pipelines?

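A pipeline is a simple way to keep data preprocessing and modeling code organized. Specifically, it bundles the preprocessing and modeling steps so that the whole bundle can be used as if it were a single step. Many data scientists work without pipelines, but pipelines offer some important benefits: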

Cleaner code
Fewer bugs
Improved productivity
Help with model evaluation
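
The last benefit can be illustrated with a minimal sketch. Because a pipeline behaves like a single estimator, it can be passed directly to cross_val_score, which refits the preprocessing on each training fold. The synthetic data from make_regression, the injected missing values, and the choice of 5 folds are assumptions made for illustration and are not part of the course material.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (assumption): 200 rows, 5 numeric features
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

# Blank out roughly 10% of the values so the imputer has work to do
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0)),
])

# scikit-learn reports negated MAE, so flip the sign back
scores = -cross_val_score(pipe, X, y, cv=5, scoring='neg_mean_absolute_error')
print('MAE per fold:', scores)
```

Since the imputer lives inside the pipeline, its statistics are learned only from the training part of each fold.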

2. Steps of Pipelines

Step 1 : Define the preprocessing steps

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data

numerical_transformer = SimpleImputer(strategy = 'constant')

# Preprocessing for categorical data

categorical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])

# Bundle preprocessing for numerical and categorical data
# (numerical_cols and categorical_cols are lists of column names, assumed to be defined earlier)

preprocessor = ColumnTransformer(
    transformers = [
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

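With a Pipeline you can choose the Scaler, OneHotEncoder, SimpleImputer, and so on that you want to apply to each type of variable. Pipelines can also be nested: after defining a Pipeline per variable type and bundling them in preprocessor, you can put preprocessor and model into another Pipeline, which makes the code even more concise.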
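
The per-type preprocessing described above can be tried on its own with a small sketch. The toy DataFrame and its column names (LotArea, Street) are hypothetical, and splitting the columns with select_dtypes is just one possible way to build numerical_cols and categorical_cols.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical columns) with one missing value per column
X_toy = pd.DataFrame({
    'LotArea': [8450.0, np.nan, 11250.0, 9550.0],
    'Street': pd.Series(['Pave', 'Grvl', np.nan, 'Pave'], dtype='object'),
})

# Split the column names by type
numerical_cols = X_toy.select_dtypes(include='number').columns.tolist()
categorical_cols = X_toy.select_dtypes(exclude='number').columns.tolist()

# Numerical columns: fill missing values with a constant (0 by default)
numerical_transformer = SimpleImputer(strategy='constant')

# Categorical columns: fill with the most frequent value, then one-hot encode
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore')),
])

preprocessor = ColumnTransformer(transformers=[
    ('num', numerical_transformer, numerical_cols),
    ('cat', categorical_transformer, categorical_cols),
])

# One numerical column plus one column per Street category
Xt = preprocessor.fit_transform(X_toy)
print(Xt.shape)
```

Each transformer only sees the columns assigned to it, and the results are concatenated in the order the transformers are listed.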

Pipeline

sklearn.pipeline.Pipeline(steps, *, memory = None, verbose = False)
Parameters
- steps : list of tuple
Methods
- fit : Fit the model
- fit_predict : Transform the data, and apply fit_predict with the final estimator
- fit_transform : Fit the model and transform with the final estimator
- get_params : Get parameters for this estimator
- predict : Transform the data, and apply predict with the final estimator
- score : Transform the data, and apply score with the final estimator

>>> from sklearn.svm import SVC
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.datasets import make_classification
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.pipeline import Pipeline
>>> X, y = make_classification(random_state=0)
>>> X_train, X_test, y_train, y_test = train_test_split(X, y,
...                                                     random_state=0)
>>> pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
>>> # The pipeline can be used as any other estimator
>>> # and avoids leaking the test set into the train set
>>> pipe.fit(X_train, y_train)
Pipeline(steps=[('scaler', StandardScaler()), ('svc', SVC())])
>>> pipe.score(X_test, y_test)
0.88
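
As a small sketch of the get_params method listed above: parameters of a step are addressed with the '<step name>__<parameter>' convention, and the same names work with set_params. The step names 'scaler' and 'svc' simply mirror the example above.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])

# Individual steps are reachable by the name given in the tuple
svc_step = pipe.named_steps['svc']

# Read a nested parameter, then change it through the pipeline
default_C = pipe.get_params()['svc__C']
pipe.set_params(svc__C=10)
print(default_C, '->', pipe.named_steps['svc'].C)
```

The same naming convention is what parameter searches use when tuning a step inside a pipeline.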

Step 2 : Define the model

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor(n_estimators = 100, random_state = 0)

Step 3 : Create and evaluate the pipeline

Once a pipeline has been built, it can be applied quickly to both the validation data and the test data.

import pandas as pd
from sklearn.metrics import mean_absolute_error

# Bundle preprocessing and modeling code in a pipeline

my_pipeline = Pipeline(steps = [('preprocessor', preprocessor),
                               ('model', model)])

# Preprocessing of training data, fit model

my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions

preds = my_pipeline.predict(X_valid)

# Evaluate the model

score = mean_absolute_error(y_valid, preds)
print("MAE : ", score)

# Final process for prediction of test data 

preds_test = my_pipeline.predict(X_test)

output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

3. Exercise : Pipelines

Step 1 : Improve the performance

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy = 'most_frequent')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps = [
    ('imputer', SimpleImputer(strategy = 'most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown = 'ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators = 100, random_state = 0)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
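
One hedged way to look for an improvement is to rebuild the pipeline with different numerical imputation strategies and compare the validation MAE. The sketch below uses a synthetic stand-in for the housing data; the columns area, rooms, and zone and the helper score_strategy are hypothetical, so the numbers it prints say nothing about the actual competition data.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in data (hypothetical columns)
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    'area': rng.normal(100.0, 20.0, n),
    'rooms': rng.integers(1, 6, n).astype(float),
    'zone': pd.Series(rng.choice(['A', 'B', 'C'], n), dtype='object'),
})
y = 3.0 * X['area'] + 10.0 * X['rooms'] + rng.normal(0.0, 5.0, n)

# Introduce missing numerical values after the target has been computed
X.loc[rng.random(n) < 0.15, 'area'] = np.nan

X_tr, X_va, y_tr, y_va = train_test_split(X, y, train_size=0.8, random_state=0)

def score_strategy(strategy):
    """Fit a fresh pipeline with the given numerical imputation strategy."""
    preprocessor = ColumnTransformer(transformers=[
        ('num', SimpleImputer(strategy=strategy), ['area', 'rooms']),
        ('cat', Pipeline(steps=[
            ('imputer', SimpleImputer(strategy='most_frequent')),
            ('onehot', OneHotEncoder(handle_unknown='ignore')),
        ]), ['zone']),
    ])
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', RandomForestRegressor(n_estimators=100, random_state=0)),
    ])
    pipeline.fit(X_tr, y_tr)
    return mean_absolute_error(y_va, pipeline.predict(X_va))

results = {s: score_strategy(s)
           for s in ('constant', 'mean', 'median', 'most_frequent')}
for s, mae in results.items():
    print(f'{s:>13}: MAE = {mae:.2f}')
```

Because only the strategy argument changes between runs, any difference in MAE comes from the preprocessing choice rather than from the model.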

Step 2 : Generate test predictions

# Preprocessing of test data, get predictions
preds_test = my_pipeline.predict(X_test)

# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

Source of the course : Kaggle Course, Pipelines (www.kaggle.com)
