1. What is Data Leakage?
Data leakage (or leakage) happens when your training data contains information about the target, but similar data will not be available when the model is used for prediction. This leads to high performance on the training set (and possibly even the validation data), but the model will perform poorly in production.
In other words, leakage causes a model to look accurate until you start making decisions with it, and then the model becomes very inaccurate. There are two main types of leakage: target leakage and train-test contamination.
2. Target leakage
Target leakage occurs when your predictors include data that will not be available at the time you make predictions. It is important to think about target leakage in terms of the timing or chronological order in which data becomes available, not merely whether a feature helps make good predictions. Consider, for example, predicting who will get pneumonia from data like this:
got_pneumonia | age | weight | male | took_antibiotic_medicine | ... |
---|---|---|---|---|---|
False | 65 | 100 | False | False | ... |
False | 72 | 130 | True | False | ... |
True | 58 | 100 | False | True | ... |
People take antibiotic medicines after getting pneumonia in order to recover. The raw data shows a strong relationship between those columns, but took_antibiotic_medicine is frequently changed after the value for got_pneumonia is determined.
The model would see that anyone with a value of False for took_antibiotic_medicine didn't have pneumonia. Since validation data comes from the same source as the training data, the pattern will repeat itself in validation, and the model will get a great validation score.
But the model will be very inaccurate when subsequently deployed in the real world, because even patients who will get pneumonia won't have received antibiotics yet when we need to make predictions about their future health.
To prevent this type of data leakage, any variable updated after the target value is realized should be excluded.
In short, variables that are created or changed after the target is determined should not be used as features.
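The exclusion rule above can be sketched with pandas. This is a minimal illustration, not the course's own code: the DataFrame simply mirrors the three rows of the table, and the `leaky_columns` list is an assumption that has to come from domain knowledge about when each column is recorded.

```python
import pandas as pd

# Toy data mirroring the table above; the values are illustrative only.
df = pd.DataFrame({
    "got_pneumonia": [False, False, True],
    "age": [65, 72, 58],
    "weight": [100, 130, 100],
    "male": [False, True, False],
    "took_antibiotic_medicine": [False, False, True],
})

target = "got_pneumonia"

# Hypothetical list: columns recorded or updated after the target is known.
# It comes from knowing how the data was collected, not from the data itself.
leaky_columns = ["took_antibiotic_medicine"]

# Keep only the features that would exist at prediction time.
X = df.drop(columns=[target] + leaky_columns)
y = df[target]

# A leaky feature often tracks the target almost perfectly.
agreement = (df["took_antibiotic_medicine"] == df[target]).mean()
print(list(X.columns), agreement)
```

A feature that agrees with the target this closely is worth a second look, but a high agreement alone does not prove leakage. The timing of when the column is filled in is what decides it.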
3. Train-Test contamination
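This kind of leakage arises when the training data and the validation data are not kept properly separate. Recall that validation is meant to measure how the model performs on data it has not seen before. If the validation data affects the preprocessing, this measurement can be subtly corrupted. For example, imagine running a preprocessing step (such as fitting an imputer for missing values) before calling train_test_split(). The result? The validation score will look good, but performance after deployment will be poor.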
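The ordering problem can be sketched with scikit-learn's SimpleImputer. The data here is synthetic and the variable names are made up for illustration. The point is only where `fit` is called relative to `train_test_split()`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 20% of the values missing (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[rng.random(X.shape) < 0.2] = np.nan
y = rng.integers(0, 2, size=100)

# Leaky order: the imputer is fit on all rows, so its column means
# include rows that later end up in the validation set.
leaky_imputer = SimpleImputer(strategy="mean").fit(X)

# Safe order: split first, then fit the imputer on the training rows only.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0)
safe_imputer = SimpleImputer(strategy="mean").fit(X_train)

X_train_imp = safe_imputer.transform(X_train)
X_valid_imp = safe_imputer.transform(X_valid)  # filled with training means
```

With mean imputation on random data the difference between the two imputers is small, but the same mistake with scaling, target encoding, or feature selection can inflate validation scores noticeably.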
If your validation set comes from a simple train-test split, exclude the validation data from every kind of fitting, including the fitting of preprocessing steps. Scikit-learn pipelines make this easier. When using cross-validation, it is even more important to do the preprocessing inside the pipeline.
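In short, preprocessing should be fit only on the training data, after the train-test split, and the fitted steps are then applied to the validation data. For cross-validation, the correct approach is to build a Pipeline so that preprocessing is carried out inside it.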
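A minimal sketch of preprocessing inside a pipeline during cross-validation follows. The synthetic data, the median imputer, and the random forest are assumptions chosen for illustration. What matters is that `cross_val_score` refits the whole pipeline on each training fold, so the imputer never sees the held-out fold.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (illustrative only); the target is computed before
# values are knocked out, so it does not depend on the missingness.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[rng.random(X.shape) < 0.1] = np.nan

# Preprocessing and model bundled together as one estimator.
pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("model", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Each of the 5 folds fits the imputer and the model on its training part only.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")
print(scores.mean())
```

Imputing `X` once up front and then cross-validating only the model would let every fold's imputer statistics include its own held-out rows, which is the contamination described above.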
Source of the course: Kaggle Course, Data Leakage