1. Basic Blueprint of ML

The following is a summary of the code covered in the Intro to ML course.

# Machine learning workflow

# Preprocessing 

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data

X_full = pd.read_csv('../../KAGGLE/Kaggle_House_Price/train.csv', index_col='Id')
X_test_full = pd.read_csv('../../KAGGLE/Kaggle_House_Price/test.csv', index_col='Id')

# Obtain target and predictors

y = X_full.SalePrice

features = ['LotArea', 'YearBuilt', '1stFlrSF', '2ndFlrSF', 'FullBath', 'BedroomAbvGr', 'TotRmsAbvGrd']
X = X_full[features].copy()
X_test = X_test_full[features].copy()

# Break off validation set from training data

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)

# Model comparison (manual search over a few settings)

from sklearn.ensemble import RandomForestRegressor

# Define the models

model_1 = RandomForestRegressor(n_estimators=50, random_state=0)
model_2 = RandomForestRegressor(n_estimators=100, random_state=0)
# 'absolute_error' is the MAE criterion (named 'mae' before scikit-learn 1.0)
model_3 = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)
model_4 = RandomForestRegressor(n_estimators=200, min_samples_split=20, random_state=0)
model_5 = RandomForestRegressor(n_estimators=100, max_depth=7, random_state=0)

models = [model_1, model_2, model_3, model_4, model_5]
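The candidates above are defined but never scored. One way to compare them is to fit each on the training split and compute the mean absolute error (MAE) on the validation split. The following is only a minimal sketch: the helper name `score_model`, the dictionary of candidates, and the synthetic data are assumptions made so the snippet runs on its own. With the real data, the `X_train`, `X_valid`, `y_train`, `y_valid` from the split above would be passed instead.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house price data (assumption, for illustration only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
y_demo = 3 * X_demo["a"] - 2 * X_demo["b"] + rng.normal(scale=0.1, size=200)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0
)

# Hypothetical helper: fit one model and return its validation MAE
def score_model(model, X_t, X_v, y_t, y_v):
    model.fit(X_t, y_t)
    preds = model.predict(X_v)
    return mean_absolute_error(y_v, preds)

candidates = {
    "50 trees": RandomForestRegressor(n_estimators=50, random_state=0),
    "100 trees": RandomForestRegressor(n_estimators=100, random_state=0),
    "100 trees, max_depth=7": RandomForestRegressor(
        n_estimators=100, max_depth=7, random_state=0
    ),
}

# Score every candidate on the same validation split
scores = {
    name: score_model(model, X_tr, X_va, y_tr, y_va)
    for name, model in candidates.items()
}
for name, mae in scores.items():
    print(f"{name}: MAE = {mae:.3f}")

# The candidate with the lowest validation MAE is the one to refit on all data
best_name = min(scores, key=scores.get)
print("Lowest validation MAE:", best_name)
```

The model with the lowest validation MAE is then the natural choice for the "best model" that gets refit on the full training data.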

# Make best model 

# 'absolute_error' is the MAE criterion (named 'mae' before scikit-learn 1.0)
best_model = RandomForestRegressor(n_estimators=100, criterion='absolute_error', random_state=0)

# Fit the model to the training data

best_model.fit(X, y)

# Generate test predictions

preds_test = best_model.predict(X_test)

# Save predictions in format used for competition scoring

output = pd.DataFrame({'Id': X_test.index, 'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)

2. Submit result on Kaggle

Begin by clicking on the Save Version button in the top right corner of the window. This will generate a pop-up window.
Ensure that the Save and Run All option is selected, and then click on the Save button.
This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (...) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
Click on the Output tab on the right of the screen. Then, click on the file you would like to submit, and click on the Submit button to submit your results to the leaderboard.

3. Missing Values

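There are three approaches to handling missing values.

Drop columns with missing values

The simplest option is to drop every column that contains a missing value. This is easy, but unless most of a dropped column's values were missing, the model loses access to a lot of potentially useful information.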

Imputation

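Imputation fills in the missing values with some number. For example, we can fill in the mean of each column. The imputed value will not be exactly right in most cases, but because the rest of the column is kept, it usually leads to a more accurate model than dropping the column entirely.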
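As a small illustration of mean imputation, the sketch below fills one gap per column in a made-up table. The toy values and the names `toy`, `imputer`, and `filled` are assumptions for this example only; `SimpleImputer(strategy="mean")` is the scikit-learn imputer used later in this post.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): one missing value per column
toy = pd.DataFrame({
    "LotArea": [8000.0, np.nan, 10000.0],
    "FullBath": [1.0, 2.0, np.nan],
})

imputer = SimpleImputer(strategy="mean")
filled = pd.DataFrame(imputer.fit_transform(toy), columns=toy.columns)
print(filled)
# LotArea's gap becomes 9000.0 (mean of 8000 and 10000),
# FullBath's gap becomes 1.5 (mean of 1 and 2).
```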

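An Extension to Imputation

Imputation is the standard approach and usually works well. However, imputed values are necessarily less accurate than the real (uncollected) values, and rows with missing values may differ from the other rows in some systematic way. In this approach we impute as before and, for each column that had missing entries, add a new column recording which values were originally missing, so that the model can take that into account.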

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the data

data = pd.read_csv('../../KAGGLE/Kaggle_House_Price/train.csv')

# Select target

y = data.SalePrice

# To keep things simple, we'll use only numerical predictors

melb_predictors = data.drop(['SalePrice'], axis=1)
X = melb_predictors.select_dtypes(exclude=['object'])

# Divide data into training and validation subsets

X_train, X_valid, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)


from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Function for comparing different approaches

def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)

3.1 Dropping columns with missing values

# Drop columns 

# Get names of columns with missing values
cols_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Drop columns in training and validation data
reduced_X_train = X_train.drop(cols_with_missing, axis=1)
reduced_X_valid = X_valid.drop(cols_with_missing, axis=1)

print("MAE from Approach 1 (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

3.2 Imputing missing values

from sklearn.impute import SimpleImputer

# Imputation

my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE from Approach 2 (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))
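Wrapping the imputer output in `pd.DataFrame(...)` drops the column names and also resets the row index to 0, 1, 2, and so on. If the original index matters (for example an `Id` index), one option is to pass both `columns` and `index` to the constructor. This is a sketch on a made-up frame; the names `toy` and `imputed` are assumptions for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frame (made up for illustration) with a non-default index, like 'Id'
toy = pd.DataFrame(
    {
        "LotArea": [8000.0, np.nan, 10000.0],
        "YearBuilt": [1990.0, 2005.0, np.nan],
    },
    index=pd.Index([101, 205, 309], name="Id"),
)

imputer = SimpleImputer()  # default strategy is 'mean'

# Restore column names and the original index in one step
imputed = pd.DataFrame(
    imputer.fit_transform(toy), columns=toy.columns, index=toy.index
)
print(imputed)
```

The random forest itself only uses row positions, so the scores above are unaffected either way; keeping the index mainly helps when joining predictions back to the original rows.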

SimpleImputer

sklearn.impute.SimpleImputer(*, missing_values = nan, strategy = 'mean', fill_value = None, verbose = 0, copy = True, add_indicator = False)
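Parameters
- strategy : mean/median/most_frequent/constant
Methods
- fit() : learns the fill value for each column (for example the column mean) from the given training data
- fit_transform() : fits the imputer and transforms the same data in one call; apply in the order fit_transform(train) -> transform(valid/test)
- get_params() : returns the parameters of this estimator
- transform() : applies the already fitted imputer to new data such as the validation or test set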
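The sketch below illustrates the `strategy` options and the fit-on-train, transform-on-validation order on made-up data. The toy frames and the names `train`, `valid`, and `learned` are assumptions for this example; `statistics_` is the fitted attribute in which `SimpleImputer` stores the fill value of each column.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy splits (made up for illustration)
train = pd.DataFrame({"x": [1.0, 1.0, 3.0, np.nan, 11.0]})
valid = pd.DataFrame({"x": [np.nan, 100.0]})

learned = {}
for strategy in ["mean", "median", "most_frequent"]:
    imp = SimpleImputer(strategy=strategy)
    imp.fit(train)                       # fill value is learned from train only
    filled_valid = imp.transform(valid)  # the same value is reused on valid
    learned[strategy] = float(imp.statistics_[0])
    print(strategy, "-> fill value", learned[strategy],
          "| valid after transform:", filled_valid.ravel())
# mean -> 4.0, median -> 2.0, most_frequent -> 1.0
# The 100.0 in valid never influences the fill value.

# strategy="constant" fills with a value chosen via fill_value
const_imp = SimpleImputer(strategy="constant", fill_value=0)
const_filled = const_imp.fit_transform(train)
```

Fitting only on the training split keeps information from the validation or test data out of the preprocessing step.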

3.3 An extension to imputation

# Make copy to avoid changing original data (when imputing)

X_train_plus = X_train.copy()
X_valid_plus = X_valid.copy()

# Make new columns indicating what will be imputed

for col in cols_with_missing:
    X_train_plus[col + '_was_missing'] = X_train_plus[col].isnull()
    X_valid_plus[col + '_was_missing'] = X_valid_plus[col].isnull()

# Imputation

my_imputer = SimpleImputer()
imputed_X_train_plus = pd.DataFrame(my_imputer.fit_transform(X_train_plus))
imputed_X_valid_plus = pd.DataFrame(my_imputer.transform(X_valid_plus))

# Imputation removed column names; put them back

imputed_X_train_plus.columns = X_train_plus.columns
imputed_X_valid_plus.columns = X_valid_plus.columns

print("MAE from Approach 3 (An Extension to Imputation):")
print(score_dataset(imputed_X_train_plus, imputed_X_valid_plus, y_train, y_valid))
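The `_was_missing` columns above are built by hand. `SimpleImputer` also has an `add_indicator` parameter that appends similar indicator columns automatically. The sketch below shows it on made-up data; the toy frame and the names `flag_cols` and `result` are assumptions, and the `_was_missing` suffix is reused here only to mirror the code above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (made up for illustration): only column "b" has a missing value
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, np.nan, 30.0]})

imputer = SimpleImputer(strategy="mean", add_indicator=True)
values = imputer.fit_transform(toy)

# indicator_.features_ holds the positions of the columns
# that contained missing values when the imputer was fitted
flag_cols = [toy.columns[i] + "_was_missing" for i in imputer.indicator_.features_]
result = pd.DataFrame(values, columns=list(toy.columns) + flag_cols)
print(result)
```

Only columns that had missing values during `fit` get an indicator, which matches the manual loop over `cols_with_missing`.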

4. Exercise : Missing Values

Step 1 : Preliminary investigation

# Shape of training data (num_rows, num_columns)
print(X_train.shape)

# Number of missing values in each column of training data
missing_val_count_by_column = (X_train.isnull().sum())
print(missing_val_count_by_column[missing_val_count_by_column > 0])

# Fill in the line below: How many rows are in the training data?
num_rows = X_train.shape[0]

# Fill in the line below: How many columns in the training data
# have missing values?
num_cols_with_missing = len([col for col in X_train.columns if X_train[col].isnull().any()])

# Fill in the line below: How many missing entries are contained in 
# all of the training data?
tot_missing = X_train.isnull().sum().sum()

Step 2 : Drop columns with missing values

# Fill in the line below: get names of columns with missing values
col_with_missing = [col for col in X_train.columns if X_train[col].isnull().any()]

# Fill in the lines below: drop columns in training and validation data
reduced_X_train = X_train.drop(col_with_missing, axis = 1)
reduced_X_valid = X_valid.drop(col_with_missing, axis = 1)

print("MAE (Drop columns with missing values):")
print(score_dataset(reduced_X_train, reduced_X_valid, y_train, y_valid))

Step 3 : Imputation

from sklearn.impute import SimpleImputer

# Fill in the lines below: imputation
my_imputer = SimpleImputer()
imputed_X_train = pd.DataFrame(my_imputer.fit_transform(X_train))
imputed_X_valid = pd.DataFrame(my_imputer.transform(X_valid))

# Fill in the lines below: imputation removed column names; put them back
imputed_X_train.columns = X_train.columns
imputed_X_valid.columns = X_valid.columns

print("MAE (Imputation):")
print(score_dataset(imputed_X_train, imputed_X_valid, y_train, y_valid))

Step 4 : Generate test predictions

# Preprocessed training and validation features
final_imputer = SimpleImputer(strategy = "most_frequent")
final_X_train = pd.DataFrame(final_imputer.fit_transform(X_train))
final_X_valid = pd.DataFrame(final_imputer.transform(X_valid))

final_X_train.columns = X_train.columns
final_X_valid.columns = X_valid.columns

# Define and fit model
model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(final_X_train, y_train)

# Get validation predictions and MAE
preds_valid = model.predict(final_X_valid)
print("MAE (Your approach):")
print(mean_absolute_error(y_valid, preds_valid))

# Fill in the line below: preprocess test data
final_X_test = pd.DataFrame(final_imputer.transform(X_test))
final_X_test.columns = X_test.columns

# Fill in the line below: get test predictions
preds_test = model.predict(final_X_test)
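Step 4 uses `strategy="most_frequent"` for the final imputer. One way to make that choice less arbitrary is to score each strategy on the validation split and keep the one with the lowest MAE. The following is a sketch on synthetic data, not the course's solution: the data, the missing-value rate, and the names `X_demo`, `y_demo`, and `results` are all assumptions so the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 10% of entries missing (assumption, illustration only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.normal(size=(300, 4)), columns=list("abcd"))
y_demo = X_demo["a"] + 2 * X_demo["b"] + rng.normal(scale=0.1, size=300)
X_demo = X_demo.mask(rng.random(X_demo.shape) < 0.1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, train_size=0.8, test_size=0.2, random_state=0
)

results = {}
for strategy in ["mean", "median", "most_frequent"]:
    imputer = SimpleImputer(strategy=strategy)
    X_tr_imp = imputer.fit_transform(X_tr)  # fit on training data only
    X_va_imp = imputer.transform(X_va)      # reuse the fitted imputer
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr_imp, y_tr)
    results[strategy] = mean_absolute_error(y_va, model.predict(X_va_imp))

for strategy, mae in results.items():
    print(f"{strategy}: MAE = {mae:.3f}")
```

Which strategy wins depends on the dataset, so the comparison would need to be rerun on the actual house price data.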

Source of the course: Kaggle Course, Missing Values (www.kaggle.com)
