1. What is a Categorical Variable?
A categorical variable takes only a limited number of values. For example, if a survey asks how often you eat breakfast and the possible answers are "Never", "Rarely", "Most days", and "Everyday", the resulting column of answers is a categorical variable.
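A quick sketch of such a column, using the hypothetical breakfast survey above:

```python
import pandas as pd

# Hypothetical answers to "How often do you eat breakfast?"
responses = pd.Series(["Never", "Rarely", "Most days", "Everyday", "Rarely"])

# The column takes only a fixed set of values, so it is categorical
print(responses.unique())    # the four categories
print(responses.nunique())   # 4
```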
2. Handling Categorical Variables
There are four ways to handle categorical variables.
- Drop Categorical Variables
The simplest approach is to drop the categorical columns from the dataset. As with dropping columns that contain missing values, this approach is likely to throw away useful information.
- Ordinal Encoding
Ordinal encoding assigns each unique value to a different integer. This approach assumes an ordering of the categories, so it is used for ordinal variables. Unlike one-hot encoding, it does not create additional columns.
- One-Hot Encoding
One-hot encoding creates a new column for each unique value present in the original data. For example, if a "Color" variable has the categories "Red", "Yellow", and "Green", one-hot encoding creates one column named after each category; for each row, a 1 goes in the column matching that row's color and a 0 in the remaining columns.
- Pandas get_dummies()
This applies one-hot encoding much like OneHotEncoder, but pandas names the generated columns automatically in the form <column>_<value> (e.g. Color_Red), whereas OneHotEncoder returns an array without column names.
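The encoding approaches above can be sketched on the "Color" example (a hypothetical toy DataFrame, not the competition data; the category ordering is assumed):

```python
import pandas as pd

# Hypothetical toy data for the "Color" example above
df = pd.DataFrame({"Color": ["Red", "Yellow", "Green", "Red"]})

# Ordinal encoding: map each category to an integer (assumed ordering)
order = {"Green": 0, "Yellow": 1, "Red": 2}
ordinal = df["Color"].map(order)
print(ordinal.tolist())       # [2, 1, 0, 2]

# One-hot encoding via pandas: one 0/1 column per category,
# automatically named "<column>_<value>"
dummies = pd.get_dummies(df, columns=["Color"])
print(list(dummies.columns))  # ['Color_Green', 'Color_Red', 'Color_Yellow']
```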
# Prepare for processing
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the data
data = pd.read_csv('../../KAGGLE/Kaggle_House_Price/train.csv')
# Separate target from predictors
y = data.SalePrice
X = data.drop(['SalePrice'], axis = 1)
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2, random_state=0)
# Drop columns with missing values (simplest approach)
cols_with_missing = [col for col in X_train_full.columns if X_train_full[col].isnull().any()]
X_train_full.drop(cols_with_missing, axis=1, inplace=True)
X_valid_full.drop(cols_with_missing, axis=1, inplace=True)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
low_cardinality_cols = [cname for cname in X_train_full.columns
                        if X_train_full[cname].nunique() < 10 and X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = low_cardinality_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
# Get list of categorical variables
s = (X_train.dtypes == 'object')
object_cols = list(s[s].index)
print("Categorical variables:")
print(object_cols)
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Function for comparing different approaches
def score_dataset(X_train, X_valid, y_train, y_valid):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X_train, y_train)
    preds = model.predict(X_valid)
    return mean_absolute_error(y_valid, preds)
2.1 Drop Categorical Variables
# Score from Approach 1 (Drop Categorical Variables)
drop_X_train = X_train.select_dtypes(exclude=['object'])
drop_X_valid = X_valid.select_dtypes(exclude=['object'])
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
# The result will be 17952.5914...
2.2 Ordinal Encoder
OrdinalEncoder
- sklearn.preprocessing.OrdinalEncoder(*, categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='error', unknown_value=None)
- Methods
- fit() : learns the categories from the training data
- fit_transform() : fits the encoder and transforms the training data in one step; apply fit_transform(train) first, then transform(test)
- get_params() : get parameters for this estimator
- transform() : applies the fitted encoder to new (e.g. test) data
# Score from Approach 2 (Ordinal Encoding)
from sklearn.preprocessing import OrdinalEncoder
# Make copy to avoid changing original data
label_X_train = X_train.copy()
label_X_valid = X_valid.copy()
# Apply ordinal encoder to each column with categorical data
ordinal_encoder = OrdinalEncoder()
label_X_train[object_cols] = ordinal_encoder.fit_transform(X_train[object_cols])
label_X_valid[object_cols] = ordinal_encoder.transform(X_valid[object_cols])
print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
2.3 One-Hot Encoding
OneHotEncoder
- sklearn.preprocessing.OneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')
- Methods
- fit() : learns the categories from the training data
- fit_transform() : fits the encoder and transforms the training data in one step; apply fit_transform(train) first, then transform(test)
- get_params() : get parameters for this estimator
- transform() : applies the fitted encoder to new (e.g. test) data
from sklearn.preprocessing import OneHotEncoder
# Apply one-hot encoder to each column with categorical data
OH_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[object_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[object_cols]))
# One-hot encoding removed index; put it back
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
# Remove categorical columns (will replace with one-hot encoding)
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
# Add one-hot encoded columns to numerical features
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
3. Exercise : Categorical Variables
Step 1 : Drop columns with categorical data
# Fill in the lines below: drop columns in training and validation data
drop_X_train = X_train.select_dtypes(exclude = ['object'])
drop_X_valid = X_valid.select_dtypes(exclude = ['object'])
print("MAE from Approach 1 (Drop categorical variables):")
print(score_dataset(drop_X_train, drop_X_valid, y_train, y_valid))
print("Unique values in 'Condition2' column in training data:", X_train['Condition2'].unique())
print("\nUnique values in 'Condition2' column in validation data:", X_valid['Condition2'].unique())
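Printing the unique values of 'Condition2' shows why this check matters: the validation split can contain categories the training split never saw, and a fitted OrdinalEncoder cannot transform them. A minimal sketch of the failure, using toy values rather than the actual column:

```python
from sklearn.preprocessing import OrdinalEncoder

# Fit on two categories, then try to transform an unseen one
enc = OrdinalEncoder()          # default handle_unknown='error'
enc.fit([["A"], ["B"]])

try:
    enc.transform([["C"]])      # "C" never appeared during fit
    failed = False
except ValueError:
    failed = True
print("unseen category raises:", failed)   # True
```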
Step 2 : Ordinal encoding
from sklearn.preprocessing import OrdinalEncoder
# Categorical columns whose validation values all appear in the training data;
# these can be safely ordinal-encoded, and the rest must be dropped
good_label_cols = [col for col in object_cols if set(X_valid[col]).issubset(set(X_train[col]))]
bad_label_cols = list(set(object_cols) - set(good_label_cols))
# Drop categorical columns that will not be encoded
label_X_train = X_train.drop(bad_label_cols, axis=1)
label_X_valid = X_valid.drop(bad_label_cols, axis=1)
# Apply ordinal encoder
ordinal_encoder = OrdinalEncoder()
label_X_train[good_label_cols] = ordinal_encoder.fit_transform(label_X_train[good_label_cols])
label_X_valid[good_label_cols] = ordinal_encoder.transform(label_X_valid[good_label_cols])
print("MAE from Approach 2 (Ordinal Encoding):")
print(score_dataset(label_X_train, label_X_valid, y_train, y_valid))
Step 3 : Investigating cardinality
# Fill in the line below: How many categorical variables in the training data
# have cardinality greater than 10?
high_cardinality_numcols = len([fea for fea in X_train.columns if X_train[fea].nunique() > 10 and X_train[fea].dtype == 'object'])
# Fill in the line below: How many columns are needed to one-hot encode the
# 'Neighborhood' variable in the training data?
num_cols_neighborhood = 25
# Fill in the line below: How many entries are added to the dataset by
# replacing the column with a one-hot encoding?
OH_entries_added = 1e4*100 - 1e4
# Fill in the line below: How many entries are added to the dataset by
# replacing the column with an ordinal encoding?
label_entries_added = 0
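The two counts above follow from quick arithmetic on the assumed 10,000-row, 100-category column:

```python
rows, categories = 10_000, 100

# One-hot: the single column becomes 100 columns, so the dataset gains
# rows * categories entries and loses the original rows entries
OH_entries_added = rows * categories - rows
# Ordinal: each value is replaced in place, so no entries are added
label_entries_added = 0

print(OH_entries_added)   # 990000
```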
# Columns that will be one-hot encoded
low_cardinality_cols = [col for col in object_cols if X_train[col].nunique() < 10]
# Columns that will be dropped from the dataset
high_cardinality_cols = list(set(object_cols)-set(low_cardinality_cols))
print('Categorical columns that will be one-hot encoded:', low_cardinality_cols)
print('\nCategorical columns that will be dropped from the dataset:', high_cardinality_cols)
from sklearn.preprocessing import OneHotEncoder
# Use as many lines of code as you need!
OH_encoder = OneHotEncoder(handle_unknown = 'ignore', sparse = False)
OH_cols_train = pd.DataFrame(OH_encoder.fit_transform(X_train[low_cardinality_cols]))
OH_cols_valid = pd.DataFrame(OH_encoder.transform(X_valid[low_cardinality_cols]))
OH_cols_train.index = X_train.index
OH_cols_valid.index = X_valid.index
num_X_train = X_train.drop(object_cols, axis=1)
num_X_valid = X_valid.drop(object_cols, axis=1)
OH_X_train = pd.concat([num_X_train, OH_cols_train], axis=1)
OH_X_valid = pd.concat([num_X_valid, OH_cols_valid], axis=1)
print("MAE from Approach 3 (One-Hot Encoding):")
print(score_dataset(OH_X_train, OH_X_valid, y_train, y_valid))
Step 4 : Generate test predictions and submit your results
# Load the test data and keep the same columns used for training
# (test.csv path assumed to mirror the train.csv path above)
X_test_full = pd.read_csv('../../KAGGLE/Kaggle_House_Price/test.csv', index_col='Id')
X_test = X_test_full[my_cols].copy()
# Fill remaining missing values with a simple forward fill
X_test = X_test.fillna(method='ffill')
OH_cols_test = pd.DataFrame(OH_encoder.transform(X_test[low_cardinality_cols]))
OH_cols_test.index = X_test.index
num_X_test = X_test.drop(object_cols, axis=1)
OH_X_test = pd.concat([num_X_test, OH_cols_test], axis=1)
final_model = RandomForestRegressor(n_estimators=100, random_state=0)
final_model.fit(OH_X_train, y_train)
preds_test = final_model.predict(OH_X_test)
output = pd.DataFrame({'Id': OH_X_test.index,
'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
Source of the course: Kaggle Course, "Categorical Variables"