1. Introduction

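Principal Component Analysis (PCA) is, like clustering, a way of partitioning a dataset. Where clustering partitions the data points themselves, PCA partitions the variation in the data in order of importance. It focuses more on discovering relationships among the features, and it condenses the data based on their covariance.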

2. Principal Component Analysis

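PCA is a dimensionality reduction method: it maps high-dimensional data to a lower-dimensional representation, reducing the number of variables while retaining as much of the information in the training data as possible. PCA can be used for visualization, noise removal, and improving model performance.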

https://bkshin.tistory.com/entry/%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-9-PCA-Principal-Components-Analysis

Linked post: Machine Learning - 9. Dimensionality Reduction and PCA (Principal Components Analysis)

In the course example, the 'Height' and 'Diameter' values are linearly combined into two new features, 'Size' and 'Shape'.

df['Size'] = 0.707 * X['Height'] + 0.707 * X['Diameter']
df['Shape'] = 0.707 * X['Height'] - 0.707 * X['Diameter']

These new variables are called the principal components of the data, and the weights are called loadings.

Features \ Components	Size(PC1)	Shape(PC2)
Height	0.707	0.707
Diameter	0.707	-0.707
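To make the loadings table concrete, here is a minimal NumPy sketch that treats the table as a matrix and computes both components in one multiplication. The measurement values are made up for illustration, and the raw values are used directly as in the formulas above (an actual PCA would center the data first).

```python
import numpy as np

# Hypothetical measurements (made-up values): columns are Height, Diameter.
X = np.array([
    [0.10, 0.40],
    [0.15, 0.50],
    [0.20, 0.55],
    [0.12, 0.35],
])

# Loadings from the table: rows are features, columns are components.
loadings = np.array([
    [0.707,  0.707],   # Height
    [0.707, -0.707],   # Diameter
])

# Each component is a weighted sum of the original features.
components = X @ loadings
size, shape = components[:, 0], components[:, 1]

# The two loading vectors are orthogonal: their dot product is zero.
dot = loadings[:, 0] @ loadings[:, 1]
```

'Size' adds the two features together, while 'Shape' contrasts them, which is why its Diameter loading is negative.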

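A common rule of thumb for deciding how many principal components to keep is to retain components, in order, until their cumulative explained variance ratio reaches 85% or more.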
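The 85% rule can be sketched as a small helper. The function name n_components_for and the example ratios are hypothetical; with scikit-learn, the ratios would come from the fitted model's explained_variance_ratio_ attribute.

```python
import numpy as np

def n_components_for(explained_variance_ratio, threshold=0.85):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # searchsorted returns the first index where cumulative >= threshold.
    k = int(np.searchsorted(cumulative, threshold)) + 1
    # Guard against rounding error when the threshold is close to 1.0.
    return min(k, len(cumulative))

# Made-up explained variance ratios, sorted from PC1 downwards.
ratios = np.array([0.60, 0.20, 0.10, 0.07, 0.03])
k = n_components_for(ratios)
```

Here the cumulative ratios are 0.60, 0.80, 0.90, ..., so the first three components are kept.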

3. PCA for Feature Engineering

There are two ways you could use PCA for feature engineering.

The first way is to use it as a descriptive technique. Since the components tell you about the variation, you could compute the MI scores for the components and see what kind of variation is most predictive of your target. That could give you ideas for kinds of features to create -- a product of 'Height' and 'Diameter' if 'Size' is important, say, or a ratio of 'Height' and 'Diameter' if 'Shape' is important. You could even try clustering on one or more of the high-scoring components.
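As a rough sketch of the descriptive approach, the snippet below scores each component against a target with scikit-learn's mutual_info_regression. The data and the target are synthetic stand-ins, not the course dataset, and the names size_factor and shape_factor are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for 'Height' and 'Diameter' (made-up data).
size_factor = rng.normal(size=n)               # moves both features together
shape_factor = rng.normal(scale=0.3, size=n)   # moves them in opposite directions
X = pd.DataFrame({
    "Height": size_factor + shape_factor,
    "Diameter": size_factor - shape_factor,
})
# Hypothetical target that depends mostly on overall size.
y = 2.0 * size_factor + rng.normal(scale=0.1, size=n)

# Standardize, then compute the principal components.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
pca = PCA()
X_pca = pd.DataFrame(pca.fit_transform(X_scaled), columns=["PC1", "PC2"])

# Mutual information between each component and the target.
mi = mutual_info_regression(X_pca, y, random_state=0)
mi_scores = pd.Series(mi, index=X_pca.columns).sort_values(ascending=False)
```

In this setup the size-like component (PC1) scores far higher than the shape-like one, which would suggest trying size-type features such as a product of the two measurements.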

The second way is to use the components themselves as features. Because the components expose the variation structure of the data directly, they can often be more informative than the original features. Here are some use-cases:

Dimensionality reduction
Anomaly detection
Noise reduction
Decorrelation
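The decorrelation use-case can be checked with a few lines of NumPy. This is a minimal sketch on made-up data, computing the components through an SVD rather than through scikit-learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: two strongly correlated features.
a = rng.normal(size=200)
X = np.column_stack([a, 0.8 * a + 0.2 * rng.normal(size=200)])

# Standardize, then take the principal axes from the SVD.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
_, _, Vt = np.linalg.svd(X_scaled, full_matrices=False)
X_pca = X_scaled @ Vt.T   # component scores

# Correlation between the two columns before and after the transform.
corr_before = np.corrcoef(X_scaled, rowvar=False)[0, 1]
corr_after = np.corrcoef(X_pca, rowvar=False)[0, 1]

# Dimensionality reduction: keep only the first component.
X_reduced = X_pca[:, :1]
```

The original features are highly correlated, while the component scores are uncorrelated up to floating-point error. Keeping only the leading columns of X_pca gives the reduced representation.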

3.1 PCA Best Practices

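PCA applies to continuous (numeric) variables.
PCA is sensitive to scale, so standardize the data before applying it.
Outliers should be removed, since they can have an undue influence on the result.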
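The scale sensitivity is easy to see on a small example. The sketch below uses made-up data for two features on very different scales; first_axis is a hypothetical helper, not a library function.

```python
import numpy as np

def first_axis(X):
    """Return the first principal axis (a unit vector) of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

rng = np.random.default_rng(0)

# Made-up data: a weight in the thousands and a fuel economy in the tens.
weight = rng.normal(1500, 300, size=300)
mpg = 30 - 0.01 * (weight - 1500) + rng.normal(0, 3, size=300)
X = np.column_stack([weight, mpg])

# Without scaling, the large-valued feature dominates the first axis.
raw_axis = first_axis(X)

# After standardizing, both features contribute equally (up to sign).
scaled_axis = first_axis(X / X.std(axis=0, ddof=1))
```

On the raw data the first axis points almost entirely along the weight feature. After standardization the loadings have equal magnitude, about 0.707 each.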

import pandas as pd
from sklearn.decomposition import PCA

# df is assumed to be the course's automobile dataset, already loaded
features = ['highway_mpg', 'engine_size', 'horsepower', 'curb_weight']

X = df.copy()
y = X.pop('price')

X = X.loc[:, features]

# Standardize (PCA is sensitive to scale)

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

# Create principal components

pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Convert to dataframe

component_names = [f"PC{i+1}" for i in range(X_pca.shape[1])]
X_pca = pd.DataFrame(X_pca, columns=component_names)

# Shows relationships among features

loadings = pd.DataFrame(
    pca.components_.T,  # transpose the matrix of loadings
    columns=component_names,  # so the columns are the principal components
    index=X.columns,  # and the rows are the original features
)

loadings

After fitting, the PCA instance contains the loadings in its components_ attribute.
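As a quick check of what components_ holds, the sketch below fits PCA on random made-up data and reproduces the transformed values by hand. It assumes the default settings (no whitening).

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Made-up data with three correlated features.
mixing = np.array([
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 0.5],
    [0.0, 0.0, 1.0],
])
X = rng.normal(size=(100, 3)) @ mixing

pca = PCA()
X_pca = pca.fit_transform(X)

# components_ has one row per component and one column per original feature,
# which is why the loadings table is built from its transpose.
n_components, n_features = pca.components_.shape

# Projecting the centered data onto the components reproduces the scores.
X_manual = (X - pca.mean_) @ pca.components_.T
```

Each row of components_ is a unit-length loading vector, and the rows are mutually orthogonal.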

4. Exercise : Principal Component Analysis

Step 1 : Create New Features

X = df.copy()
y = X.pop("SalePrice")

# YOUR CODE HERE: Add new features to X.
X = X.join(X_pca)

score = score_dataset(X, y)
print(f"Your score: {score:.5f} RMSLE")

Source of the course: Kaggle Course, Principal Component Analysis
