
[Data Processing and Analysis] K-Means: Hands-on Practice and Review

부루기 2024. 6. 13. 09:18


K-means_DA

[Data Processing and Analysis] K-Means

K-means is an unsupervised learning algorithm used mainly for data clustering. Its goal is to partition the data into K clusters so that the data points in each cluster lie as close as possible to that cluster's center.

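1. Main steps of the K-means algorithm

1-1. Initialization:

Set the number of clusters K. Initialize K centroids by picking them at random from the dataset.

1-2. Cluster assignment:

Assign each data point to its nearest centroid, forming K clusters. This is done by computing the distance between each data point and each centroid and assigning the point to the closest one.

Centroid update:

Update each cluster's centroid to the mean of the data points in that cluster.

1-3. Repeat:

Repeat the cluster assignment and centroid update steps until the cluster assignments no longer change (or until the centroids move only very slightly).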

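Termination conditions of the algorithm

The cluster assignment of every data point no longer changes.
The change in centroid positions is very small (typically below a predefined threshold).
The maximum number of iterations has been reached.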
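The steps and stopping rules above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the scikit-learn implementation: the function name `kmeans_sketch`, its parameters (`max_iter`, `tol`, `seed`), and the toy data are all assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means sketch: random init, assign, update, repeat."""
    rng = np.random.default_rng(seed)
    # Initialization: pick k distinct data points as the starting centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = np.full(len(X), -1)
    for n_iter in range(1, max_iter + 1):
        # Assignment: distance from every point to every centroid, then argmin.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Update: each centroid becomes the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[new_labels == j].mean(axis=0) if np.any(new_labels == j) else centroids[j]
            for j in range(k)
        ])
        shift = np.linalg.norm(new_centroids - centroids)
        unchanged = np.array_equal(new_labels, labels)
        centroids, labels = new_centroids, new_labels
        # Stop when assignments no longer change or the centroids barely move;
        # otherwise the loop ends when max_iter is reached.
        if unchanged or shift < tol:
            break
    return centroids, labels, n_iter

# Toy data (made up for illustration): two well-separated blobs.
rng = np.random.default_rng(42)
X_toy = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centroids, labels, n_iter = kmeans_sketch(X_toy, k=2)
print(centroids.round(2), n_iter)
```

Picking the initial centroids from the data points themselves is one simple choice; scikit-learn's `KMeans` uses k-means++ initialization by default, so results can differ.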

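Hands-on run

1. Load and preprocess the data

Load the data, check the scatter plot, and normalize. X_train, X_test, and y_train come from the train/test split shown in the notebook cells further down.

Normalization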

from sklearn import preprocessing

X_train_norm = preprocessing.normalize(X_train)
X_test_norm = preprocessing.normalize(X_test)

2. Training

2-1. K-means training

from sklearn.cluster import KMeans
import seaborn as sns

kmeans = KMeans(n_clusters = 3, random_state = 0).fit(X_train_norm)

sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = kmeans.labels_)

Check the data

sns.boxplot(x = kmeans.labels_, y = y_train['median_house_value'])

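3. Evaluation

3-1. Silhouette score

What is the silhouette score?

The silhouette score is a metric for evaluating clustering quality; it indicates how well each data point has been clustered. It is computed by considering both the cohesion of each data point within its own cluster and its separation from the other clusters.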

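Computing the silhouette score

The silhouette score s is computed for each data point i as follows:

Cohesion a(i):

a(i) is the mean distance between data point i and all other data points in the same cluster.
In other words, it measures the cohesion of the cluster that point i belongs to.

Separation b(i):

b(i) is the mean distance between data point i and the data points of the nearest other cluster.
In other words, it measures the distance to the nearest cluster that point i does not belong to.

Silhouette score s(i):

The silhouette score of each data point i is computed as:

$$ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\ b(i)\}} $$

The silhouette score ranges from -1 to 1.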

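A value close to 1 means the data point is well clustered.
A value close to 0 means the data point lies on the boundary between clusters.
A value close to -1 means the data point has probably been assigned to the wrong cluster.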
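As a sanity check on these definitions, the per-point values can be computed by hand with s(i) = (b(i) - a(i)) / max(a(i), b(i)) and compared against scikit-learn's `silhouette_samples`. The helper `silhouette_manual` and the six toy points are assumptions made for this sketch. Setting s(i) = 0 for a single-point cluster follows the usual convention.

```python
import numpy as np
from sklearn.metrics import silhouette_samples

def silhouette_manual(X, labels):
    """Per-point silhouette values from the a(i), b(i) definitions (Euclidean)."""
    # Pairwise Euclidean distance matrix.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    s = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: s(i) = 0 for a single-point cluster
        # a(i): mean distance to the other points in the same cluster
        # (D[i, i] is 0, so dividing by n_same - 1 excludes the point itself).
        a = D[i, same].sum() / (n_same - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s

# Toy data (made up): two tight groups of three points each.
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_toy = np.array([0, 0, 0, 1, 1, 1])

s_manual = silhouette_manual(X_toy, labels_toy)
print(s_manual.round(3))
print(np.allclose(s_manual, silhouette_samples(X_toy, labels_toy)))
```

`silhouette_score`, used in the cell that follows, is the mean of these per-point values.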

from sklearn.metrics import silhouette_score

silhouette_score(X_train_norm, kmeans.labels_, metric='euclidean')

4. Tuning

4-1. Finding a suitable K

Compute the silhouette score for every candidate K and treat the model with the highest score as the best model.

K = range(2, 8)
fits = []
score = []


for k in K:
    # train the model for current value of k on training data
    model = KMeans(n_clusters = k, random_state = 0).fit(X_train_norm)

    # append the model to fits
    fits.append(model)

    # Append the silhouette score to scores
    score.append(silhouette_score(X_train_norm, model.labels_, metric='euclidean'))

Check the results

sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[0].labels_)

sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[3].labels_)

sns.lineplot(x = K, y = score)

In this lecture, you will learn about k-means clustering. We'll cover:

How the k-means clustering algorithm works
How to visualize data to determine if it is a good candidate for clustering
A case study of training and tuning a k-means clustering model using a real-world California housing dataset.

Note that this should not be confused with k-nearest neighbors, and readers wanting that should go to k-Nearest Neighbors (KNN) Classification with scikit-learn in Python instead.

This is useful to know because k-means clustering is a popular clustering algorithm that does a good job of grouping spherical data into distinct groups. It is valuable both as an analysis tool when the groupings of rows of data are unclear and as a feature-engineering step for improving supervised learning models.

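The price has nothing to do with K-means here; it is only shown for illustration.

GMM is harder than expected: you first need to cover the multivariate material and understand EM. -> GMM is a mixture model.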

Dataset

We will be using California housing data from Kaggle. We will use location data (latitude and longitude) as well as the median house value. We will cluster the houses by location and observe how house prices fluctuate across California. We save the dataset as a csv file called ‘housing.csv’ in our working directory and read it using pandas.

In [ ]:

import pandas as pd

home_data = pd.read_csv('housing.csv', usecols = ['longitude', 'latitude', 'median_house_value'])
home_data.head()

Out[ ]:

	longitude	latitude	median_house_value
0	-122.23	37.88	452600.0
1	-122.22	37.86	358500.0
2	-122.24	37.85	352100.0
3	-122.25	37.85	341300.0
4	-122.25	37.85	342200.0

k-Means Clustering Workflow

We will focus on collecting and splitting the data (in data preparation) and hyperparameter tuning, training your model, and assessing model performance (in modeling). Much of the work involved in unsupervised learning algorithms lies in the hyperparameter tuning and assessing performance to get the best results from your model.

Visualize the Data

In [ ]:

import seaborn as sns

sns.scatterplot(data = home_data, x = 'longitude', y = 'latitude', hue = 'median_house_value')

Out[ ]:

<AxesSubplot:xlabel='longitude', ylabel='latitude'>


Normalizing the data

When working with distance-based algorithms, like k-Means Clustering, we should normalize the data. If we do not normalize the data, variables with different scales will be weighted differently in the distance formula that is being optimized during training. For example, if we were to include price in the clustering, in addition to latitude and longitude, price would have an outsized impact on the optimization because its scale is significantly larger and wider than that of the bounded location variables.

In [ ]:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(home_data[['latitude', 'longitude']], home_data[['median_house_value']], test_size=0.33, random_state=0)

In [ ]:

from sklearn import preprocessing

X_train_norm = preprocessing.normalize(X_train)
X_test_norm = preprocessing.normalize(X_test)

In [ ]:

X_train_norm

Out[ ]:

array([[ 0.26937372, -0.96303572],
       [ 0.2762676 , -0.96108075],
       [ 0.27623513, -0.96109009],
       ...,
       [ 0.28741997, -0.95780466],
       [ 0.27416102, -0.9616838 ],
       [ 0.27304949, -0.96199999]])
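One detail worth checking when reading the output above: `preprocessing.normalize` rescales each sample (row) to unit L2 norm, which is why every row has length 1. Per-feature scaling is done by a scaler such as `StandardScaler`. The sketch below only contrasts the two on three made-up coordinate pairs. It is not a change to the notebook's pipeline, and which one is appropriate depends on the data.

```python
import numpy as np
from sklearn import preprocessing

# Three made-up (latitude, longitude) rows, similar in scale to the data above.
X = np.array([[37.88, -122.23],
              [34.05, -118.24],
              [32.72, -117.16]])

# normalize: divides each ROW by its own L2 norm, so every row has length 1.
X_rows = preprocessing.normalize(X)
print(np.linalg.norm(X_rows, axis=1))

# StandardScaler: rescales each COLUMN to zero mean and unit variance.
X_cols = preprocessing.StandardScaler().fit_transform(X)
print(X_cols.mean(axis=0).round(6), X_cols.std(axis=0).round(6))
```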

Fitting and Evaluating the Model

For the first iteration, we will arbitrarily choose a number of clusters (referred to as k) of 3. Building and fitting models in sklearn is very simple. We create an instance of KMeans, set the number of clusters with the n_clusters parameter, and set random_state to 0 so we get the same result each time we run the code. The n_init parameter, which defines how many times the algorithm is run with different centroid seeds, is left at its default in the code below; it can be set to “auto” in scikit-learn versions that support it. We can then fit the model to the normalized training data using the fit() method.

In [ ]:

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters = 3, random_state = 0).fit(X_train_norm)

sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = kmeans.labels_)

Out[ ]:

<AxesSubplot:xlabel='longitude', ylabel='latitude'>

We see that the data are now clearly split into 3 distinct groups (Northern California, Central California, and Southern California). We can also look at the distribution of median house prices in these 3 groups using a boxplot.

In [ ]:

sns.boxplot(x = kmeans.labels_, y = y_train['median_house_value'])

Out[ ]:

<AxesSubplot:ylabel='median_house_value'>

We clearly see that the Northern and Southern clusters have similar distributions of median house values (clusters 0 and 2) that are higher than the prices in the central cluster (cluster 1).
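Beyond `labels_`, a fitted `KMeans` object exposes the learned centroids and can assign new points to them. The sketch below uses synthetic blobs as a stand-in for the housing data (an assumption for illustration, with `n_init=10` set explicitly so it behaves the same across scikit-learn versions).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for the training data: three blobs of 2-D points.
X_demo = np.vstack([rng.normal(c, 0.3, (60, 2)) for c in (0.0, 3.0, 6.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_demo)

print(km.cluster_centers_)   # one row per centroid, shape (3, 2)
print(km.inertia_)           # sum of squared distances to the nearest centroid
print(km.n_iter_)            # iterations used by the best run

# Assign previously unseen points to the nearest learned centroid.
new_points = np.array([[0.1, -0.1], [5.9, 6.2]])
print(km.predict(new_points))
```

Note that the integer cluster ids are arbitrary: the same grouping can come out with different numbers under a different seed.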

We can evaluate the performance of the clustering algorithm using a silhouette score, available in sklearn.metrics, where a higher score (closer to 1) represents a better fit.

In [ ]:

from sklearn.metrics import silhouette_score

silhouette_score(X_train_norm, kmeans.labels_, metric='euclidean')

Out[ ]:

0.7499371920703546
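The notebook creates a held-out split but scores only the training data. One possible way to use the held-out part, sketched here on synthetic blobs rather than the housing data, is to label the test points with `predict` and compute the silhouette score on them as well. The variable names `X_tr` and `X_te` are assumptions for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
# Synthetic stand-in data: three blobs of 2-D points.
X_demo = np.vstack([rng.normal(c, 0.3, (100, 2)) for c in (0.0, 3.0, 6.0)])
X_tr, X_te = train_test_split(X_demo, test_size=0.33, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_tr)

# Score the training assignments and the assignments predicted for unseen points.
train_score = silhouette_score(X_tr, km.labels_, metric='euclidean')
test_score = silhouette_score(X_te, km.predict(X_te), metric='euclidean')
print(round(train_score, 3), round(test_score, 3))
```

A test score far below the training score would suggest that the clusters do not generalize well to new points.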

Choosing the best number of clusters

The weakness of k-means clustering is that we don’t know how many clusters we need by just running the model. We need to test a range of values and make a decision on the best value of k. A common approach is the Elbow method, which looks for the number of clusters where we are neither overfitting the data with too many clusters nor underfitting with too few; in the code below we compare silhouette scores across k for the same purpose.

We create the below loop to test and store different model results so that we can make a decision on the best number of clusters.

In [ ]:

K = range(2, 8)
fits = []
score = []


for k in K:
    # train the model for current value of k on training data
    model = KMeans(n_clusters = k, random_state = 0).fit(X_train_norm)

    # append the model to fits
    fits.append(model)

    # Append the silhouette score to scores
    score.append(silhouette_score(X_train_norm, model.labels_, metric='euclidean'))

In [ ]:

sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[0].labels_)

Out[ ]:

<AxesSubplot:xlabel='longitude', ylabel='latitude'>

In [ ]:

sns.scatterplot(data = X_train, x = 'longitude', y = 'latitude', hue = fits[3].labels_)

Out[ ]:

<AxesSubplot:xlabel='longitude', ylabel='latitude'>

In [ ]:

sns.lineplot(x = K, y = score)

Out[ ]:

<AxesSubplot:>
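Instead of reading the best k off the line plot, it can be picked programmatically as the k with the highest silhouette score. The sketch below repeats the loop on synthetic data with four well-separated blobs (the centers are made up for illustration), so it does not reproduce the housing result.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: four tight blobs at the corners of a square (illustrative only).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

K = range(2, 8)
score = []
for k in K:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    score.append(silhouette_score(X_demo, model.labels_, metric='euclidean'))

# The candidate k with the highest silhouette score.
best_k = K[int(np.argmax(score))]
print(best_k)
```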

In [ ]:

sns.boxplot(x = fits[3].labels_, y = y_train['median_house_value'])

Out[ ]:

<AxesSubplot:ylabel='median_house_value'>
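The Elbow method mentioned at the start of this section is usually read from the model's inertia (the within-cluster sum of squared distances, exposed as `inertia_`) rather than from the silhouette score. A rough sketch on synthetic data, with made-up blob centers, looks like this:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: four tight blobs at the corners of a square (illustrative only).
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
X_demo, _ = make_blobs(n_samples=300, centers=centers, cluster_std=0.5, random_state=0)

K = range(1, 8)
inertia = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo).inertia_
           for k in K]

# Drop in inertia gained by each extra cluster; the "elbow" is where it flattens.
drops = -np.diff(inertia)
for k, d in zip(list(K)[1:], drops):
    print(k, round(d, 1))
```

Inertia keeps shrinking as k grows, so the point of interest is where the improvement becomes small, not the minimum itself.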


License: Attribution, NonCommercial, NoDerivatives
