K-Means Clustering Algorithm

🧠 What is K-Means Clustering?

In data science and machine learning, K-Means Clustering is a powerful unsupervised learning technique that groups unlabelled data into meaningful clusters. It’s a go-to method when you want to discover patterns in your data without prior labels.

At its core, the algorithm aims to partition data into K clusters, where each data point belongs to the cluster with the nearest centroid, helping to uncover hidden groupings in data.

Example: If K=3, the algorithm creates three clusters and assigns each data point to one of these clusters based on similarity.
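The K=3 example can be sketched in a few lines with scikit-learn; the toy 2-D points below are made up purely for illustration:

```python
# A quick illustration of K = 3 on made-up 2-D data
import numpy as np
from sklearn.cluster import KMeans

# Three visually separated pairs of points
points = np.array([[1, 1], [1, 2], [8, 8], [9, 8], [15, 1], [16, 2]])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)
print(kmeans.labels_)  # each pair of nearby points shares a cluster label
```

Each point receives the label of the cluster whose centroid is nearest, so the three pairs end up in three separate clusters.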

🛠️ How Does the K-Means Algorithm Work?

K-Means is an iterative algorithm, and its process can be broken down into the following steps:

  1. Choose K (number of clusters).
  2. Initialize K centroids randomly (can be points in or outside the dataset).
  3. Assign each data point to the closest centroid.
  4. Recompute each centroid as the mean of the points in its cluster.
  5. Repeat steps 3 and 4 until the centroids stop changing (convergence).

📍Goal: Minimize the sum of squared distances between points and their corresponding cluster centroids.
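The five steps above can be sketched from scratch with NumPy. This is illustrative only (function and variable names are my own); in practice you would use scikit-learn's `KMeans`, as shown later in this post:

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=42):
    rng = np.random.default_rng(seed)
    # Step 2: initialize K centroids by picking k random data points
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 3: assign each point to its nearest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its cluster
        new_centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

Note that this sketch omits details a production implementation handles, such as empty clusters and multiple restarts (`n_init` in scikit-learn).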

🎯 Visualizing the Steps

Let’s say we have two variables M1 and M2 plotted on a scatter plot. If we set K=2, we randomly choose two centroids and assign points based on their proximity. We then:

  • Calculate new centroids for each cluster.
  • Reassign data points.
  • Repeat until no further changes occur.

Eventually, we get well-separated clusters where intra-cluster similarities are high and inter-cluster differences are clear.

❓How to Choose the Right Value of K?

It’s crucial to select the appropriate number of clusters (K). One of the most popular techniques to determine the optimal value of K is:

🔍 Elbow Method

The Elbow Method involves:

  1. Running K-Means for K = 1 to 10.
  2. Calculating WCSS (Within-Cluster Sum of Squares) for each K.
  3. Plotting the WCSS values against K.
  4. Locating the “elbow”, the point where the curve bends sharply.

📉 That “elbow” point gives us the optimal K.

WCSS Formula:
WCSS = Σ (distance of each point from its cluster centroid)²
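The formula above translates directly into a few lines of NumPy (a minimal sketch; the names are illustrative). This is the same quantity scikit-learn exposes as `inertia_`:

```python
import numpy as np

def wcss(points, labels, centroids):
    # Squared distance from each point to the centroid of its cluster,
    # summed over all points
    diffs = points - centroids[labels]
    return float((diffs ** 2).sum())
```

For example, two points at distance 1 on either side of their centroid contribute 1² + 1² = 2 to the total.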

🐍 Python Implementation of K-Means Clustering

Let’s walk through a practical implementation using Python. We’ll use the Mall Customers dataset, which includes customer data like age, income, and spending score.

📦 Step 1: Data Preprocessing

# Importing necessary libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Loading the dataset
dataset = pd.read_csv('Mall_Customers_data.csv')

# Selecting features (Annual Income and Spending Score)
x = dataset.iloc[:, [3, 4]].values

📈 Step 2: Finding Optimal K using Elbow Method

from sklearn.cluster import KMeans

wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', n_init=10, random_state=42)
    kmeans.fit(x)
    wcss.append(kmeans.inertia_)

# Plotting the results
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()

The “elbow” point on the resulting graph marks your ideal K.

🚀 Step 3: Applying K-Means to the Dataset

# Let's say the optimal K is 5
kmeans = KMeans(n_clusters=5, init='k-means++', n_init=10, random_state=42)
y_kmeans = kmeans.fit_predict(x)

🎨 Step 4: Visualizing the Clusters

# Visualizing the clusters
plt.scatter(x[y_kmeans == 0, 0], x[y_kmeans == 0, 1], s=100, c='red', label='Cluster 1')
plt.scatter(x[y_kmeans == 1, 0], x[y_kmeans == 1, 1], s=100, c='blue', label='Cluster 2')
plt.scatter(x[y_kmeans == 2, 0], x[y_kmeans == 2, 1], s=100, c='green', label='Cluster 3')
plt.scatter(x[y_kmeans == 3, 0], x[y_kmeans == 3, 1], s=100, c='cyan', label='Cluster 4')
plt.scatter(x[y_kmeans == 4, 0], x[y_kmeans == 4, 1], s=100, c='magenta', label='Cluster 5')

# Plotting centroids
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
            s=300, c='yellow', label='Centroids')

plt.title('Clusters of mall customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1–100)')
plt.legend()
plt.show()

🧠 Summary

  • K-Means is a straightforward yet effective clustering approach.
  • Works well when groups in the data are distinct and well-separated.
  • Scales well to large datasets.
  • To determine the ideal number of clusters, apply techniques such as the Elbow Method.
  • K-Means assumes clusters are spherical and evenly sized, so it may not perform well on more complex shapes or distributions.

🧩 Final Thoughts

K-Means Clustering is one of the first tools data scientists reach for when uncovering patterns in unlabeled data. Whether it’s customer segmentation, image compression, or market research, K-Means offers a clean and efficient way to group similar items and gain insights.

