CUSTOMER SEGMENTATION
K-Means & Hierarchical clustering on 200 mall customers. The two algorithms converge on the same five segments (ARI = 0.942).
5
Segments
0.942
ARI
$15k–$137k
Income Range
Exploring the data
The dataset is 200 mall customers from Kaggle with five columns: CustomerID, Gender, Age, Annual Income (k$), and Spending Score (1-100). The spending score is assigned by the mall based on purchase frequency and basket size.
Looking at the raw data, two features immediately stand out as useful for behavioral segmentation: Annual Income and Spending Score. Age and Gender tell us who the customers are, but not how they behave, so they're used for profiling after clustering, not as inputs.
Plotting income vs. spending with no labels at all reveals obvious structure: five blobs that practically label themselves. This is the kind of dataset where clustering will work well: the groups are visually separable before any algorithm touches it.
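A minimal sketch of that exploratory plot, assuming the Kaggle column names quoted above and matplotlib for the figure (the write-up's own chart may have been produced differently):

```python
# Unlabeled income-vs-spending scatter: the five blobs are visible
# before any clustering algorithm runs.
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Mall_Customers.csv')

plt.figure(figsize=(7, 5))
plt.scatter(df['Annual Income (k$)'], df['Spending Score (1-100)'],
            s=30, alpha=0.7)
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.title('Mall customers: income vs. spending (no labels)')
plt.show()
```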
Cluster Assignments
Why these two methods
We don't have labels. Nobody told us how many groups there are or which customers belong to which. That makes this an unsupervised problem, and clustering is the natural tool. But which algorithm?
K-Means is the obvious first choice: fast, simple, and works well on convex clusters. But it requires you to pick K upfront, and the result depends on random initialization. Running a second algorithm as a sanity check is good practice.
Hierarchical clustering (Ward linkage) builds a tree of merges from the bottom up with no random seed and no K required. If two fundamentally different algorithms agree on the same segments, we can be confident the structure is real.
Only two features are used. With just income and spending, Euclidean distance is meaningful and interpretable. Adding age or gender would require careful thought about feature scaling and whether those dimensions actually help separate spending behavior (they don't, in this dataset). We standardize both features with StandardScaler so income ($15k–$137k) and spending score (1–99) contribute equally.
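A quick sanity check of the scaling claim, as a sketch: after StandardScaler, both features have roughly zero mean and unit variance, so a one-standard-deviation move in either feature contributes equally to Euclidean distance. The column names are the ones quoted above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('Mall_Customers.csv')
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0).round(3))  # ~[0. 0.]
print(X_scaled.std(axis=0).round(3))   # ~[1. 1.]
```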
Distance & Objective
K-Means relies on Euclidean distance to assign points to the nearest centroid, and minimizes the total within-cluster sum of squares (WCSS) to find compact groupings.
Euclidean Distance
The straight-line distance between two points. K-Means uses this to assign each customer to the nearest centroid. With StandardScaler applied, both features contribute equally regardless of their original scale.
Within-Cluster Sum of Squares (WCSS)
The total variance within all clusters. K-Means iteratively minimizes this by updating centroids and reassigning points. The elbow method plots WCSS vs K to find where adding more clusters gives diminishing returns.
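To tie the two definitions to the library, here is a sketch that recomputes WCSS by hand and compares it to scikit-learn's `inertia_`. It uses K=5 for concreteness (the choice of K is covered in the next section); `n_init=10` and `random_state=42` are illustrative settings, not taken from the original write-up.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv('Mall_Customers.csv')
X_scaled = StandardScaler().fit_transform(
    df[['Annual Income (k$)', 'Spending Score (1-100)']])

km = KMeans(n_clusters=5, n_init=10, random_state=42).fit(X_scaled)

# WCSS by hand: squared Euclidean distance from each point to its
# assigned centroid, summed over all points.
diffs = X_scaled - km.cluster_centers_[km.labels_]
wcss = (diffs ** 2).sum()
print(wcss, km.inertia_)  # should match up to floating-point error
```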
Finding K
The elbow method plots inertia (within-cluster variance) against K. We're looking for the point where adding another cluster stops giving meaningful improvement, the "elbow." Silhouette analysis confirms the choice by measuring how well-separated the clusters are.
Both methods point to K=5. The elbow is clear at K=5, and the silhouette score peaks there at 0.55. Beyond K=5, we'd be splitting natural groups for marginal variance reduction.
Elbow Method
Inertia drops sharply up to K=5, then flattens. The “elbow” at K=5 indicates the optimal number of clusters.
Silhouette Scores
K=5 achieves the highest silhouette score (0.55), confirming well-separated clusters.
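The sweep behind both plots can be sketched in a few lines: fit K-Means for a range of K, record inertia for the elbow curve and the silhouette score for separation. Exact values will vary slightly with the seed; `random_state=42` is an assumption here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

df = pd.read_csv('Mall_Customers.csv')
X_scaled = StandardScaler().fit_transform(
    df[['Annual Income (k$)', 'Spending Score (1-100)']])

# Silhouette needs at least 2 clusters, so start the sweep at K=2
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    sil = silhouette_score(X_scaled, km.labels_)
    print(f'K={k}: inertia={km.inertia_:.1f}, silhouette={sil:.3f}')
```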
Cross-validation with hierarchical clustering
Ward's linkage merges the pair of clusters that causes the smallest increase in total variance at each step, the same objective K-Means optimizes but solved greedily from the bottom up instead of iteratively from random centroids.
Cutting the dendrogram at the right height yields 5 clusters. The colored branches below the cut line correspond to our segments.
Dendrogram (Ward Linkage)
Ward linkage merges clusters to minimize within-cluster variance. Cutting the dendrogram at the dashed line yields the same 5 segments as K-Means.
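A sketch of the dendrogram construction using SciPy's Ward linkage; cutting the tree into five flat clusters should reproduce the segments. This is one reasonable way to build the figure, not necessarily the exact code behind it.

```python
import pandas as pd
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster
from sklearn.preprocessing import StandardScaler

df = pd.read_csv('Mall_Customers.csv')
X_scaled = StandardScaler().fit_transform(
    df[['Annual Income (k$)', 'Spending Score (1-100)']])

# Bottom-up merges under Ward's minimum-variance criterion
Z = linkage(X_scaled, method='ward')

plt.figure(figsize=(10, 4))
dendrogram(Z, no_labels=True)
plt.title('Ward linkage dendrogram')
plt.show()

# Cut the tree into 5 flat clusters
hc_labels = fcluster(Z, t=5, criterion='maxclust')
```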
Linkage & Cluster Quality
Hierarchical clustering with Ward's linkage greedily merges cluster pairs to minimize variance increase. The silhouette coefficient then validates how well-separated the resulting clusters are.
Ward's Linkage Criterion
Ward's method merges the pair of clusters that causes the smallest increase in total within-cluster variance. It tends to produce compact, similarly-sized clusters, which is why our hierarchical results closely match K-Means (ARI = 0.942).
Silhouette Coefficient
Measures how similar a point is to its own cluster versus the nearest neighboring cluster. For each point i, a(i) is the mean intra-cluster distance, b(i) is the mean distance to the nearest other cluster, and s(i) = (b(i) - a(i)) / max(a(i), b(i)). Scores range from -1 to 1. Our K=5 solution averages ~0.55, indicating reasonably well-separated clusters.
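For the per-point view behind that ~0.55 average, a short sketch using `silhouette_samples` (same illustrative K=5 fit and seed as above):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

df = pd.read_csv('Mall_Customers.csv')
X_scaled = StandardScaler().fit_transform(
    df[['Annual Income (k$)', 'Spending Score (1-100)']])
labels = KMeans(n_clusters=5, n_init=10, random_state=42).fit_predict(X_scaled)

s = silhouette_samples(X_scaled, labels)             # one s(i) per customer
print(f'mean silhouette: {silhouette_score(X_scaled, labels):.3f}')
print(pd.Series(s).groupby(labels).mean().round(3))  # per-cluster averages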
The five segments
The clusters map to five distinct customer archetypes. Each has a clear behavioral pattern and an actionable marketing strategy, which is the whole point of segmentation.
The most interesting finding: "Window Shoppers," high earners who barely spend at the mall. They have the income to be top customers but aren't engaged. That's the segment with the highest ROI potential for targeted campaigns.
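The profiling step mentioned earlier (Age and Gender describe the segments after clustering, never as inputs) can be sketched with a groupby. The archetype names in the cards below come from inspecting profiles like this one; the fit settings are the same illustrative ones used above.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

df = pd.read_csv('Mall_Customers.csv')
X_scaled = StandardScaler().fit_transform(
    df[['Annual Income (k$)', 'Spending Score (1-100)']])
df['Cluster'] = KMeans(n_clusters=5, n_init=10,
                       random_state=42).fit_predict(X_scaled)

# Mean profile per cluster: size, age, income, spending
profile = df.groupby('Cluster').agg(
    size=('CustomerID', 'count'),
    mean_age=('Age', 'mean'),
    mean_income=('Annual Income (k$)', 'mean'),
    mean_spending=('Spending Score (1-100)', 'mean'),
).round(1)
print(profile)
```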
High Rollers
Affluent customers who spend freely. The most valuable segment with both high income and high spending scores.
Strategy
VIP loyalty programs, early access to premium products, personalized concierge services.
Budget Enthusiasts
Lower income but high spending tendency. Likely younger demographic prioritizing lifestyle over savings.
Strategy
Installment plans, discount bundles, credit-friendly promotions to sustain engagement.
Average Joes
The moderate middle. Neither extreme in income nor spending; the largest and most stable segment.
Strategy
Seasonal promotions, loyalty point systems, cross-selling mid-tier products.
Window Shoppers
High earners who barely spend. Possibly price-conscious or shopping elsewhere. An untapped opportunity.
Strategy
Targeted campaigns highlighting value, exclusive in-store experiences, premium product sampling.
Penny Pinchers
Limited income and minimal spending. Low immediate value, but can be cultivated with the right incentives.
Strategy
Budget product lines, email marketing with deals, referral rewards programs.
Do both methods agree?
The Adjusted Rand Index measures how well two clustering solutions agree, corrected for chance. An ARI of 1.0 means perfect agreement; 0.0 means no better than random. Our two methods score 0.942, with only 8 of 200 customers assigned differently.
The agreement matrix shows an almost perfectly diagonal structure. The few disagreements are boundary customers near cluster edges, exactly where you'd expect ambiguity.
Adjusted Rand Index
Near-perfect agreement between K-Means and Hierarchical clustering. An ARI of 1.0 means identical assignments; 0.0 means random. Our 0.942 confirms the 5-segment structure is robust.
Cluster Agreement Matrix
| K-Means (rows) \ Hierarchical (columns) | High Inc / High Spend | Low Inc / High Spend | Average | High Inc / Low Spend | Low Inc / Low Spend |
|---|---|---|---|---|---|
| High Inc / High Spend | 34 | 0 | 1 | 0 | 0 |
| Low Inc / High Spend | 0 | 20 | 2 | 0 | 0 |
| Average | 0 | 2 | 77 | 2 | 0 |
| High Inc / Low Spend | 0 | 0 | 1 | 38 | 0 |
| Low Inc / Low Spend | 0 | 0 | 0 | 0 | 23 |
The near-diagonal structure shows K-Means and Hierarchical agree on almost every customer. Only 8 of 200 customers are assigned differently.
Agreement Metric
The Adjusted Rand Index measures how well two independent clustering solutions agree, corrected for chance agreement.
Adjusted Rand Index
Measures agreement between two clustering assignments, adjusted for chance. Our ARI of 0.942 between K-Means and Hierarchical means the two methods agree on cluster membership for 96% of customers (192 of 200), confirming the segments are robust and not artifacts of a single algorithm.
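A sketch of the agreement check itself: the ARI between the two label vectors plus the cross-tabulation shown in the matrix above. The fit settings (`n_init=10`, `random_state=42`) are illustrative assumptions, so off-diagonal counts may differ slightly from the reported run.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

df = pd.read_csv('Mall_Customers.csv')
X_scaled = StandardScaler().fit_transform(
    df[['Annual Income (k$)', 'Spending Score (1-100)']])

km_labels = KMeans(n_clusters=5, n_init=10,
                   random_state=42).fit_predict(X_scaled)
hc_labels = AgglomerativeClustering(n_clusters=5,
                                    linkage='ward').fit_predict(X_scaled)

print(f'ARI: {adjusted_rand_score(km_labels, hc_labels):.3f}')
# Agreement matrix: rows = K-Means clusters, columns = hierarchical clusters
print(pd.crosstab(km_labels, hc_labels,
                  rownames=['K-Means'], colnames=['Hierarchical']))
```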
Code
```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load mall customer data
df = pd.read_csv('Mall_Customers.csv')
print(df.shape)    # (200, 5)
print(df.columns)
# CustomerID, Gender, Age, Annual Income (k$), Spending Score (1-100)

# Select clustering features
X = df[['Annual Income (k$)', 'Spending Score (1-100)']]

# Standardize features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Both features now contribute equally to distance calculations
```

Takeaways
Two features were enough. Income and spending score capture the behavioral signal; adding age/gender didn't improve separation.
Cross-validating with a second algorithm (ARI = 0.942) is cheap insurance against K-Means artifacts.
StandardScaler isn't optional. Without it, income's $15k-$137k range drowns out spending score's 1-99.
The highest-value insight isn't the biggest segment, it's the Window Shoppers. High income, low spend. That's a targeting opportunity, not a lost cause.