Density-Based Methods
Density-based clustering identifies clusters as regions of high point density separated by areas of low density. Points in sparse regions are typically labeled as noise or outliers.
- Core Idea: A cluster grows as long as there are enough nearby points (high density) within a specified radius.
- Key Advantage: No need to specify the number of clusters (( k )) upfront, unlike K-Means.
Core Concepts
-
Core Point:
- A point with at least ( MinPts ) other points (including itself) within a distance ( \epsilon ) (epsilon).
- These points form the “heart” of a cluster.
-
Border Point:
- A point within ( \epsilon ) of a core point but with fewer than ( MinPts ) neighbors.
- These are on the edge of a cluster.
-
Noise Point:
- A point that’s neither a core nor a border point (isolated, low-density).
- Treated as outliers.
-
Density-Reachability:
- A point ( q ) is density-reachable from ( p ) if there’s a chain of core points connecting them, each within ( \epsilon ).
-
Density-Connected:
- Two points are density-connected if there’s a core point from which both are density-reachable.
- This defines a cluster.
-
Parameters:
- ( \epsilon ): Radius of the neighborhood (distance threshold).
- ( MinPts ): Minimum number of points to form a dense region.
Key Algorithm: DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
DBSCAN is the most popular density-based method. It’s intuitive and widely used.
How It Works
- Input: Dataset, ( \epsilon ), ( MinPts ).
- Start: Pick an unvisited point.
- Core Check: If it has ( \geq MinPts ) neighbors within ( \epsilon ), it’s a core point; start a new cluster.
- Expand Cluster: Add all density-reachable points (core and border) to the cluster.
- Repeat: Move to the next unvisited point until all points are processed.
- Output: Clusters and noise points.
Example
Dataset: Points A(1, 1), B(1, 2), C(2, 1), D(5, 5), E(6, 6), F(10, 10).
Parameters: ( \epsilon = 1.5 ), ( MinPts = 3 ), Euclidean distance.
- A(1, 1): Neighbors = {B(1, 2), C(2, 1)} (distance ≈ 1) → 3 points (including A) → Core point.
- B(1, 2): Neighbors = {A, C} → Core point.
- C(2, 1): Neighbors = {A, B} → Core point.
- D(5, 5): Neighbor = {E(6, 6)} (distance ≈ 1.41) → 2 points < ( MinPts ) → Not core.
- E(6, 6): Neighbor = {D} → Not core.
- F(10, 10): No neighbors → Noise.
Result:
- Cluster 1: {A, B, C} (density-connected core points).
- Noise: {D, E, F} (not dense enough).
Pros and Cons
- Pros:
- Finds arbitrary shapes (e.g., rings, crescents).
- Handles noise/outliers naturally.
- No need to specify ( k ).
- Cons:
- Sensitive to ( \epsilon ) and ( MinPts ) (hard to tune).
- Struggles with varying density across clusters.
- O(n²) time complexity without indexing (O(n log n) with spatial indexing).
Other Density-Based Methods
-
OPTICS (Ordering Points To Identify the Clustering Structure):
- Extends DBSCAN to handle varying densities.
- How it works:
- Orders points by core distance (distance to nearest ( MinPts )-th neighbor) and reachability distance.
- Produces a hierarchical structure (like a dendrogram).
- Pros: More flexible than DBSCAN, extracts clusters at multiple density levels.
- Cons: Complex output, slower computation.
-
DENCLUE (DENsity CLUstering):
- Uses a density function (e.g., Gaussian kernel) to model data density.
- Clusters are peaks in the density landscape.
- Pros: Smooth clusters, handles noise well.
- Cons: Requires tuning kernel parameters, computationally intensive.
-
HDBSCAN (Hierarchical DBSCAN):
- Combines DBSCAN with hierarchical clustering.
- How it works: Builds a hierarchy of DBSCAN clusters over varying ( \epsilon ), selects stable clusters.
- Pros: Robust to varying density, no single ( \epsilon ) choice.
- Cons: More complex, slower than DBSCAN.
Example Application
- Customer Segmentation: Cluster shopping locations {latitude, longitude}. Dense urban areas form clusters; rural outliers are noise.
- Anomaly Detection: In network traffic, dense patterns are normal; sparse points are potential threats.
Comparison with Other Methods
Feature | Density-Based (DBSCAN) | Partitioning (K-Means) | Hierarchical (Agglomerative) |
---|---|---|---|
Cluster Shape | Arbitrary | Spherical | Depends on linkage |
Noise Handling | Excellent (explicit) | Poor | Poor (merges all) |
# Clusters | Automatic | Must specify ( k ) | Flexible (cut dendrogram) |
Scalability | Moderate (with indexing) | High | Low (O(n³)) |
Visual Intuition
Imagine points on a map:
- DBSCAN: Groups dense neighborhoods into clusters, leaves isolated houses as noise.
- K-Means: Forces everything into ( k ) circular regions, even outliers.
- Hierarchical: Builds a tree, merging or splitting based on proximity.