Support Vector Machines
Support Vector Machines (SVMs) are powerful supervised learning models used for classification and regression tasks in data mining. They are particularly effective in high-dimensional spaces and on problems where the classes are separated by a clear margin.
1. Core Concepts of SVM
A. What is an SVM?
An SVM is a discriminative classifier that finds the optimal hyperplane separating data points of different classes with the maximum margin.
B. Key Terminology
- Hyperplane: Decision boundary (e.g., a line in 2D, plane in 3D).
- Support Vectors: Data points closest to the hyperplane (critical for margin).
- Margin: Distance between the hyperplane and the nearest data points.
- Kernel Trick: Implicitly maps data into a higher-dimensional space so that classes that are not linearly separable in the original space can be separated by a hyperplane.
2. How SVM Works
A. Linear SVM (Hard Margin)
- Goal: Find a hyperplane that perfectly separates classes.
- Mathematical Formulation:
$$
w \cdot x + b = 0
$$
  where:
  - $w$ = weight vector.
  - $b$ = bias term.
- Optimization Objective:
$$
\text{Minimize } \frac{1}{2} \|w\|^2 \quad \text{subject to } y_i (w \cdot x_i + b) \geq 1 \;\; \forall i
$$
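Why minimizing $\|w\|^2$ maximizes the margin (a standard one-step derivation, added here for completeness): the support vectors on each side satisfy $y_i (w \cdot x_i + b) = 1$, and the distance from such a point to the hyperplane is $1/\|w\|$, so

$$
\text{margin} = \frac{2}{\|w\|}
$$

Maximizing $2/\|w\|$ is therefore equivalent to minimizing $\|w\|$, and the squared form $\frac{1}{2}\|w\|^2$ is used only because it is smooth and convenient to optimize.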
B. Soft Margin SVM (Handling Overlapping Classes)
- Problem: Data may not be perfectly separable.
- Solution: Introduce slack variables $\xi_i$ that allow some points to fall inside the margin or be misclassified.
- Optimization:
$$
\text{Minimize } \frac{1}{2} \|w\|^2 + C \sum_i \xi_i \quad \text{subject to } y_i (w \cdot x_i + b) \geq 1 - \xi_i,\; \xi_i \geq 0
$$
- $C$: Penalty parameter (larger $C$ → violations penalized more heavily, hence a stricter, narrower margin); see the sketch below.
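To make the role of $C$ concrete, here is a small sketch (my own illustration, not part of the original notes) that fits a linear soft-margin SVM with a small and a large $C$ on synthetic, overlapping data and compares the number of support vectors; the dataset and the two $C$ values are arbitrary choices.

```python
# Sketch: effect of the penalty parameter C on a linear soft-margin SVM.
# Synthetic data and the two C values are illustrative choices only.
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Two overlapping clusters, so a hard margin is impossible.
X, y = make_blobs(n_samples=200, centers=2, cluster_std=2.5, random_state=0)

for C in (0.01, 100):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(f"C={C:>6}: {clf.n_support_.sum()} support vectors, "
          f"training accuracy {clf.score(X, y):.2f}")
```

A small $C$ tolerates many margin violations (many support vectors, wider margin); a large $C$ penalizes them heavily and ends up with fewer support vectors and a narrower margin.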
C. Non-Linear SVM (Kernel Trick)
- Problem: Data may not be linearly separable.
- Solution: Use kernel functions to implicitly map data into a higher-dimensional space, without ever computing the mapping explicitly.
- Common Kernels:
| Kernel | Formula | Use Case |
|---|---|---|
| Linear | $K(x_i, x_j) = x_i \cdot x_j$ | Linearly separable data |
| Polynomial | $K(x_i, x_j) = (x_i \cdot x_j + c)^d$ | Moderate non-linearity |
| RBF (Gaussian) | $K(x_i, x_j) = e^{-\gamma \|x_i - x_j\|^2}$ | Highly non-linear data |
| Sigmoid | $K(x_i, x_j) = \tanh(\alpha\, x_i \cdot x_j + c)$ | Neural network-like models |
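As a quick illustration of the kernel trick (again a sketch on synthetic data, not from the original notes), a linear kernel struggles on scikit-learn's two-moons dataset while an RBF kernel separates it well:

```python
# Sketch: linear vs. RBF kernel on data that is not linearly separable.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ("linear", "rbf"):
    clf = SVC(kernel=kernel, gamma="scale").fit(X_train, y_train)
    print(f"{kernel:>6} kernel: test accuracy {clf.score(X_test, y_test):.2f}")
```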
3. Applications in Data Mining
A. Text Classification
- Example: Spam detection.
- Why SVM? Handles sparse, high-dimensional text features (e.g., TF-IDF vectors) well; see the sketch below.
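A minimal sketch of the usual text-classification pipeline, assuming TF-IDF features and scikit-learn's LinearSVC; the four example messages and their labels are entirely made up for illustration:

```python
# Sketch: spam-style text classification with TF-IDF features and a linear SVM.
# The example messages and labels are hypothetical.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = [
    "Win a free prize now", "Cheap loans, click here",      # spam
    "Meeting moved to 3pm", "Here are the project notes",   # ham
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = ham

clf = make_pipeline(TfidfVectorizer(), LinearSVC())
clf.fit(texts, labels)
print(clf.predict(["Click here for a free prize"]))  # expected: [1]
```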
B. Image Recognition
- Example: Handwritten digit classification (MNIST).
- Why SVM? Effective with feature extraction (e.g., HOG, SIFT).
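A rough stand-in sketch: instead of full MNIST and HOG/SIFT features, it uses scikit-learn's small built-in digits dataset (8×8 images) with raw pixel values, which is enough to show the idea:

```python
# Sketch: digit classification on scikit-learn's small 8x8 digits dataset.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

digits = load_digits()                         # 1797 grayscale 8x8 digit images
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.3, random_state=0)

clf = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
print(f"Test accuracy: {clf.score(X_test, y_test):.2f}")   # usually well above 0.95
```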
C. Bioinformatics
- Example: Cancer classification from gene expression data.
- Why SVM? Works well with small sample sizes and high dimensions.
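A sketch of the small-sample, high-dimension setting, using synthetic data generated by make_classification in place of real gene-expression measurements; a linear kernel is assumed since features far outnumber samples:

```python
# Sketch: many features, few samples (as in gene-expression data); data is synthetic.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# 80 "patients", 2000 "genes", only a handful of which carry signal.
X, y = make_classification(n_samples=80, n_features=2000, n_informative=20,
                           random_state=0)

clf = SVC(kernel="linear", C=1.0)
scores = cross_val_score(clf, X, y, cv=5)
print(f"Cross-validated accuracy: {scores.mean():.2f}")
```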
D. Anomaly Detection
- Example: Fraud detection.
- Why SVM? One-class SVM can model normal behavior.
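A minimal one-class SVM sketch: the detector is trained only on synthetic "normal" points and flags anything it considers different with the label -1; the data and the nu value are illustrative choices.

```python
# Sketch: one-class SVM trained on "normal" points only; -1 marks predicted anomalies.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(500, 2))   # synthetic "normal" behavior
oddball = np.array([[6.0, 6.0]])                          # an obviously unusual point

detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale").fit(normal)
print(detector.predict(oddball))      # expected: [-1]  (flagged as anomaly)
print(detector.predict(normal[:5]))   # mostly [1 1 1 1 1]  (treated as normal)
```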
4. Advantages & Disadvantages
| Advantages | Disadvantages |
|---|---|
| Effective in high-dimensional spaces | Computationally expensive for large datasets |
| Robust to overfitting (with a suitable $C$) | Requires careful kernel selection |
| Works well with small datasets | Black-box model (hard to interpret) |
5. SVM vs. Other Classifiers
| Classifier | When to Use | Comparison with SVM |
|---|---|---|
| Logistic Regression | Simple linear problems | SVM better for clear margin separation |
| Decision Trees | Interpretability needed | SVM better for high-dimensional data |
| k-NN | Lazy learning, small datasets | SVM more efficient for large feature sets |
6. Practical Example: SVM in Python
```python
from sklearn import svm
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Load data
iris = load_iris()
X, y = iris.data, iris.target

# Split data (70% train / 30% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Train an RBF-kernel SVM
model = svm.SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X_train, y_train)

# Evaluate on the held-out test set
accuracy = model.score(X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")
```
7. Parameter Tuning in SVM
A. Key Parameters
- $C$: Controls the trade-off between a wide margin and misclassified training points.
  - Low $C$ → wider margin, more training errors tolerated (risk of underfitting).
  - High $C$ → narrower margin, fewer training errors (risk of overfitting).
- $\gamma$ (RBF kernel): Controls how far the influence of a single training point reaches (see the sketch below).
  - Low $\gamma$ → far-reaching influence (smoother decision boundaries).
  - High $\gamma$ → short-range influence (more complex boundaries that can overfit).
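The sketch below (my own illustration on the two-moons toy dataset, not from the original notes) makes the $\gamma$ trade-off visible: with a very large $\gamma$ the model typically fits the training set almost perfectly but does worse on the test set than with a moderate $\gamma$.

```python
# Sketch: effect of gamma on an RBF SVM, compared via train vs. test accuracy.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for gamma in (0.1, 1, 100):
    clf = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X_train, y_train)
    print(f"gamma={gamma:>5}: train {clf.score(X_train, y_train):.2f}, "
          f"test {clf.score(X_test, y_test):.2f}")
```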
B. Tuning with Grid Search
```python
from sklearn.model_selection import GridSearchCV

params = {
    'C': [0.1, 1, 10],
    'gamma': [0.1, 1, 'scale']
}
grid = GridSearchCV(svm.SVC(kernel='rbf'), params, cv=5)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
```
8. Key Takeaways
- SVM maximizes margin for robust classification.
- Kernel trick enables non-linear decision boundaries.
- Critical parameters: $C$, kernel type, $\gamma$.
- Best for: High-dimensional data, small-to-medium datasets.