1. Genetic Algorithm
A genetic algorithm (GA) is a metaheuristic search method that mimics Darwinian evolution:
- Population: Set of candidate solutions (chromosomes).
- Fitness Function: Evaluates solution quality.
- Selection: Favors high-fitness individuals.
- Crossover & Mutation: Generates new solutions.
2. Key Components of Genetic Algorithms
| Component | Description |
|---|---|
| Chromosome | Encoded solution (binary, real-valued, or permutation). |
| Population | Group of chromosomes (e.g., 50–100 candidates). |
| Fitness Function | Measures solution quality (e.g., accuracy, RMSE). |
| Selection | Chooses parents for reproduction (e.g., roulette wheel, tournament). |
| Crossover | Combines parent genes to produce offspring (e.g., single-point, uniform). |
| Mutation | Introduces random changes to maintain diversity (e.g., bit flip, Gaussian). |
| Termination | Stops the run (e.g., after a maximum number of generations or when fitness stops improving). |
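As a minimal illustration of three of these components, here is how tournament selection, single-point crossover, and bit-flip mutation can be written for a binary encoding (a sketch only; the tournament size and mutation rate are arbitrary assumptions):

```python
import random

def tournament_select(population, fitnesses, k=3):
    """Pick the fittest of k randomly chosen individuals."""
    contenders = random.sample(range(len(population)), k)
    best = max(contenders, key=lambda i: fitnesses[i])
    return population[best]

def single_point_crossover(parent_a, parent_b):
    """Swap the tails of two parents at a random cut point."""
    cut = random.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def bit_flip_mutation(chromosome, rate=0.05):
    """Flip each bit independently with probability `rate`."""
    return [1 - g if random.random() < rate else g for g in chromosome]
```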
3. How Genetic Algorithms Work in Data Mining
Step-by-Step Workflow
1. Initialization: Generate a random population of solutions (e.g., random feature subsets).
2. Fitness Evaluation: Compute fitness for each candidate (e.g., model accuracy using the selected features).
3. Selection: Retain top-performing chromosomes (e.g., rank-based selection).
4. Crossover: Combine parents to create offspring (e.g., swap parts of two feature subsets).
5. Mutation: Randomly alter genes (e.g., add or drop a feature).
6. Termination: Repeat steps 2–5 until convergence (e.g., 100 generations); the sketch below ties these steps together.
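A compact, self-contained sketch of this loop (illustrative only: the binary encoding, population size of 20, 3-way tournament, and the toy "one-max" fitness are assumptions chosen for brevity):

```python
import random

def run_ga(fitness, chrom_len=10, pop_size=20, generations=50, mut_rate=0.05):
    # 1. Initialization: random binary chromosomes
    pop = [[random.randint(0, 1) for _ in range(chrom_len)] for _ in range(pop_size)]
    for _ in range(generations):
        # 2. Fitness evaluation
        scores = [fitness(c) for c in pop]
        # 3. Selection: 3-way tournament
        def pick():
            return max(random.sample(list(zip(pop, scores)), 3), key=lambda t: t[1])[0]
        next_pop = []
        while len(next_pop) < pop_size:
            a, b = pick(), pick()
            # 4. Crossover: single point
            cut = random.randint(1, chrom_len - 1)
            child = a[:cut] + b[cut:]
            # 5. Mutation: bit flip
            child = [1 - g if random.random() < mut_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    # 6. Termination: return the best chromosome in the final population
    return max(pop, key=fitness)

# Toy usage: maximize the number of 1-bits ("one-max")
print(run_ga(fitness=sum))
```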
4. Applications in Data Mining
A. Feature Selection
- Problem: High-dimensional data reduces model efficiency.
- GA Approach:
- Chromosome: Binary string (1 = feature selected, 0 = excluded).
- Fitness: Classification accuracy (e.g., SVM with selected features).
- Example:
```python
# Chromosome: [1, 0, 1, 1] → features 1, 3, and 4 selected
mask = [bool(g) for g in chromosome]  # boolean mask avoids accidental integer indexing
model.fit(X_train[:, mask], y_train)
fitness = accuracy_score(y_test, model.predict(X_test[:, mask]))
```
B. Clustering
- Problem: Optimal cluster number/assignment is NP-hard.
- GA Approach:
- Chromosome: Encodes cluster centroids (e.g., [μ₁, μ₂, μ₃]).
- Fitness: Negated within-cluster variance (lower variance → higher fitness).
- Example:
```python
from sklearn.cluster import KMeans

def fitness(chromosome):
    # Chromosome: 3 centroids in 2-D space, flattened to a length-6 vector
    kmeans = KMeans(n_clusters=3, init=chromosome.reshape(3, 2), n_init=1)
    kmeans.fit(X)              # inertia_ is only defined after fitting
    return -kmeans.inertia_    # negate so lower variance → higher fitness
```
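A brief usage sketch (the toy 2-D data and uniform random centroid initialization are assumptions):

```python
import numpy as np

X = np.random.rand(100, 2)      # toy 2-D data in the unit square
chromosome = np.random.rand(6)  # 3 centroids × 2 coordinates, flattened
print(fitness(chromosome))      # higher (less negative) is better
```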
C. Classification Rule Discovery
- Problem: Extract interpretable “IF-THEN” rules.
- GA Approach:
- Chromosome: Encodes rule conditions (e.g., IF Age>30 AND Income<50k THEN Churn=Yes).
- Fitness: Rule coverage + accuracy.
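A minimal sketch of how such a rule could be encoded and scored (the column names Age, Income, and Churn, and the simple coverage-plus-accuracy scoring, are illustrative assumptions):

```python
import pandas as pd

# Chromosome: (age_threshold, income_threshold) → IF Age>a AND Income<b THEN Churn=Yes
def rule_fitness(chromosome, df):
    age_thr, income_thr = chromosome
    covered = df[(df["Age"] > age_thr) & (df["Income"] < income_thr)]
    if covered.empty:
        return 0.0
    coverage = len(covered) / len(df)              # fraction of records the rule matches
    accuracy = (covered["Churn"] == "Yes").mean()  # precision of the rule's prediction
    return coverage + accuracy

df = pd.DataFrame({"Age": [25, 40, 52, 33],
                   "Income": [60_000, 30_000, 45_000, 20_000],
                   "Churn": ["No", "Yes", "Yes", "Yes"]})
print(rule_fitness((30, 50_000), df))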
D. Hyperparameter Tuning
- Problem: Grid search is computationally expensive.
- GA Approach:
- Chromosome: Encodes hyperparameters (e.g., [learning_rate=0.01, depth=5]).
- Fitness: Cross-validation score.
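A hedged sketch of such a fitness function for hyperparameter chromosomes (the gradient-boosting model, toy dataset, and parameter ranges are assumptions; any estimator with a learning rate and depth would do):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, random_state=0)

# Chromosome: [learning_rate, max_depth]
def hp_fitness(chromosome):
    lr, depth = chromosome
    model = GradientBoostingClassifier(learning_rate=lr, max_depth=int(depth))
    return cross_val_score(model, X, y, cv=3).mean()  # mean CV accuracy

print(hp_fitness([0.01, 5]))
```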
5. Advantages of Genetic Algorithms
| Advantage | Why It Matters |
|---|---|
| Global Optimization | Avoids local optima (vs. gradient descent). |
| Handles Non-Differentiable Problems | Works where calculus-based methods fail (e.g., feature selection). |
| Parallelizable | Evaluates multiple solutions simultaneously. |
| Interpretability | Generates human-readable rules (e.g., in classification). |
6. Challenges & Solutions
| Challenge | Solution |
|---|---|
| High Computational Cost | Use elitism (preserve top solutions); limit generations. |
| Premature Convergence | Increase the mutation rate; use niching techniques. |
| Parameter Tuning | Adaptive GAs (e.g., self-adjusting mutation rates). |
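For example, elitism can be bolted onto a plain GA loop in a few lines (a sketch; the elite count of 2 is an arbitrary assumption):

```python
def with_elitism(pop, scores, next_pop, n_elite=2):
    # Carry the n_elite best chromosomes into the next generation unchanged
    ranked = sorted(zip(pop, scores), key=lambda t: t[1], reverse=True)
    elite = [c for c, _ in ranked[:n_elite]]
    return elite + next_pop[: len(next_pop) - n_elite]
```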
7. Example: Feature Selection with GA
Python Implementation (using DEAP)
```python
from deap import base, creator, tools, algorithms
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Toy dataset with 10 features (replace with your own X, y)
X, y = make_classification(n_samples=200, n_features=10, random_state=42)

# 1. Define fitness and chromosome structure
creator.create("FitnessMax", base.Fitness, weights=(1.0,))
creator.create("Individual", list, fitness=creator.FitnessMax)

toolbox = base.Toolbox()
toolbox.register("attr_bool", np.random.randint, 0, 2)  # Binary encoding
toolbox.register("individual", tools.initRepeat, creator.Individual, toolbox.attr_bool, n=10)  # 10 features
toolbox.register("population", tools.initRepeat, list, toolbox.individual)

# 2. Fitness function (mean cross-validated SVM accuracy)
def eval_feature_subset(individual):
    selected_features = [i for i, val in enumerate(individual) if val == 1]
    if not selected_features:
        return (0.0,)  # Empty subsets score zero
    X_subset = X[:, selected_features]
    return (cross_val_score(SVC(), X_subset, y, cv=5).mean(),)

toolbox.register("evaluate", eval_feature_subset)
toolbox.register("mate", tools.cxTwoPoint)
toolbox.register("mutate", tools.mutFlipBit, indpb=0.1)
toolbox.register("select", tools.selTournament, tournsize=3)

# 3. Evolve the population
population = toolbox.population(n=50)
algorithms.eaSimple(population, toolbox, cxpb=0.5, mutpb=0.2, ngen=10, verbose=True)

# 4. Extract best solution
best_individual = tools.selBest(population, k=1)[0]
print("Selected Features:", [i for i, val in enumerate(best_individual) if val == 1])
```
8. Key Takeaways
- GAs are versatile for optimization in data mining (feature selection, clustering, etc.).
- Evolutionary operators (crossover, mutation) balance exploration and exploitation.
- Combine with ML models (e.g., SVM, K-means) for enhanced performance.
- Use dedicated libraries such as DEAP or TPOT; scikit-learn itself does not ship a GA optimizer, though third-party packages (e.g., sklearn-genetic-opt) add one.