Advanced Pattern Mining

Advanced pattern mining builds on foundational methods to handle:

Complex data types: Sequences, graphs, streams, multi-dimensional data.
Constraints: User-defined rules to focus on meaningful patterns.
Scalability: Large-scale or dynamic datasets.
Specialized patterns: Beyond simple itemsets (e.g., maximal, closed, or rare patterns).

Beyond frequent itemsets, advanced mining uncovers richer patterns:

Maximal Frequent Itemsets (MFI):
- An itemset is maximal if it’s frequent and none of its supersets are frequent.
- Example: If {A, B, C} is frequent but {A, B, C, D} isn’t, {A, B, C} is maximal.
- Benefit: Reduces redundancy (fewer patterns to store).
Closed Frequent Itemsets:
- An itemset is closed if it’s frequent and no superset has the same support.
- Example: If {A, B} has support 50% and {A, B, C} also has 50%, only {A, B, C} is closed.
- Benefit: Compact representation without losing information.
Rare or Infrequent Patterns:
- Focus on patterns with low support but high significance (e.g., fraud detection).
- Challenge: Traditional algorithms like Apriori miss these due to high min-support thresholds.
High-Utility Itemsets (HUI):
- Considers item importance (e.g., profit, cost) beyond frequency.
- Example: {diamond ring} may appear rarely but has high utility (profit).
- Algorithms: HUI-Miner, FHM.

To mine these patterns efficiently, we go beyond Apriori and FP-Growth:

PrefixSpan (Sequential Pattern Mining):
- Mines patterns in ordered sequences (e.g., customer purchase sequences: {milk} → {bread} → {butter}).
- How it works: Projects the database into smaller prefix-based subsets, avoiding candidate generation.
- Use case: Web clickstream analysis, DNA sequence mining.
CM-SPADE (Sequential Mining with Constraints):
- Extends sequential mining with user constraints (e.g., length, time gaps).
- Faster than PrefixSpan for constrained problems.
Graph Pattern Mining (e.g., gSpan):
- Mines frequent subgraphs in graph datasets (e.g., social networks, chemical compounds).
- How it works: Uses a depth-first search to explore canonical graph representations.
- Use case: Drug discovery (finding common molecular structures).
Vertical Data Mining (ECLAT):
- Uses a vertical database format (item-to-transaction mapping) instead of horizontal (transaction-to-item).
- Benefit: Faster intersection operations, memory-efficient for sparse data.

Real-world applications often need specific patterns, not just frequent ones. Constraints refine the search:

Monotonic Constraints: If a pattern satisfies it, all supersets do too (e.g., support ≥ 5%).
Anti-Monotonic Constraints: If a pattern violates it, all supersets do too (e.g., cost ≤ $100).
Succinct Constraints: Directly expressible (e.g., item = “milk”).
Example: Find rules where {milk} appears on the left-hand side with confidence > 70%.

Algorithms like Apriori can be adapted to push constraints deep into the mining process, pruning irrelevant patterns early.

Static datasets are one thing, but real-time data (e.g., IoT, stock trades) requires advanced techniques:

Incremental Mining:
- Updates patterns as new data arrives without re-mining everything.
- Example: IncSpan for sequences, UF-Growth for FP-Trees.
Stream Mining:
- Handles unbounded data with sliding windows or decay factors.
- Algorithms: Lossy Counting, Moment (for closed patterns).
- Challenge: Limited memory, one-pass processing.

Real data often has attributes beyond items (e.g., time, location, category):

Multi-Dimensional Patterns:
- Example: {milk, bread} in {summer, NYC} with support 30%.
- Approach: Treat dimensions as items or use cube-based mining (e.g., BUC algorithm).
Multi-Level Patterns:
- Mines at different abstraction levels (e.g., {milk} → {bread} vs. {dairy} → {bakery}).
- Uses concept hierarchies to generalize or specialize patterns.

Imagine an e-commerce dataset:

Basic Mining: {laptop, charger} → {mouse} (support 10%, confidence 80%).
Advanced Mining:
- Maximal: {laptop, charger, mouse} is maximal if no larger set is frequent.
- Sequential: {laptop} → {charger} → {mouse} over three days.
- High-Utility: {laptop, warranty} has low frequency but high profit.
- Constraint: Rules where total cost > $500.

Quantinium