Sequential Pattern mining

Sequential pattern mining finds frequent subsequences in a dataset of sequences. A sequence is an ordered list of events or itemsets, often tied to time or position.

Example: In a customer shopping dataset:
- Sequence: <{milk}, {bread}, {butter}> means a customer bought milk, then bread, then butter over time.
- Goal: Find patterns like <{milk}, {bread}> that appear frequently across many customers.
Applications:
- Web navigation: <homepage, product page, checkout>.
- Bioinformatics: DNA sequences (e.g., <A, C, G, T>).
- Retail: Purchase sequences over months.

Key Concepts

Sequence:
- An ordered list of itemsets (events).
- Notation: <{a}, {b, c}, {d}> means “a happens, then b and c together, then d.”
- Itemsets within a step (e.g., {b, c}) are unordered, but the steps are ordered.
Subsequence:
- A sequence S1 is a subsequence of S2 if all items of S1 appear in S2 in the same order (not necessarily consecutive).
- Example: <{a}, {c}> is a subsequence of <{a}, {b}, {c, d}>.
Support:
- The percentage of sequences in the dataset that contain a given subsequence.
- Formula: Support(S) = (Number of sequences containing S) / (Total sequences).
- Example: If <{milk}, {bread}> appears in 3 out of 5 customer sequences, support = 3/5 = 60%.
Frequent Sequential Pattern:
- A subsequence with support ≥ a user-defined minimum support threshold (min_sup).
- Example: If min_sup = 50%, <{milk}, {bread}> (60%) is frequent.
Constraints (optional):
- Time gaps: Events must occur within a certain time window (e.g., 1 day).
- Length: Limit pattern length (e.g., max 3 steps).
- Item constraints: Include specific items (e.g., must contain {milk}).

Example Dataset

Let’s use a simple dataset of customer purchase sequences:

Customer ID	Sequence
C1	<{milk}, {bread}, {butter}>
C2	<{milk}, {bread}>
C3	<{beer}, {milk}, {butter}>
C4	<{milk}, {bread}, {beer}>
C5	<{bread}, {butter}>

Total sequences: 5.
Min_sup: 40% (i.e., 2/5 = 2 sequences).

Frequent Patterns:

<{milk}>: 4/5 = 80% ✓
<{bread}>: 4/5 = 80% ✓
<{butter}>: 3/5 = 60% ✓
<{milk}, {bread}>: 3/5 = 60% ✓
<{bread}, {butter}>: 2/5 = 40% ✓
<{milk}, {butter}>: 2/5 = 40% ✓

Key Algorithms

GSP (Generalized Sequential Patterns):
- How it works:
  - Similar to Apriori: Starts with frequent 1-sequences, then builds longer candidates.
  - Uses a level-wise approach, scanning the database multiple times.
- Steps:
  1. Find frequent 1-sequences (e.g., <{milk}>).
  2. Generate candidate 2-sequences (e.g., <{milk}, {bread}>) and check support.
  3. Repeat until no more frequent sequences.
- Pros: Simple, supports constraints like time gaps.
- Cons: Multiple scans, slow for large datasets.
PrefixSpan (Prefix-Projected Sequential Pattern Mining):
- How it works:
  - Projects the database into smaller subsets based on prefixes, avoiding candidate generation.
  - Grows patterns recursively.
- Steps (for above dataset):
  1. Start with <{milk}> (support 4):
    - Project database: Look at what follows <{milk}> in C1, C2, C3, C4.
    - Sub-sequences: <{bread}> (3), <{butter}> (2).
  2. Extend: <{milk}, {bread}> (support 3), <{milk}, {butter}> (support 2).
  3. Repeat for other prefixes (e.g., <{bread}>).
- Pros: Faster, single-pass after projection, memory-efficient.
- Cons: Complex to implement.
SPADE (Sequential PAttern Discovery using Equivalence classes):
- How it works:
  - Uses a vertical database format (item-to-sequence ID mapping).
  - Finds patterns via intersection of ID lists.
- Example: <{milk}> in [C1, C2, C3, C4], <{bread}> in [C1, C2, C4, C5]; intersect to find <{milk}, {bread}> in [C1, C2, C4].
- Pros: Efficient for dense data, single-pass possible.
- Cons: Memory overhead for sparse data.

Advanced Concepts

Closed Sequential Patterns:
- A sequence is closed if no super-sequence has the same support.
- Example: If <{milk}, {bread}> and <{milk}, {bread}, {butter}> both have support 3, only the latter is closed.
- Benefit: Reduces output size.
Maximal Sequential Patterns:
- A sequence is maximal if no super-sequence is frequent.
- Example: <{milk}, {bread}, {butter}> might be maximal if <{milk}, {bread}, {butter}, {beer}> isn’t frequent.
Time Constraints:
- Min_gap/Max_gap: Events must occur within a time window.
- Example: <{milk}, {bread}> only counts if bread is bought within 7 days of milk.
Sliding Windows:
- For streaming data, patterns are mined over a fixed recent window (e.g., last 100 transactions).

Quantinium

Explorer

Sequential Pattern mining

Key Concepts

Example Dataset

Key Algorithms

Advanced Concepts

Table of Contents