Data Preprocessing

Data preprocessing transforms raw, messy data into a clean, structured format suitable for analysis. It’s often considered the most time-consuming part of data mining (taking up to 70-80% of the effort), but it’s essential for reliable results.


Data Cleaning

Definition: Data cleaning (or cleansing) involves detecting and correcting errors, inconsistencies, and missing values in the dataset to improve its quality.

Real-world data is rarely perfect—think typos, duplicates, or gaps from faulty sensors. Cleaning ensures the mined patterns reflect reality, not noise.

Key Tasks in Data Cleaning

  1. Handling Missing Values

    • Problem: Data points are absent (e.g., a customer’s age is blank).
    • Techniques:
      • Ignore: Skip incomplete records (if minor and random).
        • Example: Dropping 5 rows out of 10,000 where age is missing.
        • Pros: Simple. Cons: Loses data.
      • Imputation: Fill in missing values.
        • Mean/Median/Mode: Use average (numerical) or most frequent (categorical) value.
          • Example: Age missing? Use the dataset’s average age (e.g., 35).
        • Regression/KNN: Predict missing values based on other features.
          • Example: Predict age using income and location via a model.
        • Manual/Default: Assign a placeholder (e.g., “Unknown” or -1).
      • Detail: Choice depends on data type and missingness pattern (random vs. systematic).
    • Challenge: Over-imputation can bias results (e.g., too many “average” ages).
  2. Removing Noise

    • Problem: Erroneous or irrelevant data (e.g., a temperature reading of -500°C).
    • Techniques:
      • Smoothing: Average out fluctuations (e.g., moving average).
        • Example: Replace a spiked sales value with a 3-day average.
      • Outlier Detection: Identify and remove anomalies.
        • Methods: Z-score (>3 standard deviations), IQR (outside 1.5× interquartile range).
        • Example: A $10,000 purchase in a dataset of mostly $10-$100 items is flagged.
      • Detail: Noise vs. outlier depends on context—outliers might be valid (e.g., a billionaire’s purchase).
    • Challenge: Distinguishing noise from meaningful rare events.
  3. Resolving Inconsistencies

    • Problem: Data contradicts itself (e.g., “Male” in one column, “Pregnant” in another).
    • Techniques:
      • Standardization: Uniform formats (e.g., “USA” vs. “U.S.” → “United States”).
        • Example: Convert all dates to YYYY-MM-DD.
      • Validation Rules: Check against domain knowledge (e.g., age can’t be negative).
        • Example: Flag records where birth year > current year.
      • Deduplication: Remove redundant entries.
        • Example: Merge two “John Doe” records with the same ID.
    • Detail: Often requires domain expertise or external reference data (e.g., postal code lists).
  4. Correcting Errors

    • Problem: Typos or data entry mistakes (e.g., “Nwe York” instead of “New York”).
    • Techniques:
      • String Matching: Fuzzy matching to fix typos.
        • Example: “Nwe York” → “New York” using Levenshtein distance.
      • Data Profiling: Analyze patterns to spot errors (e.g., all weights should be positive).
    • Challenge: Automated fixes can introduce new errors if not validated (see the pandas sketch that follows this list).
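
Below is a minimal pandas sketch of the four tasks above (imputation, IQR-based outlier flagging, a validation rule, and fuzzy typo correction). The DataFrame, its column names, and the reference city list are illustrative assumptions, not a prescribed schema.

```python
import difflib

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [34, np.nan, 29, 41, np.nan],
    "city": ["New York", "Nwe York", "Boston", "Boston", "New York"],
    "purchase": [120.0, 95.0, 10500.0, 80.0, 110.0],
})

# 1. Missing values: impute the median age (less outlier-sensitive than the mean).
df["age"] = df["age"].fillna(df["age"].median())

# 2. Noise/outliers: flag purchases outside 1.5x the interquartile range.
q1, q3 = df["purchase"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (df["purchase"] < q1 - 1.5 * iqr) | (df["purchase"] > q3 + 1.5 * iqr)

# 3. Inconsistencies: a validation rule from domain knowledge (plausible ages only).
invalid_age = ~df["age"].between(0, 120)

# 4. Errors: fix typos by fuzzy-matching city names against a reference list.
known_cities = ["New York", "Boston"]
df["city"] = df["city"].apply(
    lambda c: (difflib.get_close_matches(c, known_cities, n=1, cutoff=0.8) or [c])[0]
)

print(df.assign(purchase_outlier=is_outlier, invalid_age=invalid_age))
```

Whether flagged outliers are dropped, capped, or kept is a domain decision; the sketch only marks them.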

Tools & Methods

  • Software: Python (Pandas, NumPy), R, SQL, or ETL tools (Talend, Informatica).
  • Metrics: Completeness (%), accuracy (error rate), consistency (mismatch rate).
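
A small sketch of how these metrics might be computed on a toy DataFrame; the columns, the validity rule, and the use of exact-duplicate share as a consistency proxy are assumptions for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [34, 34, 29, -5], "city": ["NY", "NY", "Boston", None]})

completeness = df.notna().mean()             # share of non-missing cells per column
accuracy = df["age"].between(0, 120).mean()  # share of ages passing a validity rule
consistency = 1 - df.duplicated().mean()     # share of rows that are not exact duplicates
print(completeness, accuracy, consistency, sep="\n")
```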

Example Workflow

  • Dataset: Customer records with age, name, purchase amount.
  • Issues: Missing ages, duplicate names, negative purchases.
  • Cleaning: Impute ages with median, merge duplicates by ID, set negative purchases to zero.
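
The same workflow as a short pandas sketch; the column names and toy values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

customers = pd.DataFrame({
    "customer_id": [1, 1, 2, 3],
    "name": ["John Doe", "John Doe", "Jane Roe", "Ann Lee"],
    "age": [35.0, 35.0, 42.0, np.nan],
    "purchase": [100.0, 100.0, -20.0, 55.0],
})

# Impute missing ages with the median.
customers["age"] = customers["age"].fillna(customers["age"].median())

# Merge duplicates: keep one record per customer ID.
customers = customers.drop_duplicates(subset="customer_id", keep="first")

# Replace negative purchase amounts with zero.
customers["purchase"] = customers["purchase"].clip(lower=0)

print(customers)
```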

Discretization

Definition: Discretization transforms continuous data (e.g., numerical ranges) into discrete categories or intervals, making it easier to analyze or use in certain algorithms.

Many data mining techniques (e.g., decision trees, association rules) work better with categorical data. Plus, it simplifies interpretation.

Key Types of Discretization

  1. Unsupervised Discretization

    • No prior knowledge or target variable used.
    • Techniques:
      • Equal-Width Binning:
        • What it does: Divides range into equal-sized intervals.
        • Example: Age 0-100 into 5 bins: 0-20, 21-40, 41-60, 61-80, 81-100.
        • Formula: Bin width = (max - min) / number of bins.
        • Pros: Simple, uniform. Cons: Sensitive to outliers (a single extreme value stretches the range and skews every bin).
      • Equal-Frequency Binning:
        • What it does: Divides data so each bin has roughly equal number of records.
        • Example: 100 ages into 5 bins, each with 20 values (e.g., 0-25, 26-35, …).
        • Pros: Balances data distribution. Cons: Bin ranges vary, less intuitive.
    • Detail: Fast but may not capture meaningful patterns (see the binning sketch after this list).
  2. Supervised Discretization

    • Uses a target variable to guide binning (common in predictive modeling).
    • Techniques:
      • Entropy-Based (e.g., Fayyad & Irani’s Method):
        • What it does: Splits based on minimizing entropy (information loss) relative to a class label.
        • Example: Age split at 30 and 50 if those boundaries best predict “buys product” vs. “doesn’t.”
        • Detail: Recursive partitioning, often used with decision trees (a minimal split sketch appears after the Steps in Discretization list below).
      • Chi-Merge:
        • What it does: Merges adjacent intervals if they’re statistically similar (via Chi-square test).
        • Example: Ages 20-25 and 26-30 merged if purchase behavior isn’t significantly different.
    • Pros: Tailored to the task. Cons: Computationally intensive.
  3. Custom Discretization

    • Manually defined based on domain knowledge.
    • Example: Age bins like “Child (0-12),” “Teen (13-19),” “Adult (20-64),” “Senior (65+).”
    • Detail: Intuitive but requires expertise.
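
A short pandas sketch contrasting the approaches on made-up ages: pd.cut covers equal-width and custom bins, pd.qcut covers equal-frequency. The bin counts, boundaries, and labels are illustrative assumptions.

```python
import pandas as pd

ages = pd.Series([3, 9, 15, 22, 27, 34, 41, 48, 55, 63, 71, 88])

# Equal-width: 4 bins of identical width across the observed range.
equal_width = pd.cut(ages, bins=4)

# Equal-frequency: 4 bins each holding roughly the same number of records.
equal_freq = pd.qcut(ages, q=4)

# Custom: boundaries and labels chosen from domain knowledge.
custom = pd.cut(ages, bins=[0, 12, 19, 64, 120],
                labels=["Child", "Teen", "Adult", "Senior"])

print(pd.DataFrame({"age": ages, "equal_width": equal_width,
                    "equal_freq": equal_freq, "custom": custom}))
```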

Steps in Discretization

  1. Choose Attribute: Pick the continuous variable (e.g., income).
  2. Determine Method: Equal-width, equal-frequency, or supervised.
  3. Set Number of Bins: Fixed (e.g., 5) or dynamic (algorithm-driven).
  4. Apply Transformation: Replace values with bin labels or midpoints.
    • Example: Income $50K → “Low,” $100K → “Medium.”
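
For the supervised option mentioned in step 2, here is a minimal sketch of a single entropy-based split, the core step behind Fayyad & Irani's method; the full algorithm's recursive partitioning and MDL stopping rule are omitted, and the age/label data are invented.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Cut point minimizing the weighted entropy of the two resulting intervals."""
    order = np.argsort(values)
    values = np.asarray(values)[order]
    labels = np.asarray(labels)[order]
    best_cut, best_score = None, np.inf
    for i in range(1, len(values)):
        if values[i] == values[i - 1]:
            continue  # only split between distinct attribute values
        cut = (values[i] + values[i - 1]) / 2
        left, right = labels[:i], labels[i:]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut

ages = [22, 25, 28, 31, 36, 44, 52, 58]
buys = [0, 0, 0, 1, 1, 1, 1, 0]
print(best_split(ages, buys))  # boundary that best separates buyers from non-buyers
```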

Benefits

  • Simplifies data (e.g., “High” vs. exact $87,432).
  • Reduces noise sensitivity.
  • Enables categorical algorithms (e.g., Naive Bayes).

Challenges

  • Information Loss: Binning discards precision (e.g., 49.9 vs. 50.1 lumped together).
  • Bin Choice: Too few bins oversimplify; too many retain complexity.
  • Outliers: Can distort unsupervised methods.

Example Workflow

  • Data: Salaries ranging from $20K to $200K.
  • Method: Equal-width, 4 bins.
  • Result: $20K-65K (“Low”), $66K-110K (“Medium”), $111K-155K (“High”), $156K-200K (“Very High”).
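
A quick check of this workflow with pd.cut (values in thousands of dollars; the sample salaries are invented).

```python
import pandas as pd

salaries = pd.Series([20, 35, 66, 80, 110, 125, 156, 200])  # in thousands
bands = pd.cut(salaries, bins=4, labels=["Low", "Medium", "High", "Very High"])
print(pd.DataFrame({"salary_k": salaries, "band": bands}))
# Bin width = (200 - 20) / 4 = 45, so the edges fall near 20, 65, 110, 155, 200.
```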

Cleaning vs. Discretization

  • Cleaning: Fixes errors and prepares raw data (e.g., fills missing ages).
  • Discretization: Transforms clean continuous data into categories (e.g., bins ages into groups).
  • Order: Cleaning comes first—discretizing dirty data risks amplifying errors.