
Data Hygiene

Clean data is the foundation of effective AI systems. This guide covers essential data cleaning protocols for RAG (Retrieval-Augmented Generation) and other AI applications.

The Golden Rules

Follow these non-negotiable rules for data cleanliness:

1. No Merged Cells

Merged cells break parsing, indexing, and machine readability.

Bad Practice                          Good Practice
Merged cells spanning rows/columns    One value per cell
Hidden data in merged regions         Explicit data in each cell

2. ISO 8601 Dates

Always use the international standard for dates:

Format: YYYY-MM-DDTHH:MM:SSZ

Examples:
- 2024-01-15
- 2024-01-15T14:30:00Z
- 2024-01-15T14:30:00-05:00

Never use:

  • 01/15/2024 (ambiguous)
  • January 15, 2024 (text format)
  • 15-01-24 (unclear century)
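Legacy dates can be converted to ISO 8601 programmatically. A minimal sketch using Python's standard `datetime` module, where each legacy format must be declared explicitly rather than guessed:

```python
from datetime import datetime, timezone

def to_iso8601(raw: str, fmt: str) -> str:
    """Parse a date string with a known legacy format and emit ISO 8601 (UTC)."""
    dt = datetime.strptime(raw, fmt)
    return dt.replace(tzinfo=timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")

# Declaring the source format avoids the 01/15 vs. 15/01 ambiguity.
print(to_iso8601("01/15/2024", "%m/%d/%Y"))         # 2024-01-15T00:00:00Z
print(to_iso8601("January 15, 2024", "%B %d, %Y"))  # 2024-01-15T00:00:00Z
```

If the source format is genuinely unknown, flag the value for manual review instead of guessing.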

3. Consistent Naming

Establish naming conventions and enforce them rigorously.

Column Names:

Good: customer_id, created_at, total_amount
Bad: Customer ID, CreatedAt, Total $ Amount

Category Values:

Good: "active", "inactive", "pending"
Bad: "Active", "ACTIVE", "active ", "Active "
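Case and whitespace variants like these collapse to one canonical form with a trivial normalizer, sketched here:

```python
def normalize_category(value: str) -> str:
    """Collapse case and surrounding-whitespace variants to one canonical form."""
    return value.strip().lower()

variants = ["Active", "ACTIVE", "active ", "Active "]
# All four variants normalize to the single value "active".
assert {normalize_category(v) for v in variants} == {"active"}
```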

Data Cleaning Checklist

Use this checklist before feeding data into any AI system:

Structure

  • No merged cells
  • Headers in first row only
  • One data type per column
  • No empty rows within data
  • No hidden rows or columns

Formatting

  • Dates in ISO 8601 format
  • Numbers without formatting (no $, %, commas)
  • Text trimmed of leading/trailing whitespace
  • Consistent capitalization
  • No special characters in identifiers

Content

  • No duplicate records
  • Missing values handled (null, not blank)
  • Outliers reviewed and documented
  • Categories standardized
  • References validated

Common Data Issues

Issue: Inconsistent Nulls

Problem: Multiple representations of "no data"

Empty string: ""
Text: "N/A", "null", "none", "-"
Numbers: 0, -1, 9999

Solution: Standardize on explicit NULL values

UPDATE table SET field = NULL 
WHERE field IN ('N/A', 'null', 'none', '-', '');
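The same standardization in pandas, as a hedged sketch (the token list mirrors the SQL above; extend it with whatever null markers your data actually uses):

```python
import pandas as pd

NULL_TOKENS = ["N/A", "null", "none", "-", ""]

df = pd.DataFrame({"field": ["ok", "N/A", "", "none", "value"]})
# Replace every null-marker token with an explicit missing value.
df["field"] = df["field"].replace(NULL_TOKENS, pd.NA)
print(df["field"].isna().sum())  # 3
```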

Issue: Trailing Whitespace

Problem: "Active " != "Active"

Solution: Trim all text fields

df['status'] = df['status'].str.strip()

Issue: Mixed Data Types

Problem: Column contains both numbers and text

Solution: Separate into appropriate columns or convert consistently
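One common conversion approach, sketched with pandas: coerce the column to numeric so that unparseable entries become missing values you can review rather than silent corruption.

```python
import pandas as pd

mixed = pd.Series(["42", "17.5", "unknown", "99"])
# errors="coerce" turns unparseable entries into NaN instead of raising.
numeric = pd.to_numeric(mixed, errors="coerce")
print(numeric.isna().sum())  # 1 -- the "unknown" entry, flagged for review
```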

RAG-Specific Considerations

When preparing data for Retrieval-Augmented Generation:

Chunking Strategy

  • Keep semantic units together
  • Maintain context with overlapping chunks
  • Include metadata with each chunk
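A minimal sketch of overlapping chunking, using fixed character windows for illustration (production chunkers usually split on semantic boundaries such as paragraphs or headings; sizes here are arbitrary):

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Fixed-size chunks with overlap so context carries across boundaries."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

chunks = chunk_text("x" * 1200, size=500, overlap=100)
print(len(chunks))  # 3
```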

Metadata Requirements

Each document chunk should include:

  • Source document ID
  • Section/page reference
  • Creation/update timestamp
  • Author/owner attribution
  • Classification/category tags
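These requirements can be enforced with a small schema. A sketch using a dataclass, with illustrative field names rather than a fixed standard:

```python
from dataclasses import dataclass, field, asdict

@dataclass
class ChunkMetadata:
    # Field names are illustrative -- adapt them to your pipeline's schema.
    source_doc_id: str
    section_ref: str
    updated_at: str              # ISO 8601 timestamp
    owner: str
    tags: list[str] = field(default_factory=list)

meta = ChunkMetadata("doc-42", "p.3", "2024-01-15T00:00:00Z", "ops-team", ["policy"])
print(asdict(meta)["source_doc_id"])  # doc-42
```

Because every field is required (except tags), a chunk missing its source or timestamp fails at construction time rather than at retrieval time.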

Quality Signals

Tag content with quality indicators:

  • Verified vs. unverified
  • Official vs. user-generated
  • Current vs. archived

Real-World Examples

Example 1: CRM Data Cleanup

Before cleaning:

Name: john smith
Email: JOHN.SMITH@COMPANY.COM
Phone: (555) 123-4567
Status: active
Created: Jan 15, 2024

After cleaning:

name: John Smith
email: john.smith@company.com
phone: 5551234567
status: active
created_at: 2024-01-15T00:00:00Z

Changes made:

  • Standardized capitalization (Title Case for names)
  • Normalized email to lowercase
  • Removed formatting from phone numbers
  • Converted date to ISO 8601 format
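The four changes above can be scripted. A hedged sketch (note that `.title()` mishandles names like "McDonald", so real pipelines need an exception list):

```python
import re
from datetime import datetime, timezone

record = {
    "Name": "john smith",
    "Email": "JOHN.SMITH@COMPANY.COM",
    "Phone": "(555) 123-4567",
    "Created": "Jan 15, 2024",
}

cleaned = {
    "name": record["Name"].strip().title(),          # caveat: breaks on "McDonald"
    "email": record["Email"].strip().lower(),
    "phone": re.sub(r"\D", "", record["Phone"]),     # strip all non-digits
    "created_at": datetime.strptime(record["Created"], "%b %d, %Y")
        .replace(tzinfo=timezone.utc)
        .strftime("%Y-%m-%dT%H:%M:%SZ"),
}
print(cleaned["phone"])       # 5551234567
print(cleaned["created_at"])  # 2024-01-15T00:00:00Z
```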

Example 2: Product Catalog Normalization

Before:

SKU       Price        Category
ABC-001   $29.99       Electronics
abc-002   34.99        electronics
ABC-003   $39.99 USD   ELECTRONICS

After:

sku       price_cents   currency   category
ABC-001   2999          USD        electronics
ABC-002   3499          USD        electronics
ABC-003   3999          USD        electronics

Changes made:

  • Uppercase SKUs for consistency
  • Removed currency symbols, stored as cents
  • Added explicit currency field
  • Lowercase categories

Example 3: Survey Response Standardization

Before:

Q: How satisfied are you?
A1: "Very Satisfied"
A2: "very satisfied"
A3: "5 out of 5"
A4: "Extremely happy!"
A5: "10/10"

After (with mapping):

Q: satisfaction_score
A1: 5
A2: 5
A3: 5
A4: 5
A5: 5

Common Mistakes and How to Avoid Them

Mistake 1: Over-Cleaning

Problem: Removing data that seems like noise but is actually meaningful
Example: Stripping "Jr." or "III" from names
Solution: Create explicit rules for what to keep; when in doubt, preserve original data in a separate field

Mistake 2: Locale-Blind Processing

Problem: Applying US-centric rules to international data
Example: Assuming all phone numbers are 10 digits
Solution: Use libraries that handle international formats (libphonenumber for phones, country-specific validators for addresses)

Mistake 3: Destructive Transformations

Problem: Overwriting original data without backups
Solution: Always keep the raw data; create cleaned copies in new columns or tables

Mistake 4: One-Time Cleaning

Problem: Cleaning data once and assuming it stays clean
Solution: Implement validation at data entry points; schedule regular cleaning jobs

Mistake 5: Ignoring Edge Cases

Problem: Rules that work for 95% of data but corrupt the other 5%
Example: Converting "NULL" to null, but also converting a company named "Null Security Inc." to null
Solution: Review edge cases manually; build exception lists
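An exception list can be a simple allow-list checked before the null rule fires. A sketch (the company name is the example from above; populate the list from your own manual review):

```python
NULL_TOKENS = {"null", "n/a", "none", "-", ""}
EXCEPTIONS = {"Null Security Inc."}  # legitimate values kept verbatim

def normalize_null(value: str):
    """Convert null-marker tokens to None, sparing known exceptions."""
    if value in EXCEPTIONS:
        return value
    return None if value.strip().lower() in NULL_TOKENS else value

assert normalize_null("NULL") is None
assert normalize_null("Null Security Inc.") == "Null Security Inc."
```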

Tools for Data Cleaning

Spreadsheet Tools

  • Google Sheets: TRIM(), CLEAN(), PROPER(), built-in data validation
  • Excel: Power Query for complex transformations
  • OpenRefine: Free tool specifically designed for messy data

Programming Libraries

  • Python: pandas for DataFrames, regex for pattern matching
  • SQL: Window functions for deduplication, CASE statements for standardization

AI-Assisted Cleaning

  • Use LLMs to suggest data quality rules
  • Generate regex patterns from examples
  • Classify unstructured text into categories

Data Quality Platforms

  • Great Expectations: Open-source data validation
  • Monte Carlo: Data observability and monitoring
  • Atlan: Data catalog with quality features

Building a Data Hygiene Process

Step 1: Profile Your Data

Before cleaning, understand what you have:

  • What percentage of fields are filled?
  • What's the distribution of values?
  • Are there obvious outliers?
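The first two questions are one-liners in pandas. A sketch on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "email": ["a@x.com", None, "b@x.com", None],
    "status": ["active", "active", "pending", "active"],
})

# Fill rate: share of non-missing values per column
fill_rate = df.notna().mean()
print(fill_rate["email"])  # 0.5

# Value distribution for a categorical column
print(df["status"].value_counts().to_dict())  # {'active': 3, 'pending': 1}
```

For numeric columns, `df.describe()` gives a quick first pass at outliers.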

Step 2: Define Quality Rules

Document specific rules for each field:

  • Acceptable formats
  • Required vs. optional
  • Allowed values (for enums)
  • Validation patterns (for emails, phones)
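Documented rules translate directly into executable checks. A minimal sketch (the email pattern is deliberately simple and not RFC 5322-complete; the field names and allowed values are illustrative):

```python
import re

RULES = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),   # pattern rule
    "status": {"active", "inactive", "pending"},           # enum rule
}

def validate(field: str, value: str) -> bool:
    """Check a value against its field's rule: enum membership or regex match."""
    rule = RULES[field]
    if isinstance(rule, set):
        return value in rule
    return bool(rule.fullmatch(value))

assert validate("email", "john.smith@company.com")
assert not validate("status", "Active")  # case matters per the naming rules
```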

Step 3: Automate Where Possible

  • Implement validation at data entry
  • Schedule regular cleaning scripts
  • Set up alerts for quality degradation

Step 4: Monitor Continuously

  • Track data quality metrics over time
  • Alert when quality drops below thresholds
  • Review and update rules quarterly