Skip to main content

Data Hygiene

Clean data is the foundation of effective AI systems. This guide covers essential data cleaning protocols for RAG (Retrieval-Augmented Generation) and other AI applications.

The Golden Rules

Follow these non-negotiable rules for data cleanliness:

1. No Merged Cells

Merged cells break parsing, indexing, and machine readability.

Bad PracticeGood Practice
Merged cells spanning rows/columnsOne value per cell
Hidden data in merged regionsExplicit data in each cell

2. ISO 8601 Dates

Always use the international standard for dates:

Format: YYYY-MM-DDTHH:MM:SSZ

Examples:
- 2024-01-15
- 2024-01-15T14:30:00Z
- 2024-01-15T14:30:00-05:00

Never use:

  • 01/15/2024 (ambiguous)
  • January 15, 2024 (text format)
  • 15-01-24 (unclear century)

3. Consistent Naming

Establish naming conventions and enforce them rigorously.

Column Names:

Good: customer_id, created_at, total_amount
Bad: Customer ID, CreatedAt, Total $ Amount

Category Values:

Good: "active", "inactive", "pending"
Bad: "Active", "ACTIVE", "active ", "Active "

Data Cleaning Checklist

Use this checklist before feeding data into any AI system:

Structure

  • No merged cells
  • Headers in first row only
  • One data type per column
  • No empty rows within data
  • No hidden rows or columns

Formatting

  • Dates in ISO 8601 format
  • Numbers without formatting (no $, %, commas)
  • Text trimmed of leading/trailing whitespace
  • Consistent capitalization
  • No special characters in identifiers

Content

  • No duplicate records
  • Missing values handled (null, not blank)
  • Outliers reviewed and documented
  • Categories standardized
  • References validated

Common Data Issues

Issue: Inconsistent Nulls

Problem: Multiple representations of "no data"

Empty string: ""
Text: "N/A", "null", "none", "-"
Numbers: 0, -1, 9999

Solution: Standardize on explicit NULL values

UPDATE table SET field = NULL 
WHERE field IN ('N/A', 'null', 'none', '-', '');

Issue: Trailing Whitespace

Problem: "Active " !== "Active"

Solution: Trim all text fields

df['status'] = df['status'].str.strip()

Issue: Mixed Data Types

Problem: Column contains both numbers and text

Solution: Separate into appropriate columns or convert consistently

RAG-Specific Considerations

When preparing data for Retrieval-Augmented Generation:

Chunking Strategy

  • Keep semantic units together
  • Maintain context with overlapping chunks
  • Include metadata with each chunk

Metadata Requirements

Each document chunk should include:

  • Source document ID
  • Section/page reference
  • Creation/update timestamp
  • Author/owner attribution
  • Classification/category tags

Quality Signals

Tag content with quality indicators:

  • Verified vs. unverified
  • Official vs. user-generated
  • Current vs. archived