Data Hygiene
Clean data is the foundation of effective AI systems. This guide covers essential data cleaning protocols for RAG (Retrieval-Augmented Generation) and other AI applications.
The Golden Rules
Follow these non-negotiable rules for data cleanliness:
1. No Merged Cells
Merged cells break parsing, indexing, and machine readability.
| Bad Practice | Good Practice |
|---|---|
| Merged cells spanning rows/columns | One value per cell |
| Hidden data in merged regions | Explicit data in each cell |
2. ISO 8601 Dates
Always use the international standard for dates:
Format: YYYY-MM-DDTHH:MM:SSZ
Examples:
- 2024-01-15
- 2024-01-15T14:30:00Z
- 2024-01-15T14:30:00-05:00
Never use:
- 01/15/2024 (ambiguous)
- January 15, 2024 (text format)
- 15-01-24 (unclear century)
3. Consistent Naming
Establish naming conventions and enforce them rigorously.
Column Names:
Good: customer_id, created_at, total_amount
Bad: Customer ID, CreatedAt, Total $ Amount
Category Values:
Good: "active", "inactive", "pending"
Bad: "Active", "ACTIVE", "active ", "Active "
Data Cleaning Checklist
Use this checklist before feeding data into any AI system:
Structure
- No merged cells
- Headers in first row only
- One data type per column
- No empty rows within data
- No hidden rows or columns
Formatting
- Dates in ISO 8601 format
- Numbers without formatting (no $, %, commas)
- Text trimmed of leading/trailing whitespace
- Consistent capitalization
- No special characters in identifiers
Content
- No duplicate records
- Missing values handled (null, not blank)
- Outliers reviewed and documented
- Categories standardized
- References validated
Common Data Issues
Issue: Inconsistent Nulls
Problem: Multiple representations of "no data"
Empty string: ""
Text: "N/A", "null", "none", "-"
Numbers: 0, -1, 9999
Solution: Standardize on explicit NULL values
UPDATE table SET field = NULL
WHERE field IN ('N/A', 'null', 'none', '-', '');
Issue: Trailing Whitespace
Problem: "Active " !== "Active"
Solution: Trim all text fields
df['status'] = df['status'].str.strip()
Issue: Mixed Data Types
Problem: Column contains both numbers and text
Solution: Separate into appropriate columns or convert consistently
RAG-Specific Considerations
When preparing data for Retrieval-Augmented Generation:
Chunking Strategy
- Keep semantic units together
- Maintain context with overlapping chunks
- Include metadata with each chunk
Metadata Requirements
Each document chunk should include:
- Source document ID
- Section/page reference
- Creation/update timestamp
- Author/owner attribution
- Classification/category tags
Quality Signals
Tag content with quality indicators:
- Verified vs. unverified
- Official vs. user-generated
- Current vs. archived