Data Hygiene
Clean data is the foundation of effective AI systems. This guide covers essential data cleaning protocols for RAG (Retrieval-Augmented Generation) and other AI applications.
The Golden Rules
Follow these non-negotiable rules for data cleanliness:
1. No Merged Cells
Merged cells break parsing, indexing, and machine readability.
| Bad Practice | Good Practice |
|---|---|
| Merged cells spanning rows/columns | One value per cell |
| Hidden data in merged regions | Explicit data in each cell |
2. ISO 8601 Dates
Always use the international standard for dates:
Format: YYYY-MM-DD for dates; YYYY-MM-DDTHH:MM:SSZ (UTC, or with an explicit offset) for timestamps
Examples:
- 2024-01-15
- 2024-01-15T14:30:00Z
- 2024-01-15T14:30:00-05:00
Never use:
- 01/15/2024 (ambiguous)
- January 15, 2024 (text format)
- 15-01-24 (unclear century)
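Converting legacy formats can be done with Python's standard library alone; the input format string below is just an example for the "January 15, 2024" style:

```python
from datetime import datetime, timezone

def to_iso8601(raw: str, fmt: str = "%B %d, %Y") -> str:
    """Parse a known legacy date format and emit an ISO 8601 UTC timestamp."""
    dt = datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")

print(to_iso8601("January 15, 2024"))  # 2024-01-15T00:00:00Z
```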
3. Consistent Naming
Establish naming conventions and enforce them rigorously.
Column Names:
Good: customer_id, created_at, total_amount
Bad: Customer ID, CreatedAt, Total $ Amount
Category Values:
Good: "active", "inactive", "pending"
Bad: "Active", "ACTIVE", "active ", "Active "
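A minimal sketch of enforcing canonical category values, assuming lowercase is the chosen convention:

```python
def normalize_category(value: str) -> str:
    # Collapse case and whitespace variants into one canonical token.
    return value.strip().lower()

variants = ["Active", "ACTIVE", "active ", "Active "]
print({normalize_category(v) for v in variants})  # {'active'}
```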
Data Cleaning Checklist
Use this checklist before feeding data into any AI system:
Structure
- No merged cells
- Headers in first row only
- One data type per column
- No empty rows within data
- No hidden rows or columns
Formatting
- Dates in ISO 8601 format
- Numbers without formatting (no $, %, commas)
- Text trimmed of leading/trailing whitespace
- Consistent capitalization
- No special characters in identifiers
Content
- No duplicate records
- Missing values handled (null, not blank)
- Outliers reviewed and documented
- Categories standardized
- References validated
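Several of the checklist items above can be scripted. A minimal pandas sketch (the `total_amount` column name is illustrative):

```python
import pandas as pd

def apply_checklist(df: pd.DataFrame) -> pd.DataFrame:
    """Apply a few checklist items to a copy of the frame."""
    out = df.copy()
    # Trim whitespace and normalize case on text columns
    for col in out.select_dtypes(include="object"):
        out[col] = out[col].str.strip().str.lower()
    # Strip $ and commas, then parse numbers (hypothetical column name)
    if "total_amount" in out:
        out["total_amount"] = pd.to_numeric(
            out["total_amount"].str.replace(r"[$,]", "", regex=True)
        )
    # Drop fully empty rows and exact duplicate records
    return out.dropna(how="all").drop_duplicates()
```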
Common Data Issues
Issue: Inconsistent Nulls
Problem: Multiple representations of "no data"
Empty string: ""
Text: "N/A", "null", "none", "-"
Numbers: 0, -1, 9999
Solution: Standardize on explicit NULL values
```sql
UPDATE table SET field = NULL
WHERE field IN ('N/A', 'null', 'none', '-', '');
```
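A pandas counterpart to the same standardization; extend the token set to match whatever sentinels appear in your data:

```python
import pandas as pd

NULL_TOKENS = {"n/a", "null", "none", "-", ""}

def standardize_nulls(series: pd.Series) -> pd.Series:
    # Replace each sentinel token with a real missing value.
    return series.map(
        lambda v: None if isinstance(v, str) and v.strip().lower() in NULL_TOKENS else v
    )

s = pd.Series(["Alice", "N/A", "", "-", "Bob"])
print(standardize_nulls(s).isna().sum())  # 3
```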
Issue: Trailing Whitespace
Problem: "Active " != "Active"
Solution: Trim all text fields
```python
df['status'] = df['status'].str.strip()
```
Issue: Mixed Data Types
Problem: Column contains both numbers and text
Solution: Separate into appropriate columns or convert consistently
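With pandas, `errors="coerce"` turns unparseable entries into NaN so they can be reviewed instead of silently corrupting the column (the values here are illustrative):

```python
import pandas as pd

raw = pd.Series(["120", "95", "unknown", "88"])  # numbers mixed with text
numeric = pd.to_numeric(raw, errors="coerce")    # "unknown" becomes NaN
flagged = raw[numeric.isna()]                    # rows that need manual review
print(flagged.tolist())  # ['unknown']
```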
RAG-Specific Considerations
When preparing data for Retrieval-Augmented Generation:
Chunking Strategy
- Keep semantic units together
- Maintain context with overlapping chunks
- Include metadata with each chunk
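A minimal sketch of character-based chunking with overlap; real pipelines usually split on sentence or paragraph boundaries to keep semantic units intact, and the sizes here are arbitrary:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks; consecutive chunks share `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    chunks, step = [], size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks
```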
Metadata Requirements
Each document chunk should include:
- Source document ID
- Section/page reference
- Creation/update timestamp
- Author/owner attribution
- Classification/category tags
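As a sketch, that metadata can travel with each chunk as a typed record (the field names are illustrative, not a standard schema):

```python
from dataclasses import dataclass, field

@dataclass
class ChunkMetadata:
    source_doc_id: str       # source document ID
    section_ref: str         # section/page reference
    updated_at: str          # ISO 8601 creation/update timestamp
    author: str              # author/owner attribution
    tags: list[str] = field(default_factory=list)  # classification/category tags

meta = ChunkMetadata(
    source_doc_id="doc-042",
    section_ref="section 3.2",
    updated_at="2024-01-15T14:30:00Z",
    author="ops-team",
    tags=["policy", "verified", "current"],
)
```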
Quality Signals
Tag content with quality indicators:
- Verified vs. unverified
- Official vs. user-generated
- Current vs. archived
Real-World Examples
Example 1: CRM Data Cleanup
Before cleaning:
Name: john smith
Email: JOHN.SMITH@COMPANY.COM
Phone: (555) 123-4567
Status: active
Created: Jan 15, 2024
After cleaning:
name: John Smith
email: john.smith@company.com
phone: 5551234567
status: active
created_at: 2024-01-15T00:00:00Z
Changes made:
- Standardized capitalization (Title Case for names)
- Normalized email to lowercase
- Removed formatting from phone numbers
- Converted date to ISO 8601 format
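The four changes can be expressed as one cleanup function; the input keys and date format below mirror the "before" record above, so adjust them for your source:

```python
from datetime import datetime, timezone

def clean_crm_record(rec: dict) -> dict:
    created = datetime.strptime(rec["Created"], "%b %d, %Y").replace(tzinfo=timezone.utc)
    return {
        "name": rec["Name"].strip().title(),                       # Title Case names
        "email": rec["Email"].strip().lower(),                     # lowercase email
        "phone": "".join(c for c in rec["Phone"] if c.isdigit()),  # digits only
        "status": rec["Status"].strip().lower(),
        "created_at": created.strftime("%Y-%m-%dT%H:%M:%SZ"),      # ISO 8601
    }
```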
Example 2: Product Catalog Normalization
Before:
| SKU | Price | Category |
|---|---|---|
| ABC-001 | $29.99 | Electronics |
| abc-002 | 34.99 | electronics |
| ABC-003 | $39.99 USD | ELECTRONICS |
After:
| sku | price_cents | currency | category |
|---|---|---|---|
| ABC-001 | 2999 | USD | electronics |
| ABC-002 | 3499 | USD | electronics |
| ABC-003 | 3999 | USD | electronics |
Changes made:
- Uppercase SKUs for consistency
- Removed currency symbols, stored as cents
- Added explicit currency field
- Lowercase categories
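A sketch of the same normalization in Python; it assumes USD whenever the source omits a currency:

```python
import re

def normalize_catalog_row(sku: str, price: str, category: str) -> dict:
    amount = re.sub(r"[^0-9.]", "", price)          # "$39.99 USD" -> "39.99"
    return {
        "sku": sku.upper(),
        "price_cents": round(float(amount) * 100),  # store money as integer cents
        "currency": "USD",                          # assumed when not stated
        "category": category.strip().lower(),
    }

print(normalize_catalog_row("abc-002", "34.99", "electronics"))
```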
Example 3: Survey Response Standardization
Before:
Q: How satisfied are you?
A1: "Very Satisfied"
A2: "very satisfied"
A3: "5 out of 5"
A4: "Extremely happy!"
A5: "10/10"
After (with mapping):
Q: satisfaction_score
A1: 5
A2: 5
A3: 5
A4: 5
A5: 5
Common Mistakes and How to Avoid Them
Mistake 1: Over-Cleaning
Problem: Removing data that seems like noise but is actually meaningful
Example: Stripping "Jr." or "III" from names
Solution: Create explicit rules for what to keep; when in doubt, preserve original data in a separate field
Mistake 2: Locale-Blind Processing
Problem: Applying US-centric rules to international data
Example: Assuming all phone numbers are 10 digits
Solution: Use libraries that handle international formats (libphonenumber for phones, country-specific validators for addresses)
Mistake 3: Destructive Transformations
Problem: Overwriting original data without backups
Solution: Always keep the raw data; create cleaned copies in new columns or tables
Mistake 4: One-Time Cleaning
Problem: Cleaning data once and assuming it stays clean
Solution: Implement validation at data entry points; schedule regular cleaning jobs
Mistake 5: Ignoring Edge Cases
Problem: Rules that work for 95% of data but corrupt the other 5%
Example: Converting "NULL" to null, but also converting a company named "Null Security Inc." to null
Solution: Review edge cases manually; build exception lists
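An exception list can be as simple as a whitelist consulted before the rule fires; the values below come from the example above:

```python
NULL_TOKENS = {"null", "n/a", "none", "-"}
EXCEPTIONS = {"Null Security Inc."}  # legitimate values that look like sentinels

def to_null_safe(value: str):
    if value in EXCEPTIONS:  # exceptions win over the rule
        return value
    return None if value.strip().lower() in NULL_TOKENS else value
```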
Recommended Tools
Spreadsheet Tools
- Google Sheets: TRIM(), CLEAN(), PROPER(), built-in data validation
- Excel: Power Query for complex transformations
- OpenRefine: Free tool specifically designed for messy data
Programming Libraries
- Python: pandas for DataFrames, regex for pattern matching
- SQL: Window functions for deduplication, CASE statements for standardization
AI-Assisted Cleaning
- Use LLMs to suggest data quality rules
- Generate regex patterns from examples
- Classify unstructured text into categories
Data Quality Platforms
- Great Expectations: Open-source data validation
- Monte Carlo: Data observability and monitoring
- Atlan: Data catalog with quality features
Building a Data Hygiene Process
Step 1: Profile Your Data
Before cleaning, understand what you have:
- What percentage of fields are filled?
- What's the distribution of values?
- Are there obvious outliers?
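A pandas sketch that answers those three questions for a frame; the shape of the returned dict is my own choice, not a standard:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> dict:
    return {
        # What percentage of fields are filled?
        "fill_rate_pct": (df.notna().mean() * 100).round(1).to_dict(),
        # What's the distribution of values?
        "top_values": {
            col: df[col].value_counts().head(3).to_dict()
            for col in df.select_dtypes(include="object")
        },
        # Are there obvious outliers? (compare min/max against the quartiles)
        "numeric_summary": df.describe().to_dict(),
    }
```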
Step 2: Define Quality Rules
Document specific rules for each field:
- Acceptable formats
- Required vs. optional
- Allowed values (for enums)
- Validation patterns (for emails, phones)
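Rules like these can live in one table that both documentation and validation code read from. A hedged sketch; the email pattern is deliberately simplified, not full RFC 5322 validation:

```python
import re

RULES = {
    "email": {"required": True, "pattern": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")},
    "status": {"required": True, "allowed": {"active", "inactive", "pending"}},
    "notes": {"required": False},
}

def validate(record: dict) -> list[str]:
    errors = []
    for name, rule in RULES.items():
        value = record.get(name)
        if value in (None, ""):
            if rule.get("required"):
                errors.append(f"{name}: missing required value")
            continue
        if "pattern" in rule and not rule["pattern"].match(value):
            errors.append(f"{name}: bad format")
        if "allowed" in rule and value not in rule["allowed"]:
            errors.append(f"{name}: not an allowed value")
    return errors
```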
Step 3: Automate Where Possible
- Implement validation at data entry
- Schedule regular cleaning scripts
- Set up alerts for quality degradation
Step 4: Monitor Continuously
- Track data quality metrics over time
- Alert when quality drops below thresholds
- Review and update rules quarterly