RAG Preparation Guide
Retrieval-Augmented Generation (RAG) lets you build AI systems that "know" your organization's specific information. But the quality of a RAG system depends heavily on the quality of the data you feed it.
What is RAG?
RAG combines two powerful capabilities:
- Retrieval: Finding relevant information from your documents
- Generation: Using AI to create responses based on that information
The result: An AI that can answer questions using your company's actual data, not just general internet knowledge.
The Document Preparation Checklist
Phase 1: Inventory
Before anything else, know what you have:
| Document Type | Location | Format | Last Updated | Owner |
|---|---|---|---|---|
| Policies | Shared Drive | | 2024-01 | HR |
| Procedures | Wiki | HTML | 2023-08 | Ops |
| FAQs | Website | HTML | 2024-03 | Marketing |
| Contracts | Legal Folder | DOCX | Various | Legal |
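As a starting point, the inventory can be kept as structured data and exported to a spreadsheet-friendly CSV. This is a minimal sketch: the `InventoryRow` class, its field names, and `inventory_to_csv` are hypothetical names chosen for illustration, and the sample rows are copied from the table above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class InventoryRow:
    """One line of the document inventory (field names are illustrative)."""
    document_type: str
    location: str
    file_format: str
    last_updated: str  # kept as free text so values like "Various" are allowed
    owner: str


def inventory_to_csv(rows: list[InventoryRow]) -> str:
    """Serialize inventory rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(InventoryRow)])
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))
    return buffer.getvalue()


rows = [
    InventoryRow("Procedures", "Wiki", "HTML", "2023-08", "Ops"),
    InventoryRow("FAQs", "Website", "HTML", "2024-03", "Marketing"),
    InventoryRow("Contracts", "Legal Folder", "DOCX", "Various", "Legal"),
]
inventory_csv = inventory_to_csv(rows)
print(inventory_csv)
```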
Phase 2: Quality Assessment
Rate each document source:
- Accuracy: Is the information still correct?
- Completeness: Are there gaps or missing sections?
- Clarity: Would someone new understand it?
- Currency: When was it last reviewed?
- Authority: Is this the official version?
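One way to make these ratings comparable across sources is a simple numeric rubric. The sketch below assumes a 1-5 scale and an unweighted average, neither of which is prescribed by this guide; `quality_score` and the sample ratings are hypothetical.

```python
# The five criteria from the list above, as dictionary keys.
CRITERIA = ("accuracy", "completeness", "clarity", "currency", "authority")


def quality_score(ratings: dict[str, int]) -> float:
    """Average the five criterion ratings (assumed scale: 1 = poor, 5 = excellent)."""
    missing = [c for c in CRITERIA if c not in ratings]
    if missing:
        raise ValueError(f"missing ratings: {missing}")
    for criterion in CRITERIA:
        if not 1 <= ratings[criterion] <= 5:
            raise ValueError(f"{criterion} must be between 1 and 5")
    return sum(ratings[c] for c in CRITERIA) / len(CRITERIA)


# Made-up ratings for one document source.
wiki_procedures = {
    "accuracy": 4,
    "completeness": 3,
    "clarity": 4,
    "currency": 2,
    "authority": 5,
}
score = quality_score(wiki_procedures)
print(f"Quality score: {score:.1f}")
```

If some criteria matter more to your organization (accuracy, for example), a weighted average is a natural variation.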
Phase 3: Format Standardization
RAG systems work best with clean, structured text. Plan to convert documents in challenging formats into one of the good formats below.
Good formats for RAG:
- Markdown (.md)
- Plain text (.txt)
- Well-structured HTML
- Clean Word documents
Challenging formats:
- Scanned PDFs (need OCR)
- Complex spreadsheets
- Presentations with minimal text
- Images with embedded text
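A first pass over a folder can sort files into these two groups by extension. This is a rough sketch under stated assumptions: the extension sets and the `triage` function are illustrative, and an extension alone cannot tell a scanned PDF from a text-based one, or well-structured HTML from messy HTML, so treat the result as a hint for manual review.

```python
from pathlib import Path

# Assumed mapping from file extension to the two groups above.
READY = {".md", ".txt", ".html", ".htm", ".docx"}
NEEDS_WORK = {".pdf", ".xlsx", ".xls", ".pptx", ".ppt", ".png", ".jpg", ".jpeg", ".tiff"}


def triage(filename: str) -> str:
    """Classify a file as 'ready', 'needs_work', or 'unknown' by its extension."""
    suffix = Path(filename).suffix.lower()
    if suffix in READY:
        return "ready"
    if suffix in NEEDS_WORK:
        return "needs_work"
    return "unknown"


for name in ["Handbook.MD", "scanned_contract.pdf", "budget.xlsx", "notes"]:
    print(name, "->", triage(name))
```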
Content Structuring Best Practices
Use Clear Headings
Before:
Our refund policy is as follows. Customers have 30 days to return items. Items must be unused. Refunds are processed in 5-7 business days.
After:
# Refund Policy
## Return Window
Customers have 30 days from purchase to return items.
## Condition Requirements
Items must be unused and in original packaging.
## Processing Time
Refunds are processed within 5-7 business days.
Include Context
Documents often assume knowledge that the AI system won't have, such as what a form number refers to or which building is meant. Spell that context out:
Before:
Submit form 1042-B to the third floor.
After:
Submit form 1042-B (Vendor Registration Request) to the Procurement Department on the third floor of the Main Office building.
Remove Ambiguity
Before:
Contact the team for more information.
After:
Contact the Customer Support team at support@company.com or 555-123-4567 for more information.
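Ambiguous references like these can be flagged automatically before a human rewrites them. The sketch below is only a heuristic: the phrase list and the `find_vague_phrases` function are hypothetical, and you would extend the patterns with wording that is common in your own documents.

```python
import re

# Illustrative patterns only; not an exhaustive list of vague wording.
VAGUE_PHRASES = [
    r"\bthe team\b",
    r"\bthe office\b",
    r"\bthe usual (?:process|form)\b",
    r"\bas mentioned (?:above|earlier)\b",
]
VAGUE_RE = re.compile("|".join(VAGUE_PHRASES), re.IGNORECASE)


def find_vague_phrases(text: str) -> list[str]:
    """Return each vague phrase found in the text, lowercased, in order of appearance."""
    return [match.group(0).lower() for match in VAGUE_RE.finditer(text)]


before = "Contact the team for more information."
after = (
    "Contact the Customer Support team at support@company.com "
    "or 555-123-4567 for more information."
)
print(find_vague_phrases(before))
print(find_vague_phrases(after))
```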
Document Types and Preparation Notes
Policies and Procedures
Key considerations:
- Include effective dates
- Note approval authority
- Add revision history
- Cross-reference related policies
Structure suggestion:
# Policy Name
## Purpose
## Scope (Who this applies to)
## Policy Statement
## Procedures
## Exceptions
## Related Policies
## Revision History
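If policies are stored as Markdown, a small check can report which of the suggested sections a draft is missing. This sketch assumes second-level (`##`) headings exactly as in the structure above; `missing_sections` and the sample draft are hypothetical.

```python
import re

# Section names from the suggested structure; "Scope" also matches
# "Scope (Who this applies to)" because headings are compared by prefix.
REQUIRED_SECTIONS = (
    "Purpose",
    "Scope",
    "Policy Statement",
    "Procedures",
    "Exceptions",
    "Related Policies",
    "Revision History",
)


def missing_sections(markdown_text: str) -> list[str]:
    """Return the required sections that have no matching '##' heading."""
    headings = re.findall(r"^##[ \t]+(.+?)[ \t]*$", markdown_text, flags=re.MULTILINE)
    found = [heading.lower() for heading in headings]
    return [
        section
        for section in REQUIRED_SECTIONS
        if not any(heading.startswith(section.lower()) for heading in found)
    ]


draft = """# Remote Work Policy
## Purpose
Explains when employees may work remotely.
## Scope (Who this applies to)
All employees.
## Policy Statement
Remote work requires manager approval.
"""
print(missing_sections(draft))
```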
FAQs and Knowledge Base
Key considerations:
- One question per section
- Include common variations of questions
- Provide complete, standalone answers
- Update based on actual user questions
Structure suggestion:
## Question: [Exact question as users ask it]
**Short Answer:** [1-2 sentence response]
**Detailed Answer:** [Full explanation]
**Related Questions:** [Links to similar topics]
Technical Documentation
Key considerations:
- Define acronyms on first use
- Include prerequisites
- Provide context for steps
- Note common errors and solutions
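The first consideration, defining acronyms on first use, lends itself to a quick automated check. The sketch below assumes the convention "Full Name (ACRONYM)" and treats any all-caps word of 2 to 6 letters as an acronym; both are simplifying assumptions, and `acronyms_not_defined_on_first_use` is a hypothetical helper that will produce false positives on words such as "NOTE".

```python
import re

ACRONYM_RE = re.compile(r"\b[A-Z]{2,6}\b")


def acronyms_not_defined_on_first_use(text: str) -> list[str]:
    """Return acronyms whose first occurrence is not wrapped in parentheses."""
    flagged = []
    seen = set()
    for match in ACRONYM_RE.finditer(text):
        acronym = match.group(0)
        if acronym in seen:
            continue  # only the first occurrence matters
        seen.add(acronym)
        start, end = match.span()
        in_parentheses = text[start - 1:start] == "(" and text[end:end + 1] == ")"
        if not in_parentheses:
            flagged.append(acronym)
    return flagged


sample = (
    "Connect to the VPN first. The Application Programming Interface (API) "
    "returns JSON; the API is rate limited."
)
print(acronyms_not_defined_on_first_use(sample))
```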
Training Materials
Key considerations:
- Separate reference material from exercises
- Include learning objectives
- Provide standalone definitions
- Note prerequisites and sequence
Chunking Strategy
RAG systems break documents into "chunks" for retrieval. How you structure content determines where those chunks break and whether each one makes sense on its own.
Recommended Chunk Sizes
| Content Type | Ideal Chunk Size | Reason |
|---|---|---|
| FAQs | 1 Q&A pair | Self-contained answers |
| Procedures | 1 section | Complete task context |
| Policies | 1 subsection | Focused policy statements |
| Reference | 300-500 words | Enough context without overflowing the model's context window |
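For reference material without clear section boundaries, a fixed-size word window is one simple way to stay within the 300-500 word range. This is a minimal sketch: `chunk_by_words` is a hypothetical helper, and the 50-word overlap between neighboring chunks is an assumption added here, not something the table requires.

```python
def chunk_by_words(text: str, max_words: int = 400, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of the two chunks (an assumption
    of this sketch).
    """
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be non-negative and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the final words are already covered
    return chunks


# Synthetic 1000-word text, used only to show the chunk boundaries.
text = " ".join(f"w{i}" for i in range(1000))
chunks = chunk_by_words(text)
print([len(chunk.split()) for chunk in chunks])
```

Splitting on word counts ignores sentence and heading boundaries, so prefer structural boundaries where the document has them.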
Natural Chunk Boundaries
Use these as natural breaking points:
- Major headings (H2, H3)
- Complete procedures or processes
- Individual policy sections
- Single topics or concepts
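When documents are in Markdown, splitting at H2 and H3 headings follows these boundaries directly. The sketch below assumes ATX-style headings (`##`, `###`); `split_on_headings` is a hypothetical helper, and any text before the first matching heading (such as the H1 title) is returned as a chunk with an empty heading.

```python
import re

# Matches H2 and H3 headings only, e.g. "## Return Window".
HEADING_RE = re.compile(r"^(#{2,3})[ \t]+(.+?)[ \t]*$")


def split_on_headings(markdown_text: str) -> list[dict[str, str]]:
    """Split Markdown into chunks, one per H2/H3 section."""
    sections = []
    heading, lines = "", []

    def flush() -> None:
        body = "\n".join(lines).strip()
        if heading or body:
            sections.append({"heading": heading, "text": body})

    for line in markdown_text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()  # close the previous section before starting a new one
            heading, lines = match.group(2), []
        else:
            lines.append(line)
    flush()
    return sections


policy = """# Refund Policy
## Return Window
Customers have 30 days from purchase to return items.
## Condition Requirements
Items must be unused and in original packaging.
"""
for chunk in split_on_headings(policy):
    print(repr(chunk["heading"]), "->", chunk["text"])
```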
Metadata Matters
Good metadata helps retrieval accuracy:
| Metadata Field | Purpose | Example |
|---|---|---|
| Title | Quick identification | "Remote Work Policy" |
| Category | Filtering | "HR Policies" |
| Last Updated | Currency check | "2024-03-15" |
| Author/Owner | Authority | "HR Department" |
| Audience | Relevance | "All Employees" |
| Keywords | Discovery | "work from home, WFH, telecommute" |
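In practice, these fields travel with each chunk so the retriever can filter and rank on them. The sketch below shows one possible shape: `DocumentMetadata` and `attach_metadata` are hypothetical names, the values come from the Example column above, and how the record is stored depends on the retrieval system you use.

```python
from dataclasses import asdict, dataclass, field
from datetime import date


@dataclass
class DocumentMetadata:
    """Metadata fields from the table above (names are illustrative)."""
    title: str
    category: str
    last_updated: date
    owner: str
    audience: str
    keywords: list[str] = field(default_factory=list)


def attach_metadata(chunk_text: str, metadata: DocumentMetadata) -> dict:
    """Combine chunk text and metadata into one JSON-friendly record."""
    record = asdict(metadata)
    record["last_updated"] = metadata.last_updated.isoformat()
    record["text"] = chunk_text
    return record


metadata = DocumentMetadata(
    title="Remote Work Policy",
    category="HR Policies",
    last_updated=date(2024, 3, 15),
    owner="HR Department",
    audience="All Employees",
    keywords=["work from home", "WFH", "telecommute"],
)
# The chunk text here is a made-up sentence for illustration.
record = attach_metadata("Remote work requires manager approval.", metadata)
print(record)
```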
Common Preparation Mistakes
Mistake 1: Including Everything
- Problem: Outdated or duplicate documents pollute results
- Solution: Curate ruthlessly; less is often more
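Exact duplicates are the easiest part of this curation to automate. The following sketch hashes each document after normalizing case and whitespace; `fingerprint` and `find_duplicates` are hypothetical helpers, and near-duplicates with small wording changes would need a fuzzier comparison that is not shown here.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(documents: dict[str, str]) -> list[tuple[str, str]]:
    """Return (duplicate, original) name pairs for documents with identical content."""
    first_seen: dict[str, str] = {}
    duplicates = []
    for name, text in documents.items():
        key = fingerprint(text)
        if key in first_seen:
            duplicates.append((name, first_seen[key]))
        else:
            first_seen[key] = name
    return duplicates


# File names and contents are made up for illustration.
documents = {
    "refunds_v1.md": "Refunds are processed within 5-7 business days.",
    "refunds_copy.md": "refunds are  processed within 5-7 business days.\n",
    "returns.md": "Customers have 30 days from purchase to return items.",
}
print(find_duplicates(documents))
```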
Mistake 2: Keeping Complex Formatting
- Problem: Complex formatting breaks during extraction
- Solution: Simplify to clean text with basic structure
Mistake 3: Ignoring Context
- Problem: Documents reference things without explanation
- Solution: Make each document standalone when possible
Mistake 4: Skipping Review
- Problem: Errors in source documents become AI "facts"
- Solution: Verify accuracy before ingestion
Mistake 5: Set and Forget
- Problem: Information becomes stale
- Solution: Establish regular review cycles
Implementation Roadmap
Weeks 1-2: Discovery
- Complete document inventory
- Identify document owners
- Assess current quality
Weeks 3-4: Prioritization
- Score documents by importance and quality
- Identify quick wins (high value, good quality)
- Flag documents needing significant work
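The prioritization step can be sketched as a simple bucketing rule. The 1-5 scales, the threshold of 4, the `prioritize` function, and the sample scores below are all assumptions made for illustration.

```python
def prioritize(documents: list[dict], threshold: int = 4) -> dict[str, list[str]]:
    """Bucket documents by importance and quality (both assumed to be 1-5)."""
    buckets: dict[str, list[str]] = {"quick_wins": [], "needs_work": [], "low_priority": []}
    for doc in documents:
        important = doc["importance"] >= threshold
        good = doc["quality"] >= threshold
        if important and good:
            buckets["quick_wins"].append(doc["name"])    # high value, good quality
        elif important:
            buckets["needs_work"].append(doc["name"])    # high value, significant work
        else:
            buckets["low_priority"].append(doc["name"])
    return buckets


# Made-up scores for three document sources.
documents = [
    {"name": "FAQs", "importance": 5, "quality": 4},
    {"name": "Procedures", "importance": 4, "quality": 2},
    {"name": "Old newsletters", "importance": 1, "quality": 3},
]
print(prioritize(documents))
```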
Weeks 5-8: Preparation
- Clean and restructure priority documents
- Add missing context and metadata
- Convert to preferred formats
Weeks 9-10: Validation
- Review prepared documents for accuracy
- Test sample retrievals
- Adjust structure based on results
Ongoing: Maintenance
- Establish update triggers (policy changes, new products)
- Assign ownership for ongoing accuracy
- Schedule quarterly reviews
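A quarterly review schedule is easy to enforce with a small script that lists documents whose last review is too old. This sketch approximates a quarter as 90 days, which is an assumption; `overdue_for_review` and the sample dates are hypothetical.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is approximated as 90 days.
REVIEW_INTERVAL = timedelta(days=90)


def overdue_for_review(last_reviewed: dict[str, date], today: date) -> list[str]:
    """Return the names of documents last reviewed more than 90 days ago, sorted."""
    return sorted(
        name
        for name, reviewed_on in last_reviewed.items()
        if today - reviewed_on > REVIEW_INTERVAL
    )


last_reviewed = {
    "Remote Work Policy": date(2024, 3, 15),
    "Onboarding Procedures": date(2023, 8, 1),
}
overdue = overdue_for_review(last_reviewed, today=date(2024, 6, 1))
print(overdue)
```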
Quality Metrics
Track these to ensure ongoing quality:
- Coverage: What percentage of common questions can be answered?
- Accuracy: Are retrieved answers correct?
- Freshness: How current is the information?
- Completeness: Do answers fully address questions?
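Coverage and accuracy can be computed from a small, manually graded set of test questions. The sketch below assumes each result records whether the system produced an answer and whether a reviewer judged it correct; the `summarize` function and the record layout are hypothetical, and freshness and completeness are not covered here.

```python
def summarize(results: list[dict]) -> dict[str, float]:
    """Compute coverage (share of questions answered) and accuracy
    (share of answered questions judged correct)."""
    total = len(results)
    answered = [r for r in results if r["answered"]]
    coverage = len(answered) / total if total else 0.0
    accuracy = sum(1 for r in answered if r["correct"]) / len(answered) if answered else 0.0
    return {"coverage": coverage, "accuracy": accuracy}


# Made-up grading results for four test questions.
results = [
    {"question": "How long do refunds take?", "answered": True, "correct": True},
    {"question": "Who approves remote work?", "answered": True, "correct": True},
    {"question": "Where do I submit form 1042-B?", "answered": True, "correct": False},
    {"question": "What is the travel policy?", "answered": False, "correct": False},
]
metrics = summarize(results)
print(metrics)
```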
Next Steps
- Create your document inventory spreadsheet
- Identify your top 10 most-referenced documents
- Apply the preparation checklist to one document
- Review the result with subject matter experts
- Scale the process to remaining priority documents