
RAG Preparation Guide

Retrieval-Augmented Generation (RAG) lets you build AI systems that "know" your organization's specific information. But a RAG system can only be as good as the documents you feed it.

What is RAG?

RAG combines two powerful capabilities:

  1. Retrieval: Finding relevant information from your documents
  2. Generation: Using AI to create responses based on that information

The result: An AI that can answer questions using your company's actual data, not just general internet knowledge.

The Document Preparation Checklist

Phase 1: Inventory

Before anything else, know what you have:

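| Document Type | Location | Format | Last Updated | Owner |
| --- | --- | --- | --- | --- |
| Policies | Shared Drive | PDF | 2024-01 | HR |
| Procedures | Wiki | HTML | 2023-08 | Ops |
| FAQs | Website | HTML | 2024-03 | Marketing |
| Contracts | Legal Folder | DOCX | Various | Legal |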
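If the inventory lives in a spreadsheet, a few lines of Python can load it and flag rows that need follow-up. This is a minimal sketch, assuming a CSV export whose column names (`document_type`, `location`, `format`, `last_updated`, `owner`) are hypothetical and simply mirror the inventory table above:

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime


@dataclass
class InventoryRow:
    document_type: str
    location: str
    format: str
    last_updated: str  # kept as text: "YYYY-MM" or a note such as "Various"
    owner: str


# Sample rows copied from the inventory table above.
INVENTORY_CSV = """document_type,location,format,last_updated,owner
Policies,Shared Drive,PDF,2024-01,HR
Procedures,Wiki,HTML,2023-08,Ops
FAQs,Website,HTML,2024-03,Marketing
Contracts,Legal Folder,DOCX,Various,Legal
"""


def load_inventory(text: str) -> list[InventoryRow]:
    """Parse CSV text into one InventoryRow per data line."""
    return [InventoryRow(**row) for row in csv.DictReader(io.StringIO(text))]


def rows_without_date(rows: list[InventoryRow]) -> list[InventoryRow]:
    """Return rows whose last_updated value is not a YYYY-MM date."""
    flagged = []
    for row in rows:
        try:
            datetime.strptime(row.last_updated, "%Y-%m")
        except ValueError:
            flagged.append(row)
    return flagged


rows = load_inventory(INVENTORY_CSV)
for row in rows_without_date(rows):
    print(f"Needs a review date: {row.document_type} (owner: {row.owner})")
```

Rows with a vague date such as "Various" are good candidates for a conversation with the document owner before any further preparation.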

Phase 2: Quality Assessment

Rate each document source:

  • Accuracy: Is the information still correct?
  • Completeness: Are there gaps or missing sections?
  • Clarity: Would someone new understand it?
  • Currency: When was it last reviewed?
  • Authority: Is this the official version?
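One way to make these ratings comparable across sources is to score each criterion on a fixed scale. The sketch below assumes a 1-5 scale and an unweighted average; both are illustrative choices, not part of the checklist itself:

```python
from dataclasses import dataclass, fields


@dataclass
class QualityRating:
    # Each criterion is scored 1 (poor) to 5 (excellent); the scale is an assumption.
    accuracy: int
    completeness: int
    clarity: int
    currency: int
    authority: int

    def __post_init__(self) -> None:
        for f in fields(self):
            value = getattr(self, f.name)
            if not 1 <= value <= 5:
                raise ValueError(f"{f.name} must be between 1 and 5, got {value}")

    def average(self) -> float:
        """Unweighted mean of the five criteria."""
        values = [getattr(self, f.name) for f in fields(self)]
        return sum(values) / len(values)

    def weakest(self) -> str:
        """Name of the lowest-scoring criterion (first one wins on ties)."""
        return min(fields(self), key=lambda f: getattr(self, f.name)).name


rating = QualityRating(accuracy=5, completeness=4, clarity=3, currency=2, authority=5)
print(rating.average(), rating.weakest())
```

Knowing the weakest criterion tells you what kind of work a source needs: a low currency score calls for a review, a low clarity score calls for a rewrite.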

Phase 3: Format Standardization

RAG systems work best with clean, structured text. Prepare for conversion:

Good formats for RAG:

  • Markdown (.md)
  • Plain text (.txt)
  • Well-structured HTML
  • Clean Word documents

Challenging formats:

  • Scanned PDFs (need OCR)
  • Complex spreadsheets
  • Presentations with minimal text
  • Images with embedded text
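A first-pass triage by file extension can help estimate how much conversion work lies ahead. This is only a rough sketch: the extension lists and the bucket names are assumptions, and an extension cannot tell you whether HTML is well structured or whether a PDF is scanned, so PDFs are routed to manual inspection:

```python
from pathlib import Path

# Extension lists are illustrative; adjust them to your own document estate.
READY = {".md", ".txt", ".html", ".htm", ".docx"}
NEEDS_WORK = {".xlsx", ".xls", ".pptx", ".ppt", ".png", ".jpg", ".jpeg", ".tiff"}


def triage(path: str) -> str:
    """Return a coarse preparation bucket based on the file extension."""
    suffix = Path(path).suffix.lower()
    if suffix in READY:
        return "ready"
    if suffix == ".pdf":
        return "inspect"  # may be text-based or scanned; needs a manual look
    if suffix in NEEDS_WORK:
        return "needs-work"
    return "unknown"


for name in ["handbook.MD", "scan.pdf", "budget.xlsx", "archive.zip"]:
    print(name, "->", triage(name))
```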

Content Structuring Best Practices

Use Clear Headings

Before:

Our refund policy is as follows. Customers have 30 days to return items. Items must be unused. Refunds are processed in 5-7 business days.

After:

Refund Policy

Return Window

Customers have 30 days from purchase to return items.

Condition Requirements

Items must be unused and in original packaging.

Processing Time

Refunds are processed within 5-7 business days.

Include Context

Documents often assume knowledge that AI won't have. Add context:

Before:

Submit form 1042-B to the third floor.

After:

Submit form 1042-B (Vendor Registration Request) to the Procurement Department on the third floor of the Main Office building.

Remove Ambiguity

Before:

Contact the team for more information.

After:

Contact the Customer Support team at support@company.com or 555-123-4567 for more information.

Document Types and Preparation Notes

Policies and Procedures

Key considerations:

  • Include effective dates
  • Note approval authority
  • Add revision history
  • Cross-reference related policies

Structure suggestion:

# Policy Name
## Purpose
## Scope (Who this applies to)
## Policy Statement
## Procedures
## Exceptions
## Related Policies
## Revision History

FAQs and Knowledge Base

Key considerations:

  • One question per section
  • Include common variations of questions
  • Provide complete, standalone answers
  • Update based on actual user questions

Structure suggestion:

## Question: [Exact question as users ask it]

**Short Answer:** [1-2 sentence response]

**Detailed Answer:** [Full explanation]

**Related Questions:** [Links to similar topics]
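If your FAQs are stored as structured data, the template above can be filled in programmatically so that every entry has the same shape. A minimal sketch, where `render_faq` and its parameters are hypothetical names:

```python
def render_faq(question: str, short_answer: str, detailed_answer: str,
               related: tuple[str, ...] = ()) -> str:
    """Render one FAQ entry using the structure suggested above."""
    lines = [
        f"## Question: {question}",
        "",
        f"**Short Answer:** {short_answer}",
        "",
        f"**Detailed Answer:** {detailed_answer}",
    ]
    if related:  # omit the section entirely when there is nothing to link
        lines += ["", "**Related Questions:** " + ", ".join(related)]
    return "\n".join(lines)


print(render_faq(
    "How long do I have to return an item?",
    "You have 30 days from purchase.",
    "Customers have 30 days from purchase to return items. "
    "Items must be unused and in original packaging.",
    related=("How long do refunds take?",),
))
```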

Technical Documentation

Key considerations:

  • Define acronyms on first use
  • Include prerequisites
  • Provide context for steps
  • Note common errors and solutions
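The acronym check in particular is easy to automate. The sketch below is a simple heuristic, assuming that a definition always takes the form "Full Name (ABC)"; it reports acronyms that are never defined anywhere in the text, and it does not verify that the definition comes before the first use:

```python
import re


def undefined_acronyms(text: str) -> list[str]:
    """Return all-caps tokens (2-6 letters) that never appear as '(ABC)'."""
    acronyms = set(re.findall(r"\b[A-Z]{2,6}\b", text))
    defined = set(re.findall(r"\(([A-Z]{2,6})\)", text))
    return sorted(acronyms - defined)


sample = (
    "Retrieval-Augmented Generation (RAG) needs OCR for scanned files. "
    "RAG works best with clean text."
)
print(undefined_acronyms(sample))
```

Expect some false positives, such as words written in capitals for emphasis, so treat the output as a review list and not as a list of errors.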

Training Materials

Key considerations:

  • Separate reference material from exercises
  • Include learning objectives
  • Provide standalone definitions
  • Note prerequisites and sequence

Chunking Strategy

RAG systems break documents into "chunks" for retrieval. How you structure content affects chunking:

| Content Type | Ideal Chunk Size | Reason |
| --- | --- | --- |
| FAQs | 1 Q&A pair | Self-contained answers |
| Procedures | 1 section | Complete task context |
| Policies | 1 subsection | Focused policy statements |
| Reference | 300-500 words | Enough context without overflow |

Natural Chunk Boundaries

Use these as natural breaking points:

  • Major headings (H2, H3)
  • Complete procedures or processes
  • Individual policy sections
  • Single topics or concepts

Metadata Matters

Good metadata helps retrieval accuracy:

| Metadata Field | Purpose | Example |
| --- | --- | --- |
| Title | Quick identification | "Remote Work Policy" |
| Category | Filtering | "HR Policies" |
| Last Updated | Currency check | "2024-03-15" |
| Author/Owner | Authority | "HR Department" |
| Audience | Relevance | "All Employees" |
| Keywords | Discovery | "work from home, WFH, telecommute" |

Common Preparation Mistakes

Mistake 1: Including Everything

Problem: Outdated or duplicate documents pollute results.

Solution: Curate ruthlessly; less is often more.
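Exact duplicates are the easiest to catch automatically. A minimal sketch, assuming documents are already available as plain text keyed by file name; it hashes whitespace- and case-normalized text, so it finds copies that differ only in spacing or capitalization, not near-duplicates with edited wording:

```python
import hashlib
from collections import defaultdict


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_duplicates(docs: dict[str, str]) -> list[list[str]]:
    """Group document names that share the same fingerprint."""
    groups: dict[str, list[str]] = defaultdict(list)
    for name, text in docs.items():
        groups[fingerprint(text)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


docs = {
    "refunds.md": "Refunds are processed within 5-7 business days.",
    "refunds_copy.md": "refunds are  processed within 5-7 business days.\n",
    "returns.md": "Customers have 30 days from purchase to return items.",
}
print(find_duplicates(docs))
```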

Mistake 2: Keeping Formatting

Problem: Complex formatting breaks during extraction.

Solution: Simplify to clean text with basic structure.

Mistake 3: Ignoring Context

Problem: Documents reference things without explanation.

Solution: Make each document standalone when possible.

Mistake 4: Skipping Review

Problem: Errors in source documents become AI "facts".

Solution: Verify accuracy before ingestion.

Mistake 5: Set and Forget

Problem: Information becomes stale.

Solution: Establish regular review cycles.
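A review cycle is easier to keep when overdue documents surface automatically. The sketch below assumes each document has a recorded last-review date and uses a 90-day limit as a stand-in for a quarterly cadence; the function names and the threshold are illustrative:

```python
from datetime import date, timedelta


def is_stale(last_reviewed: date, today: date, max_age_days: int = 90) -> bool:
    """True if more than max_age_days have passed since the last review."""
    return today - last_reviewed > timedelta(days=max_age_days)


def overdue(reviews: dict[str, date], today: date, max_age_days: int = 90) -> list[str]:
    """Names of documents past their review window, oldest review first."""
    stale = [name for name, d in reviews.items() if is_stale(d, today, max_age_days)]
    return sorted(stale, key=lambda name: reviews[name])


reviews = {
    "Remote Work Policy": date(2024, 3, 15),
    "Expense Procedure": date(2023, 8, 1),
    "Product FAQ": date(2024, 5, 20),
}
print(overdue(reviews, today=date(2024, 6, 30)))
```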

Implementation Roadmap

Weeks 1-2: Discovery

  • Complete document inventory
  • Identify document owners
  • Assess current quality

Weeks 3-4: Prioritization

  • Score documents by importance and quality
  • Identify quick wins (high value, good quality)
  • Flag documents needing significant work
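The prioritization step can be sketched as a simple two-score sort. This assumes each document already has an importance score and a quality score on a 1-5 scale; the scale, the threshold of 4, and the bucket names are illustrative assumptions:

```python
from typing import NamedTuple


class ScoredDoc(NamedTuple):
    name: str
    importance: int  # 1 (low) to 5 (high); hypothetical scale
    quality: int     # 1 (poor) to 5 (excellent); hypothetical scale


def bucket(doc: ScoredDoc, threshold: int = 4) -> str:
    """Classify a document by importance and current quality."""
    if doc.importance >= threshold and doc.quality >= threshold:
        return "quick win"   # high value, already in good shape
    if doc.importance >= threshold:
        return "needs work"  # high value, significant cleanup required
    return "later"


def prioritize(docs: list[ScoredDoc], threshold: int = 4) -> list[tuple[str, str]]:
    """Return (bucket, name) pairs with quick wins first."""
    order = {"quick win": 0, "needs work": 1, "later": 2}
    pairs = [(bucket(d, threshold), d.name) for d in docs]
    return sorted(pairs, key=lambda pair: (order[pair[0]], pair[1]))


docs = [
    ScoredDoc("Old memo", importance=2, quality=3),
    ScoredDoc("Refund Policy", importance=5, quality=4),
    ScoredDoc("Onboarding Guide", importance=5, quality=2),
]
for label, name in prioritize(docs):
    print(f"{label}: {name}")
```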

Weeks 5-8: Preparation

  • Clean and restructure priority documents
  • Add missing context and metadata
  • Convert to preferred formats

Weeks 9-10: Validation

  • Review prepared documents for accuracy
  • Test sample retrievals
  • Adjust structure based on results

Ongoing: Maintenance

  • Establish update triggers (policy changes, new products)
  • Assign ownership for ongoing accuracy
  • Schedule quarterly reviews

Quality Metrics

Track these to ensure ongoing quality:

  • Coverage: What percentage of common questions can be answered?
  • Accuracy: Are retrieved answers correct?
  • Freshness: How current is the information?
  • Completeness: Do answers fully address questions?
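Coverage and accuracy can be computed from a small, manually reviewed set of test questions. The sketch below assumes a reviewer has marked each question as answered or not, and each answer as correct or not; the record layout is hypothetical, and freshness and completeness would need their own inputs:

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    question: str
    answered: bool         # did the system return an answer at all?
    correct: bool = False  # reviewer judgment; only meaningful if answered


def coverage(results: list[EvalResult]) -> float:
    """Share of test questions that received an answer."""
    return sum(r.answered for r in results) / len(results) if results else 0.0


def accuracy(results: list[EvalResult]) -> float:
    """Share of answered questions that a reviewer judged correct."""
    answered = [r for r in results if r.answered]
    return sum(r.correct for r in answered) / len(answered) if answered else 0.0


results = [
    EvalResult("How long is the return window?", answered=True, correct=True),
    EvalResult("How long do refunds take?", answered=True, correct=True),
    EvalResult("Where do I submit form 1042-B?", answered=True, correct=False),
    EvalResult("Who approves remote work?", answered=False),
]
print(f"coverage={coverage(results):.0%}, accuracy={accuracy(results):.0%}")
```

Rerunning the same question set after each round of document cleanup shows whether the preparation work is actually improving results.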

Next Steps

  1. Create your document inventory spreadsheet
  2. Identify your top 10 most-referenced documents
  3. Apply the preparation checklist to one document
  4. Review the result with subject matter experts
  5. Scale the process to remaining priority documents