7 Hard-Won Lessons from Building Production RAG Systems
RAG demos are easy. Production RAG systems are hard. Here's what we learned building them for real businesses.
Introduction
Retrieval-Augmented Generation (RAG) is having a moment. The pitch is compelling: give an LLM access to your documents, and suddenly it can answer questions about your specific data.
The demos are everywhere: "Ask questions about your PDFs!" "Chat with your documentation!" "AI assistant for your knowledge base!"
But then you try to ship a RAG system to production, and reality hits hard.
This post shares 7 lessons we learned the hard way building RAG systems that handle real workloads for real businesses.
Lesson 1: Chunking Strategy Matters More Than You Think
The naive approach: Split documents every 500 tokens. Use a sliding window. Done.
What we learned: This works for demos. It fails for production.
Better approach:
- Semantic chunking: Split at logical boundaries (sections, paragraphs)
- Hierarchical chunking: Keep document structure (this section is inside that chapter)
- Metadata-rich chunks: Every chunk knows its source, author, date, section hierarchy
Real example: A legal document chunked at arbitrary token boundaries might split a clause across chunks. Your retrieval will return incomplete context, and the LLM will generate nonsense.
What we do now:
```python
def chunk_document(doc):
    # Parse document structure
    sections = parse_structure(doc)

    # Create chunks respecting section boundaries
    chunks = []
    for section in sections:
        # Include parent context in metadata
        chunk = {
            "content": section.text,
            "metadata": {
                "document_id": doc.id,
                "section_path": section.hierarchy,
                "section_title": section.title,
                "document_date": doc.date,
                # Critical: include parent context
                "parent_context": section.parent.summary,
            },
        }
        chunks.append(chunk)
    return chunks
```
Lesson 2: Retrieval Quality is Your Real Bottleneck
Common assumption: "Better embeddings = better RAG system."
Reality: Embeddings are just one piece. The whole retrieval pipeline matters.
Key factors:
- Query preprocessing: Reformulate user questions before retrieval
- Hybrid search: Combine semantic (embeddings) with keyword search
- Reranking: Use a cross-encoder to rerank top results
- Metadata filtering: Let users filter by date, source, type
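Hybrid search needs a way to merge the keyword and semantic result lists into one ranking. Here's a minimal sketch using reciprocal rank fusion (RRF), a common fusion method; the chunk IDs are hypothetical:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one.

    rankings: ranked lists (best first), e.g. one from keyword
    search (BM25) and one from embedding similarity. k dampens
    the weight of top ranks; 60 is a common default.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["c3", "c1", "c7"]   # hypothetical BM25 results
semantic_hits = ["c1", "c2", "c3"]  # hypothetical embedding results
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# "c1" comes out on top: it ranks high in both lists
```

RRF's appeal is that it only needs ranks, not scores, so you never have to normalize BM25 scores against cosine similarities.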
Real numbers from our healthcare RAG project:
- Embeddings alone: 67% relevance
- Adding query preprocessing: 74% relevance
- Adding hybrid search: 81% relevance
- Adding reranking: 89% relevance
The pipeline:
User Query → Query Expansion → Hybrid Search → Reranking → Top K Chunks → LLM
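Wired together, the pipeline is mostly glue. A backend-agnostic sketch where every stage is injected as a callable (the stage names are placeholders, not any library's API):

```python
def answer_query(query, expand, search, rerank, generate, top_k=5):
    """Glue for the pipeline above; swap in your own stages."""
    expanded = expand(query)            # query expansion / reformulation
    candidates = search(expanded)       # hybrid keyword + semantic search
    ranked = rerank(query, candidates)  # cross-encoder reranking
    context = ranked[:top_k]            # keep only the top K chunks
    return generate(query, context)     # LLM answers from that context

# Smoke test with trivial stand-in stages:
result = answer_query(
    "q",
    expand=lambda q: q,
    search=lambda q: ["a", "b", "c"],
    rerank=lambda q, chunks: list(reversed(chunks)),
    generate=lambda q, ctx: ctx,
    top_k=2,
)
```

Keeping the stages decoupled like this also makes it easy to A/B test one stage (say, a new reranker) without touching the rest.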
Lesson 3: Citations Aren't Optional
The temptation: Just return the LLM's answer. It's based on the documents, right?
Why this fails: Users don't trust answers they can't verify.
What works: Every statement links back to source documents.
Implementation:
- Track which chunks contributed to each part of the answer
- Return chunk IDs alongside the generated text
- Provide UI to click through to original documents
Example response:
```json
{
  "answer": "The Q4 revenue was $2.3M [1], up 15% from Q3 [2].",
  "sources": [
    {
      "id": 1,
      "document": "Q4_earnings_report.pdf",
      "page": 3,
      "snippet": "...revenue reached $2.3M..."
    },
    {
      "id": 2,
      "document": "Q3_earnings_report.pdf",
      "page": 2,
      "snippet": "...revenue of $2.0M..."
    }
  ]
}
```
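Assembling a response like this is straightforward once you track which chunks were sent to the model. A minimal sketch; the field names mirror the JSON example but are otherwise illustrative:

```python
def build_cited_response(answer_text, used_chunks):
    """Pair a generated answer with the chunks it cites.

    used_chunks: retrieved chunks (dicts) in the order their IDs
    appear as [1], [2], ... markers in answer_text.
    """
    sources = [
        {
            "id": i,
            "document": chunk["document"],
            "page": chunk["page"],
            "snippet": chunk["snippet"],
        }
        for i, chunk in enumerate(used_chunks, start=1)
    ]
    return {"answer": answer_text, "sources": sources}
```

The hard part isn't this function; it's disciplining the prompt (Lesson 5) so the model emits [n] markers that actually correspond to the chunks you passed in.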
Trust improves dramatically with citations. In our healthcare project, physician trust went from 62% to 94% once we added source citations.
Lesson 4: Your Data is Messier Than You Think
What you imagine: Clean, well-formatted documents with clear structure.
What you get:
- PDFs that are actually scanned images (OCR needed)
- Word docs with formatting soup
- Spreadsheets with business logic embedded in formulas
- HTML pages with broken formatting
- Documents with mixed languages
- Files with corrupted metadata
The solution: Invest heavily in document preprocessing.
Our preprocessing pipeline:
- Format detection: Identify file type and handle accordingly
- OCR: For image-based PDFs (use Azure Document Intelligence or AWS Textract)
- Structure extraction: Tables, lists, headings
- Quality checks: Flag documents that may need manual review
- Metadata extraction: Dates, authors, categories
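The first step, format detection, can start as a simple dispatch table (real pipelines also sniff magic bytes, since extensions lie). The handler names here are hypothetical:

```python
from pathlib import Path

# Map file extensions to preprocessing handlers (illustrative names).
HANDLERS = {
    ".pdf": "pdf_or_ocr",     # route to OCR if pages are scanned images
    ".docx": "word_extract",
    ".xlsx": "spreadsheet_extract",
    ".html": "html_clean",
}

def route_document(path):
    """Pick a handler for a file, flagging unknown types for manual review."""
    handler = HANDLERS.get(Path(path).suffix.lower())
    if handler is None:
        return {"path": path, "handler": None, "needs_review": True}
    return {"path": path, "handler": handler, "needs_review": False}
```

The `needs_review` flag feeds the quality-check step: anything the pipeline can't confidently handle gets queued for a human instead of silently producing garbage chunks.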
Budget for this: Document preprocessing typically takes 30-40% of project time.
Lesson 5: Prompt Engineering for RAG is Different
Standard LLM prompting: "You are a helpful assistant..."
RAG prompting requires:
- Clear instructions on how to use retrieved context
- Guidelines for handling conflicting information
- Instructions for citing sources
- Guidance on when to say "I don't know"
A prompt template that works:
```
You are an AI assistant with access to company documents.

CONTEXT:
{retrieved_chunks}

USER QUESTION:
{user_question}

INSTRUCTIONS:
1. Answer ONLY using information from the provided context
2. If the context doesn't contain relevant information, say "I don't have enough information to answer that"
3. Cite sources using [1], [2], etc. matching the chunk IDs
4. If chunks contain conflicting information, acknowledge this and present both viewpoints
5. Be precise - don't make assumptions beyond what the context states

ANSWER:
```
Critical addition: Few-shot examples showing good and bad answers.
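Filling the template programmatically matters too: number each chunk on the way in, so the model's [n] citations map back to sources. A minimal sketch with the instructions abbreviated:

```python
RAG_PROMPT = """You are an AI assistant with access to company documents.

CONTEXT:
{context}

USER QUESTION:
{question}

INSTRUCTIONS:
1. Answer ONLY using the provided context
2. If the context is insufficient, say "I don't have enough information to answer that"
3. Cite sources using [1], [2], etc. matching the chunk IDs

ANSWER:"""

def build_prompt(question, chunks):
    """Prefix each chunk with its citation ID before formatting."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return RAG_PROMPT.format(context=context, question=question)
```

Because the IDs are assigned here, the same list of chunks can later be used to resolve the model's [n] markers into the sources array from Lesson 3.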
Lesson 6: Evaluation is Harder Than Building
Challenge: How do you know if your RAG system is working well?
Naive approach: "Try some questions and see if answers look good."
Production approach: Systematic evaluation across multiple dimensions.
Metrics we track:
Retrieval quality:
- Precision@K (are the top K results relevant?)
- Recall@K (did we find all relevant docs?)
- MRR (Mean Reciprocal Rank)
Generation quality:
- Groundedness (is the answer supported by the context?)
- Relevance (does the answer address the question?)
- Citation accuracy (were the cited sources actually used?)
User satisfaction:
- Thumbs up/down on answers
- Follow-up question rate
- Correction rate
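The retrieval metrics are easy to compute once each eval question has a labeled set of relevant chunks. A sketch of Precision@K and MRR:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k

def mean_reciprocal_rank(results):
    """results: list of (retrieved_ids, relevant_id_set) pairs, one per query.

    For each query, score 1/rank of the first relevant hit (0 if none).
    """
    total = 0.0
    for retrieved, relevant in results:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)
```

Generation-side metrics like groundedness are harder to score mechanically; that's where an LLM-as-judge tool such as RAGAS earns its keep.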
The eval dataset:
- Start with 50-100 hand-crafted question/answer pairs
- Add real user questions (especially ones users rated poorly)
- Continuously expand based on edge cases
Tools:
- RAGAS (open-source RAG evaluation)
- Phoenix (observability for LLM apps)
- LangSmith (if you're using LangChain)
Lesson 7: Costs Can Spiral Fast
The demo: Process 100 documents, answer 20 questions, spend $5.
Production: Process 100K documents, answer 10K questions/day...
Cost breakdown for a typical RAG system:
- Embeddings: $X per 1M tokens (one-time for new docs, ongoing for queries)
- Vector database: Storage + queries
- LLM generation: $Y per 1M tokens (biggest variable cost)
- Reranking models: If using commercial APIs
Optimization strategies:
Cache aggressively:
- Cache query embeddings for common questions
- Cache full responses for frequent queries
- Use semantic similarity to serve cached answers for similar questions
Smart model selection:
- Use smaller models for simple queries
- Route complex queries to larger models
- Use fine-tuned models for high-volume tasks
Optimize context windows:
- Don't send more chunks than needed (typical: 3-5 chunks)
- Experiment with chunk size vs. retrieval count
- Use summarization for very long documents
Real numbers: With optimization, we typically reduce costs by 60-70% vs. initial implementation.
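The semantic-similarity cache from the first strategy can be sketched in a few lines. The embedding function is injected (any model works), and the 0.95 threshold is a starting point to tune per workload, not a universal constant:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Serve a cached answer when a new query embeds close to an old one."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # embedding function: str -> vector
        self.threshold = threshold
        self.entries = []           # (embedding, answer) pairs

    def get(self, query):
        query_vec = self.embed(query)
        for entry_vec, answer in self.entries:
            if cosine(query_vec, entry_vec) >= self.threshold:
                return answer
        return None  # miss: run the full pipeline, then put()

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

The linear scan is fine for a small cache; at scale you'd store the entries in the vector database you already run and reuse its nearest-neighbor search.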
Bonus: What Good RAG Systems Have
Based on our production deployments, here's the full stack:
Infrastructure:
- Robust document ingestion pipeline
- Vector database with backup and monitoring
- LLM observability (LangSmith, Phoenix, or similar)
Features:
- Semantic + keyword hybrid search
- Metadata filtering
- Citation tracking
- User feedback collection
- A/B testing framework
Operations:
- Retrieval quality monitoring
- Cost tracking and alerting
- Regular evaluation on test sets
- Incident response playbook
Conclusion
RAG is powerful, but production RAG systems require:
- Thoughtful chunking strategy
- Sophisticated retrieval pipeline
- Mandatory source citations
- Heavy investment in data preprocessing
- Specialized prompt engineering
- Systematic evaluation
- Aggressive cost optimization
The good news: once you get these pieces right, RAG systems can handle queries that would be impossible with traditional search or pure LLMs.
Want help building a production-ready RAG system? Let's talk.