7 Hard-Won Lessons from Building Production RAG Systems
RAG demos are easy. Production RAG systems are hard. Here's what we learned building them for real businesses.
Introduction
Retrieval-Augmented Generation (RAG) is having a moment. The pitch is compelling: give an LLM access to your documents, and suddenly it can answer questions about your specific data.
The demos are everywhere: "Ask questions about your PDFs!" "Chat with your documentation!" "AI assistant for your knowledge base!"
But then you try to ship a RAG system to production, and reality hits hard.
This post shares 7 lessons we learned the hard way building RAG systems that handle real workloads for real businesses.
Lesson 1: Chunking Strategy Matters More Than You Think
The naive approach: Split documents every 500 tokens. Use a sliding window. Done.
What we learned: This works for demos. It fails for production.
Better approach:
- Semantic chunking: Split at logical boundaries (sections, paragraphs)
- Hierarchical chunking: Keep document structure (this section is inside that chapter)
- Metadata-rich chunks: Every chunk knows its source, author, date, section hierarchy
Real example: A legal document chunked at arbitrary token boundaries might split a clause across chunks. Your retrieval will return incomplete context, and the LLM will generate nonsense.
What we do now:
```python
def chunk_document(doc):
    # Parse document structure
    sections = parse_structure(doc)

    # Create chunks respecting section boundaries
    chunks = []
    for section in sections:
        # Include parent context in metadata
        chunk = {
            "content": section.text,
            "metadata": {
                "document_id": doc.id,
                "section_path": section.hierarchy,
                "section_title": section.title,
                "document_date": doc.date,
                # Critical: include parent context
                "parent_context": section.parent.summary,
            },
        }
        chunks.append(chunk)
    return chunks
```
Lesson 2: Retrieval Quality is Your Real Bottleneck
Common assumption: "Better embeddings = better RAG system."
Reality: Embeddings are just one piece. The whole retrieval pipeline matters.
Key factors:
- Query preprocessing: Reformulate user questions before retrieval
- Hybrid search: Combine semantic (embeddings) with keyword search
- Reranking: Use a cross-encoder to rerank top results
- Metadata filtering: Let users filter by date, source, type
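Hybrid search needs a way to merge the keyword and semantic result lists into one ranking. Here's a minimal sketch using reciprocal rank fusion (RRF), a common fusion method; the chunk IDs are hypothetical:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one.

    rankings: ranked lists (best first), e.g. one from keyword
    search (BM25) and one from embedding similarity. k dampens
    the weight of top ranks; 60 is a common default.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["c3", "c1", "c7"]   # hypothetical BM25 results
semantic_hits = ["c1", "c2", "c3"]  # hypothetical embedding results
fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
# "c1" comes out on top: it ranks high in both lists
```

RRF's appeal is that it only needs ranks, not scores, so you never have to normalize BM25 scores against cosine similarities.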
Real numbers from our healthcare RAG project:
- Embeddings alone: 67% relevance
- Adding query preprocessing: 74% relevance
- Adding hybrid search: 81% relevance
- Adding reranking: 89% relevance
The pipeline:
User Query → Query Expansion → Hybrid Search → Reranking → Top K Chunks → LLM
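Wired together, the pipeline is mostly glue. A backend-agnostic sketch where every stage is injected as a callable (the stage names are placeholders, not any library's API):

```python
def answer_query(query, expand, search, rerank, generate, top_k=5):
    """Glue for the pipeline above; swap in your own stages."""
    expanded = expand(query)            # query expansion / reformulation
    candidates = search(expanded)       # hybrid keyword + semantic search
    ranked = rerank(query, candidates)  # cross-encoder reranking
    context = ranked[:top_k]            # keep only the top K chunks
    return generate(query, context)     # LLM answers from that context

# Smoke test with trivial stand-in stages:
result = answer_query(
    "q",
    expand=lambda q: q,
    search=lambda q: ["a", "b", "c"],
    rerank=lambda q, chunks: list(reversed(chunks)),
    generate=lambda q, ctx: ctx,
    top_k=2,
)
```

Keeping the stages decoupled like this also makes it easy to A/B test one stage (say, a new reranker) without touching the rest.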
Lesson 3: Citations Aren't Optional
The temptation: Just return the LLM's answer. It's based on the documents, right?
Why this fails: Users don't trust answers they can't verify.
What works: Every statement links back to source documents.
Implementation:
- Track which chunks contributed to each part of the answer
- Return chunk IDs alongside the generated text
- Provide UI to click through to original documents
Example response:
```json
{
  "answer": "The Q4 revenue was $2.3M [1], up 15% from Q3 [2].",
  "sources": [
    {
      "id": 1,
      "document": "Q4_earnings_report.pdf",
      "page": 3,
      "snippet": "...revenue reached $2.3M..."
    },
    {
      "id": 2,
      "document": "Q3_earnings_report.pdf",
      "page": 2,
      "snippet": "...revenue of $2.0M..."
    }
  ]
}
```
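Assembling a response like this is straightforward once you track which chunks were sent to the model. A minimal sketch; the field names mirror the JSON example but are otherwise illustrative:

```python
def build_cited_response(answer_text, used_chunks):
    """Pair a generated answer with the chunks it cites.

    used_chunks: retrieved chunks (dicts) in the order their IDs
    appear as [1], [2], ... markers in answer_text.
    """
    sources = [
        {
            "id": i,
            "document": chunk["document"],
            "page": chunk["page"],
            "snippet": chunk["snippet"],
        }
        for i, chunk in enumerate(used_chunks, start=1)
    ]
    return {"answer": answer_text, "sources": sources}
```

The hard part isn't this function; it's disciplining the prompt (Lesson 5) so the model emits [n] markers that actually correspond to the chunks you passed in.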
Trust improves dramatically with citations. In our healthcare project, physician trust went from 62% to 94% once we added source citations.
Lesson 4: Your Data is Messier Than You Think
What you imagine: Clean, well-formatted documents with clear structure.
What you get:
- PDFs that are actually scanned images (OCR needed)
- Word docs with formatting soup
- Spreadsheets with business logic embedded in formulas
- HTML pages with broken formatting
- Documents with mixed languages
- Files with corrupted metadata
The solution: Invest heavily in document preprocessing.
Our preprocessing pipeline:
- Format detection: Identify file type and handle accordingly
- OCR: For image-based PDFs (use Azure Document Intelligence or AWS Textract)
- Structure extraction: Tables, lists, headings
- Quality checks: Flag documents that may need manual review
- Metadata extraction: Dates, authors, categories
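The first step, format detection, can start as a simple dispatch table (real pipelines also sniff magic bytes, since extensions lie). The handler names here are hypothetical:

```python
from pathlib import Path

# Map file extensions to preprocessing handlers (illustrative names).
HANDLERS = {
    ".pdf": "pdf_or_ocr",     # route to OCR if pages are scanned images
    ".docx": "word_extract",
    ".xlsx": "spreadsheet_extract",
    ".html": "html_clean",
}

def route_document(path):
    """Pick a handler for a file, flagging unknown types for manual review."""
    handler = HANDLERS.get(Path(path).suffix.lower())
    if handler is None:
        return {"path": path, "handler": None, "needs_review": True}
    return {"path": path, "handler": handler, "needs_review": False}
```

The `needs_review` flag feeds the quality-check step: anything the pipeline can't confidently handle gets queued for a human instead of silently producing garbage chunks.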
Budget for this: Document preprocessing typically takes 30-40% of project time.
Lesson 5: Prompt Engineering for RAG is Different
Standard LLM prompting: "You are a helpful assistant..."
RAG prompting requires:
- Clear instructions on how to use retrieved context
- Guidelines for handling conflicting information
- Instructions for citing sources
- Guidance on when to say "I don't know"
A prompt template that works:
```
You are an AI assistant with access to company documents.

CONTEXT:
{retrieved_chunks}

USER QUESTION:
{user_question}

INSTRUCTIONS:
1. Answer ONLY using information from the provided context
2. If the context doesn't contain relevant information, say "I don't have enough information to answer that"
3. Cite sources using [1], [2], etc. matching the chunk IDs
4. If chunks contain conflicting information, acknowledge this and present both viewpoints
5. Be precise - don't make assumptions beyond what the context states

ANSWER:
```
Critical addition: Few-shot examples showing good and bad answers.
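Filling the template programmatically matters too: number each chunk on the way in, so the model's [n] citations map back to sources. A minimal sketch with the instructions abbreviated:

```python
RAG_PROMPT = """You are an AI assistant with access to company documents.

CONTEXT:
{context}

USER QUESTION:
{question}

INSTRUCTIONS:
1. Answer ONLY using the provided context
2. If the context is insufficient, say "I don't have enough information to answer that"
3. Cite sources using [1], [2], etc. matching the chunk IDs

ANSWER:"""

def build_prompt(question, chunks):
    """Prefix each chunk with its citation ID before formatting."""
    context = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(chunks, start=1))
    return RAG_PROMPT.format(context=context, question=question)
```

Because the IDs are assigned here, the same list of chunks can later be used to resolve the model's [n] markers into the sources array from Lesson 3.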
Lesson 6: Evaluation is Harder Than Building
Challenge: How do you know if your RAG system is working well?
Naive approach: "Try some questions and see if answers look good."
Production approach: Systematic evaluation across multiple dimensions.
Metrics we track:
Retrieval quality:
- Precision@K (are the top K results relevant?)
- Recall@K (did we find all relevant docs?)
- MRR (Mean Reciprocal Rank)
Generation quality:
- Groundedness (is the answer supported by the context?)
- Relevance (does the answer address the question?)
- Citation accuracy (were the cited sources actually used?)
User satisfaction:
- Thumbs up/down on answers
- Follow-up question rate
- Correction rate
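The retrieval metrics are easy to compute once each eval question has a labeled set of relevant chunks. A sketch of Precision@K and MRR:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k

def mean_reciprocal_rank(results):
    """results: list of (retrieved_ids, relevant_id_set) pairs, one per query.

    For each query, score 1/rank of the first relevant hit (0 if none).
    """
    total = 0.0
    for retrieved, relevant in results:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(results)
```

Generation-side metrics like groundedness are harder to score mechanically; that's where an LLM-as-judge tool such as RAGAS earns its keep.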
The eval dataset:
- Start with 50-100 hand-crafted question/answer pairs
- Add real user questions (especially ones users rated poorly)
- Continuously expand based on edge cases
Tools:
- RAGAS (open-source RAG evaluation)
- Phoenix (observability for LLM apps)
- LangSmith (if you're using LangChain)
Lesson 7: Costs Can Spiral Fast
The demo: Process 100 documents, answer 20 questions, spend $5.
Production: Process 100K documents, answer 10K questions/day...
Cost breakdown for a typical RAG system:
- Embeddings: $X per 1M tokens (one-time for new docs, ongoing for queries)
- Vector database: Storage + queries
- LLM generation: $Y per 1M tokens (biggest variable cost)
- Reranking models: If using commercial APIs
Optimization strategies:
Cache aggressively:
- Cache query embeddings for common questions
- Cache full responses for frequent queries
- Use semantic similarity to serve cached answers for similar questions
Smart model selection:
- Use smaller models for simple queries
- Route complex queries to larger models
- Use fine-tuned models for high-volume tasks
Optimize context windows:
- Don't send more chunks than needed (typical: 3-5 chunks)
- Experiment with chunk size vs. retrieval count
- Use summarization for very long documents
Real numbers: With optimization, we typically reduce costs by 60-70% vs. initial implementation.
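The semantic-similarity cache from the first strategy can be sketched in a few lines. The embedding function is injected (any model works), and the 0.95 threshold is a starting point to tune per workload, not a universal constant:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Serve a cached answer when a new query embeds close to an old one."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # embedding function: str -> vector
        self.threshold = threshold
        self.entries = []           # (embedding, answer) pairs

    def get(self, query):
        query_vec = self.embed(query)
        for entry_vec, answer in self.entries:
            if cosine(query_vec, entry_vec) >= self.threshold:
                return answer
        return None  # miss: run the full pipeline, then put()

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))
```

The linear scan is fine for a small cache; at scale you'd store the entries in the vector database you already run and reuse its nearest-neighbor search.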
Bonus: What Good RAG Systems Have
Based on our production deployments, here's the full stack:
Infrastructure:
- Robust document ingestion pipeline
- Vector database with backup and monitoring
- LLM observability (LangSmith, Phoenix, or similar)
Features:
- Semantic + keyword hybrid search
- Metadata filtering
- Citation tracking
- User feedback collection
- A/B testing framework
Operations:
- Retrieval quality monitoring
- Cost tracking and alerting
- Regular evaluation on test sets
- Incident response playbook
Conclusion
RAG is powerful, but production RAG systems require:
- Thoughtful chunking strategy
- Sophisticated retrieval pipeline
- Mandatory source citations
- Heavy investment in data preprocessing
- Specialized prompt engineering
- Systematic evaluation
- Aggressive cost optimization
The good news: once you get these pieces right, RAG systems can handle queries that would be impossible with traditional search or pure LLMs.
Want help building a production-ready RAG system? Let's talk.