RAG
LLMs
Production AI
Lessons Learned

7 Hard-Won Lessons from Building Production RAG Systems

RAG demos are easy. Production RAG systems are hard. Here's what we learned building them for real businesses.

AppVision Team
January 20, 2024
7 min read

Introduction

Retrieval-Augmented Generation (RAG) is having a moment. The pitch is compelling: give an LLM access to your documents, and suddenly it can answer questions about your specific data.

The demos are everywhere: "Ask questions about your PDFs!" "Chat with your documentation!" "AI assistant for your knowledge base!"

But then you try to ship a RAG system to production, and reality hits hard.

This post shares 7 lessons we learned the hard way building RAG systems that handle real workloads for real businesses.

Lesson 1: Chunking Strategy Matters More Than You Think

The naive approach: Split documents every 500 tokens. Use a sliding window. Done.

What we learned: This works for demos. It fails for production.

Better approach:

  • Semantic chunking: Split at logical boundaries (sections, paragraphs)
  • Hierarchical chunking: Keep document structure (this section is inside that chapter)
  • Metadata-rich chunks: Every chunk knows its source, author, date, section hierarchy

Real example: A legal document chunked at arbitrary token boundaries might split a clause across chunks. Your retrieval will return incomplete context, and the LLM will generate nonsense.

What we do now:

def chunk_document(doc):
    # Parse document structure (parse_structure is our internal helper)
    sections = parse_structure(doc)

    # Create chunks respecting section boundaries
    chunks = []
    for section in sections:
        # Include parent context in metadata
        chunk = {
            "content": section.text,
            "metadata": {
                "document_id": doc.id,
                "section_path": section.hierarchy,
                "section_title": section.title,
                "document_date": doc.date,
                # Critical: include parent context (top-level sections have none)
                "parent_context": section.parent.summary if section.parent else None
            }
        }
        chunks.append(chunk)

    return chunks

Lesson 2: Retrieval Quality is Your Real Bottleneck

Common assumption: "Better embeddings = better RAG system."

Reality: Embeddings are just one piece. The whole retrieval pipeline matters.

Key factors:

  1. Query preprocessing: Reformulate user questions before retrieval
  2. Hybrid search: Combine semantic (embeddings) with keyword search
  3. Reranking: Use a cross-encoder to rerank top results
  4. Metadata filtering: Let users filter by date, source, type

Real numbers from our healthcare RAG project:

  • Embeddings alone: 67% relevance
  • + Query preprocessing: 74% relevance
  • + Hybrid search: 81% relevance
  • + Reranking: 89% relevance

The pipeline:

User Query → Query Expansion → Hybrid Search → Reranking → Top K Chunks → LLM
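The "Hybrid Search" step merges two independent rankings, one from keyword search and one from embedding search. One simple way to do that merge, sketched below with hypothetical chunk IDs, is reciprocal rank fusion: each chunk scores 1 / (k + rank) in every list it appears in, and the scores are summed.

```python
def reciprocal_rank_fusion(keyword_ranked, semantic_ranked, k=60):
    """Merge two ranked lists of chunk IDs into one hybrid ranking.

    Each chunk contributes 1 / (k + rank) for every list it appears in;
    chunks that rank well in BOTH lists float to the top.
    """
    scores = {}
    for ranked in (keyword_ranked, semantic_ranked):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "c2" appears near the top of both lists, so it wins the fused ranking.
fused = reciprocal_rank_fusion(["c1", "c2", "c3"], ["c2", "c4", "c1"])
```

The constant k=60 is the conventional default; it damps the advantage of a single #1 ranking so that agreement between the two retrievers matters more than either one alone.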

Lesson 3: Citations Aren't Optional

The temptation: Just return the LLM's answer. It's based on the documents, right?

Why this fails: Users don't trust answers they can't verify.

What works: Every statement links back to source documents.

Implementation:

  • Track which chunks contributed to each part of the answer
  • Return chunk IDs alongside the generated text
  • Provide UI to click through to original documents

Example response:

{
  "answer": "The Q4 revenue was $2.3M [1], up 15% from Q3 [2].",
  "sources": [
    {
      "id": 1,
      "document": "Q4_earnings_report.pdf",
      "page": 3,
      "snippet": "...revenue reached $2.3M..."
    },
    {
      "id": 2,
      "document": "Q3_earnings_report.pdf",
      "page": 2,
      "snippet": "...revenue of $2.0M..."
    }
  ]
}
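A response like the one above can be assembled by parsing the [n] markers out of the generated answer and keeping only the chunks the model actually cited. A minimal sketch, assuming you track a mapping from citation number to chunk metadata for everything sent to the LLM:

```python
import re

def attach_citations(answer, chunks):
    """Return the answer plus only the sources it actually cites.

    `chunks` maps citation number -> chunk metadata (document, page,
    snippet) for every chunk sent to the LLM.
    """
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", answer)}
    sources = [dict(id=i, **chunks[i]) for i in sorted(cited) if i in chunks]
    return {"answer": answer, "sources": sources}

chunks = {
    1: {"document": "Q4_earnings_report.pdf", "page": 3, "snippet": "...revenue reached $2.3M..."},
    2: {"document": "Q3_earnings_report.pdf", "page": 2, "snippet": "...revenue of $2.0M..."},
    3: {"document": "notes.pdf", "page": 1, "snippet": "retrieved but never cited"},
}
response = attach_citations("The Q4 revenue was $2.3M [1], up 15% from Q3 [2].", chunks)
```

Filtering to cited chunks keeps the sources list honest: a chunk that was retrieved but never used doesn't show up as evidence.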

Trust improves dramatically with citations. In our healthcare project, physician trust went from 62% to 94% once we added source citations.

Lesson 4: Your Data is Messier Than You Think

What you imagine: Clean, well-formatted documents with clear structure.

What you get:

  • PDFs that are actually scanned images (OCR needed)
  • Word docs with formatting soup
  • Spreadsheets with business logic embedded in formulas
  • HTML pages with broken formatting
  • Documents with mixed languages
  • Files with corrupted metadata

The solution: Invest heavily in document preprocessing.

Our preprocessing pipeline:

  1. Format detection: Identify file type and handle accordingly
  2. OCR: For image-based PDFs (use Azure Document Intelligence or AWS Textract)
  3. Structure extraction: Tables, lists, headings
  4. Quality checks: Flag documents that may need manual review
  5. Metadata extraction: Dates, authors, categories

Budget for this: Document preprocessing typically takes 30-40% of project time.
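Step 1 of the pipeline, format detection, can be as simple as a dispatch table keyed on file extension, with unknown formats flagged for the manual-review queue from step 4 rather than silently dropped. The handler names below are placeholders, not real libraries:

```python
from pathlib import Path

# Placeholder handler names; real ones would wrap a PDF parser,
# an OCR service, python-docx, an HTML cleaner, etc.
HANDLERS = {
    ".pdf":  "pdf_text_or_ocr",   # fall back to OCR if no text layer
    ".docx": "docx_extractor",
    ".html": "html_cleaner",
    ".xlsx": "spreadsheet_flattener",
}

def route_document(path):
    """Pick a preprocessing handler by file type.

    Unknown formats are flagged for manual review instead of
    failing silently.
    """
    suffix = Path(path).suffix.lower()
    handler = HANDLERS.get(suffix)
    return {"path": path, "handler": handler, "needs_review": handler is None}

routed = route_document("reports/q4_earnings.PDF")
```

The explicit needs_review flag is the important part: every document that falls through the dispatch table becomes a visible work item instead of a gap in your index.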

Lesson 5: Prompt Engineering for RAG is Different

Standard LLM prompting: "You are a helpful assistant..."

RAG prompting requires:

  • Clear instructions on how to use retrieved context
  • Guidelines for handling conflicting information
  • Instructions for citing sources
  • Guidance on when to say "I don't know"

A prompt template that works:

You are an AI assistant with access to company documents.

CONTEXT:
{retrieved_chunks}

USER QUESTION:
{user_question}

INSTRUCTIONS:
1. Answer ONLY using information from the provided context
2. If the context doesn't contain relevant information, say "I don't have enough information to answer that"
3. Cite sources using [1], [2], etc. matching the chunk IDs
4. If chunks contain conflicting information, acknowledge this and present both viewpoints
5. Be precise - don't make assumptions beyond what the context states

ANSWER:

Critical addition: Few-shot examples showing good and bad answers.
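Those few-shot examples can live in code and be rendered into the prompt template above. The Q/A pairs below are hypothetical; in practice they should come from your eval set, and should include at least one example of a correct refusal:

```python
# Hypothetical few-shot pairs; real ones come from your eval set.
FEW_SHOT = [
    {
        # Good: the model refuses when context lacks the answer.
        "question": "What was headcount in 2019?",
        "answer": "I don't have enough information to answer that.",
    },
    {
        # Good: the answer cites its source chunk.
        "question": "What is the refund window?",
        "answer": "Refunds are accepted within 30 days of purchase [1].",
    },
]

def render_few_shot(examples):
    """Format Q/A pairs as an EXAMPLES block appended to the RAG prompt."""
    lines = ["EXAMPLES:"]
    for ex in examples:
        lines.append(f"Q: {ex['question']}")
        lines.append(f"A: {ex['answer']}")
    return "\n".join(lines)

block = render_few_shot(FEW_SHOT)
```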

Lesson 6: Evaluation is Harder Than Building

Challenge: How do you know if your RAG system is working well?

Naive approach: "Try some questions and see if answers look good."

Production approach: Systematic evaluation across multiple dimensions.

Metrics we track:

  1. Retrieval Quality

    • Precision@K (are top K results relevant?)
    • Recall@K (did we find all relevant docs?)
    • MRR (Mean Reciprocal Rank)
  2. Generation Quality

    • Groundedness (answer supported by context?)
    • Relevance (answer addresses the question?)
    • Citation accuracy (sources actually used?)
  3. User Satisfaction

    • Thumbs up/down on answers
    • Follow-up question rate
    • Correction rate
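The two retrieval metrics are cheap to compute once you have labeled relevant chunks per query. A minimal sketch with toy chunk IDs:

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved chunk IDs that are relevant."""
    return sum(1 for c in retrieved[:k] if c in relevant) / k

def mean_reciprocal_rank(runs):
    """Average of 1/rank of the first relevant result per query.

    `runs` is a list of (retrieved_ids, relevant_ids) pairs.
    """
    total = 0.0
    for retrieved, relevant in runs:
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant:
                total += 1.0 / rank
                break
    return total / len(runs)

# First relevant hit at rank 1 and rank 2 -> MRR = (1 + 0.5) / 2 = 0.75
mrr = mean_reciprocal_rank([
    (["a", "b", "c"], {"a"}),
    (["x", "y", "z"], {"y"}),
])
p = precision_at_k(["a", "b", "c", "d"], {"a", "c"}, k=2)  # 1 of top 2 relevant
```

Frameworks like RAGAS package these (plus the generation-quality metrics) for you, but knowing what the numbers mean makes the dashboards far easier to debug.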

The eval dataset:

  • Start with 50-100 hand-crafted question/answer pairs
  • Add real user questions (especially ones users rated poorly)
  • Continuously expand based on edge cases

Tools:

  • RAGAS (open-source RAG evaluation)
  • Phoenix (observability for LLM apps)
  • LangSmith (if you're using LangChain)

Lesson 7: Costs Can Spiral Fast

The demo: Process 100 documents, answer 20 questions, spend $5.

Production: Process 100K documents, answer 10K questions/day...

Cost breakdown for a typical RAG system:

  • Embeddings: $X per 1M tokens (one-time for new docs, ongoing for queries)
  • Vector database: Storage + queries
  • LLM generation: $Y per 1M tokens (biggest variable cost)
  • Reranking models: If using commercial APIs

Optimization strategies:

  1. Cache aggressively

    • Cache query embeddings for common questions
    • Cache full responses for frequent queries
    • Use semantic similarity to serve cached answers for similar questions
  2. Smart model selection

    • Use smaller models for simple queries
    • Route complex queries to larger models
    • Use fine-tuned models for high-volume tasks
  3. Optimize context windows

    • Don't send more chunks than needed (typical: 3-5 chunks)
    • Experiment with chunk size vs. retrieval count
    • Use summarization for very long documents
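The semantic-similarity cache from strategy 1 can be sketched in a few lines: store (query embedding, answer) pairs and serve a cached answer when a new query's embedding is close enough. The `embed` callable and threshold value here are stand-ins for your real embedding model and a tuned cutoff:

```python
import math

class SemanticCache:
    """Serve cached answers for queries whose embeddings are near-duplicates."""

    def __init__(self, embed, threshold=0.95):
        self.embed = embed          # stand-in for your embedding call
        self.threshold = threshold  # cosine similarity cutoff, tune on real traffic
        self.entries = []           # list of (embedding, answer) pairs

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def get(self, query):
        q = self.embed(query)
        for emb, answer in self.entries:
            if self._cosine(q, emb) >= self.threshold:
                return answer  # cache hit: skip retrieval AND generation
        return None

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))

# Toy 2-D embeddings for illustration only; use your real embedding model.
toy = {"q4 revenue?": [1.0, 0.0], "what was q4 revenue?": [0.99, 0.05]}
cache = SemanticCache(embed=lambda t: toy[t], threshold=0.9)
cache.put("q4 revenue?", "$2.3M [1]")
hit = cache.get("what was q4 revenue?")  # near-duplicate query
```

A hit here saves the full pipeline cost (embedding aside), which is why semantic caching is usually the first optimization worth shipping. Set the threshold too low, though, and users get stale answers to genuinely different questions.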

Real numbers: With optimization, we typically reduce costs by 60-70% vs. initial implementation.

Bonus: What Good RAG Systems Have

Based on our production deployments, here's the full stack:

Infrastructure:

  • Robust document ingestion pipeline
  • Vector database with backup and monitoring
  • LLM observability (LangSmith, Phoenix, or similar)

Features:

  • Semantic + keyword hybrid search
  • Metadata filtering
  • Citation tracking
  • User feedback collection
  • A/B testing framework

Operations:

  • Retrieval quality monitoring
  • Cost tracking and alerting
  • Regular evaluation on test sets
  • Incident response playbook

Conclusion

RAG is powerful, but production RAG systems require:

  • Thoughtful chunking strategy
  • Sophisticated retrieval pipeline
  • Mandatory source citations
  • Heavy investment in data preprocessing
  • Specialized prompt engineering
  • Systematic evaluation
  • Aggressive cost optimization

The good news: once you get these pieces right, RAG systems can handle queries that would be impossible with traditional search or pure LLMs.


Want help building a production-ready RAG system? Let's talk.

Ready to build your AI system?

Let's discuss how we can help you ship production AI.

Book a Call