What is RAG and Why Does It Matter?
RAG = Retrieval-Augmented Generation. The key to accurate, trustworthy AI answers.

Without RAG

  • LLM answers from training data only
  • Training data stops at a fixed cutoff date
  • Doesn't know your company's documents
  • Confidently makes up ("hallucinates") details
  • No citations — can't verify the answer

With RAG

  • Searches your documents for relevant passages
  • Feeds those passages into the LLM's context
  • LLM answers based only on your content
  • Cites the exact source document and section
  • Says "I don't know" when the answer isn't in your docs
The RAG pipeline in 3 steps:

1. Index: Load your documents → split into chunks → convert to vectors (embeddings) → store in a vector database
2. Retrieve: User asks a question → convert question to vector → find the most similar chunks in the database
3. Generate: Feed the retrieved chunks + the question to Claude/GPT → get a grounded, cited answer
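The three steps can be sketched without any libraries. In this toy version, bag-of-words counts stand in for real embeddings (the pipeline below uses OpenAI's text-embedding-3-small instead), but the retrieval math, similarity search over vectors, is the same:

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy 'embedding': a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str]) -> str:
    """Step 2: return the chunk most similar to the question."""
    q = embed(question)
    return max(chunks, key=lambda c: cosine(q, embed(c)))

# Step 1: "index" a few document chunks
chunks = [
    "The collision deductible is 500 dollars.",
    "Rental reimbursement covers 30 dollars per day.",
    "File a claim within 30 days of the accident.",
]

# Step 2: retrieve; Step 3 would feed this chunk plus the question to the LLM
print(retrieve("What is the collision deductible?", chunks))
```

Real embeddings capture meaning rather than exact word overlap, so "How much do I pay if I crash?" would still land on the deductible chunk; that is the whole reason to pay for an embedding model.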
Prerequisites

  • Python 3.9+ and pip
  • An OpenAI API key (used for both embeddings and the LLM)
  • A folder of PDF, text, or Markdown files to index
Step 1 — Project Setup
bash — create the project
mkdir doc-qa-agent && cd doc-qa-agent
python3 -m venv venv
source venv/bin/activate

pip install openai llama-index llama-index-embeddings-openai \
    llama-index-llms-openai pypdf python-dotenv

# Create the documents directory and add your files
mkdir documents
# Copy your PDFs or text files into the documents/ folder
.env
OPENAI_API_KEY=sk-...your-key...
Step 2 — Ingest Your Documents
This script loads all your documents, splits them into chunks, embeds them, and saves the index to disk. You run it once (or whenever your documents change).
ingest.py — run once to build the index
import os
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

load_dotenv()

# Configure embedding model and LLM
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.getenv("OPENAI_API_KEY")
)
Settings.llm = OpenAI(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY")
)

print("Loading documents from ./documents/ ...")
documents = SimpleDirectoryReader(
    input_dir="./documents",
    recursive=True,
    required_exts=[".pdf", ".txt", ".md"]
).load_data()

print(f"Loaded {len(documents)} documents (PDFs load as one document per page)")
print("Building vector index (this may take a minute)...")

index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True
)

# Save the index to disk so we don't rebuild every time
index.storage_context.persist(persist_dir="./index_storage")
print("✓ Index saved to ./index_storage")
print("\nDocuments indexed:")
sources = set(d.metadata.get('file_name', 'unknown') for d in documents)
for s in sorted(sources):
    print(f"  - {s}")

Run it: python ingest.py. For 10–20 PDF pages, this takes about 30 seconds and costs a few cents in OpenAI embedding API calls. Re-run whenever you add or update documents.
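A note on the "split into chunks" step: chunks are made to overlap slightly so a sentence that straddles a boundary stays intact in at least one chunk. LlamaIndex handles this internally (tunable via Settings.chunk_size and Settings.chunk_overlap; check the current docs for defaults). A minimal sketch of the idea, not the library's actual splitter:

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `chunk_size` words.

    Each chunk starts `chunk_size - overlap` words after the previous one,
    so the last `overlap` words of a chunk reappear at the start of the next.
    """
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

Smaller chunks retrieve more precisely but give the LLM less surrounding context; a few hundred to roughly a thousand tokens is a common range to experiment in.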

Step 3 — The Query Agent
This is the interactive agent. It loads the saved index and answers questions with citations.
agent.py — interactive Q&A agent with citations
import os
from dotenv import load_dotenv
from llama_index.core import StorageContext, load_index_from_storage, Settings
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

load_dotenv()

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.getenv("OPENAI_API_KEY")
)
Settings.llm = OpenAI(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY"),
    system_prompt="""You are a helpful document assistant. Answer questions 
based ONLY on the provided context. If the answer is not in the context, 
say "I couldn't find information about this in the documents." 
Always cite the source document and section when answering.
Format citations as: [Source: filename, page X]"""
)

# Load the pre-built index from disk
print("Loading document index...")
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)

# Configure retriever — top_k=5 means retrieve 5 most relevant chunks
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)

# Response synthesizer formats and cites the answer
response_synthesizer = get_response_synthesizer(
    response_mode="compact",
    verbose=False
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

def ask(question: str) -> str:
    """Query the document index and return an answer with citations."""
    response = query_engine.query(question)

    # Extract source citations
    sources = []
    if hasattr(response, 'source_nodes'):
        for node in response.source_nodes:
            filename = node.metadata.get('file_name', 'unknown')
            page     = node.metadata.get('page_label', node.metadata.get('page', '?'))
            score    = round(node.score, 3) if node.score is not None else None
            source   = f"{filename} (page {page})"
            if score is not None:
                source += f" [relevance: {score}]"
            if source not in sources:
                sources.append(source)

    answer = str(response)
    if sources:
        answer += f"\n\n📎 Sources: {' | '.join(sources)}"
    return answer

def main():
    print("\n" + "="*60)
    print("Document Q&A Agent — Ready")
    print("="*60)
    print("Ask questions about your documents.")
    print("Type 'quit' to exit.\n")

    while True:
        question = input("Question: ").strip()
        if question.lower() in ('quit', 'exit', 'q'):
            break
        if not question:
            continue

        print("\nSearching documents...")
        answer = ask(question)
        print(f"\nAnswer: {answer}\n")
        print("-" * 40 + "\n")

if __name__ == "__main__":
    main()
Step 4 — Run and Test
bash
# Step 1: Ingest documents (run once)
python ingest.py

# Step 2: Start the Q&A agent
python agent.py

Example interaction with an insurance policy document:

Sample session
Question: What is the deductible for collision coverage?

Searching documents...

Answer: According to the policy documents, the collision coverage 
deductible is $500 for standard auto policies. However, if you 
selected the "low deductible" option at enrollment, it may be 
reduced to $250.

📎 Sources: auto-policy-2024.pdf (page 12) [relevance: 0.891] | 
            coverage-summary.pdf (page 3) [relevance: 0.743]

----------------------------------------

Question: Does the policy cover rental car costs after an accident?

Searching documents...

Answer: Yes, rental reimbursement coverage is included if you 
added the "Transportation Expense" endorsement. The policy covers 
up to $30/day and $900 total per claim while your vehicle is being 
repaired.

📎 Sources: auto-policy-2024.pdf (page 18) [relevance: 0.912]

----------------------------------------

Question: What is the maximum age to purchase a life insurance policy?

Searching documents...

Answer: I couldn't find information about this in the documents. 
Your documents appear to cover auto and home insurance — this 
question may be about a product not covered in the indexed files.

Pro tip: The last example shows the agent correctly saying "I don't know" instead of hallucinating. This is the key benefit of RAG — the agent only answers from what's actually in your documents.

Extend It

Web interface (Streamlit)

Run pip install streamlit and create an app.py with a simple chat UI. Start it with streamlit run app.py and share the URL with your team. Takes about 20 lines of code.

Auto-update when documents change

Add a file watcher using watchdog that re-runs ingest.py automatically when any file in the documents/ folder is added, changed, or deleted.
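If you'd rather skip the extra dependency, a polling loop over file modification times does the same job for small folders. A rough stdlib-only sketch (watchdog's event-based API is the better fit for anything beyond a hobby setup):

```python
import os
import subprocess
import time

def snapshot(folder: str) -> dict[str, float]:
    """Map each file path under `folder` to its last-modified time."""
    return {
        os.path.join(root, name): os.path.getmtime(os.path.join(root, name))
        for root, _, files in os.walk(folder)
        for name in files
    }

def watch(folder: str = "documents", interval: float = 5.0) -> None:
    """Re-run ingest.py whenever a file is added, changed, or deleted."""
    last = snapshot(folder)
    while True:
        time.sleep(interval)
        current = snapshot(folder)
        if current != last:  # any add, edit, or delete changes the dict
            print("Change detected, rebuilding index...")
            subprocess.run(["python", "ingest.py"], check=True)
            last = current
```

Comparing the whole path-to-mtime dict catches all three cases at once: new files add keys, deletions remove keys, and edits change values.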

Multi-document filtering

Tag each document with metadata (e.g. "product: home-insurance", "year: 2024") and add filters to the retriever so users can ask "according to the 2024 policy..." and only search the relevant subset.
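Conceptually, the filter just narrows the candidate set before the similarity search runs. A toy illustration with plain dicts (in LlamaIndex itself you would attach metadata when loading documents and pass MetadataFilters to the retriever; check the current docs for the exact class names):

```python
def filter_chunks(chunks: list[dict], **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches every given key=value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in criteria.items())
    ]

chunks = [
    {"text": "Deductible is $500.",  "metadata": {"product": "auto", "year": 2024}},
    {"text": "Deductible is $1000.", "metadata": {"product": "home", "year": 2024}},
    {"text": "Deductible was $400.", "metadata": {"product": "auto", "year": 2023}},
]

# "according to the 2024 auto policy..." searches only this subset
subset = filter_chunks(chunks, product="auto", year=2024)
```

Similarity search then runs over `subset` only, so a question about the 2024 auto policy can never be answered from the home policy or an outdated year.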

Conversation memory

Replace the basic query engine with LlamaIndex's CondensePlusContextChatEngine to support follow-up questions in a conversation: "What's the deductible?" → "How do I lower it?" (it remembers you were asking about deductibles).