What is RAG and Why Does It Matter?
RAG = Retrieval-Augmented Generation. The key to accurate, trustworthy AI answers.

Without RAG

  • LLM answers from training data only
  • Training data stops at a fixed cutoff date
  • Doesn't know your company's documents
  • Confidently makes up ("hallucinates") details
  • No citations — can't verify the answer

With RAG

  • Searches your documents for relevant passages
  • Feeds those passages into the LLM's context
  • LLM answers based only on your content
  • Cites the exact source document and section
  • Says "I don't know" when the answer isn't in your docs
The RAG pipeline in 3 steps:

1. Index: Load your documents → split into chunks → convert to vectors (embeddings) → store in a vector database
2. Retrieve: User asks a question → convert question to vector → find the most similar chunks in the database
3. Generate: Feed the retrieved chunks + the question to Claude/GPT → get a grounded, cited answer
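The three steps can be sketched without any libraries. In this toy version, bag-of-words counts stand in for real embeddings (the pipeline below uses OpenAI's text-embedding-3-small instead), but the retrieval math, similarity search over vectors, is the same:

```python
import re
from collections import Counter
from math import sqrt

def embed(text: str) -> Counter:
    """Toy 'embedding': a sparse bag-of-words vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(question: str, chunks: list[str]) -> str:
    """Step 2: return the chunk most similar to the question."""
    q = embed(question)
    return max(chunks, key=lambda c: cosine(q, embed(c)))

# Step 1: "index" a few document chunks
chunks = [
    "The collision deductible is 500 dollars.",
    "Rental reimbursement covers 30 dollars per day.",
    "File a claim within 30 days of the accident.",
]

# Step 2: retrieve; Step 3 would feed this chunk plus the question to the LLM
print(retrieve("What is the collision deductible?", chunks))
```

Real embeddings capture meaning rather than exact word overlap, so "How much do I pay if I crash?" would still land on the deductible chunk; that is the whole reason to pay for an embedding model.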
Prerequisites

  • Python 3.9+ and pip
  • An OpenAI API key (used for both embeddings and the LLM)
  • A folder of PDF, text, or Markdown files to index
Step 1 — Project Setup
bash — create the project
mkdir doc-qa-agent && cd doc-qa-agent
python3 -m venv venv
source venv/bin/activate

pip install openai llama-index llama-index-embeddings-openai \
    llama-index-llms-openai pypdf python-dotenv

# Create the documents directory and add your files
mkdir documents
# Copy your PDFs or text files into the documents/ folder
.env
OPENAI_API_KEY=sk-...your-key...
Step 2 — Ingest Your Documents
This script loads all your documents, splits them into chunks, embeds them, and saves the index to disk. You run it once (or whenever your documents change).
ingest.py — run once to build the index
import os
from dotenv import load_dotenv
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

load_dotenv()

# Configure embedding model and LLM
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.getenv("OPENAI_API_KEY")
)
Settings.llm = OpenAI(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY")
)

print("Loading documents from ./documents/ ...")
documents = SimpleDirectoryReader(
    input_dir="./documents",
    recursive=True,
    required_exts=[".pdf", ".txt", ".md"]
).load_data()

print(f"Loaded {len(documents)} documents (PDFs load as one document per page)")
print("Building vector index (this may take a minute)...")

index = VectorStoreIndex.from_documents(
    documents,
    show_progress=True
)

# Save the index to disk so we don't rebuild every time
index.storage_context.persist(persist_dir="./index_storage")
print("✓ Index saved to ./index_storage")
print("\nDocuments indexed:")
sources = set(d.metadata.get('file_name', 'unknown') for d in documents)
for s in sorted(sources):
    print(f"  - {s}")

Run it: python ingest.py. For 10–20 PDF pages, this takes about 30 seconds and costs a few cents in OpenAI embedding API calls. Re-run whenever you add or update documents.
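A note on the "split into chunks" step: chunks are made to overlap slightly so a sentence that straddles a boundary stays intact in at least one chunk. LlamaIndex handles this internally (tunable via Settings.chunk_size and Settings.chunk_overlap; check the current docs for defaults). A minimal sketch of the idea, not the library's actual splitter:

```python
def chunk_text(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping windows of `chunk_size` words.

    Each chunk starts `chunk_size - overlap` words after the previous one,
    so the last `overlap` words of a chunk reappear at the start of the next.
    """
    words = text.split()
    step = chunk_size - overlap
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, max(len(words) - overlap, 1), step)
    ]
```

Smaller chunks retrieve more precisely but give the LLM less surrounding context; a few hundred to roughly a thousand tokens is a common range to experiment in.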

Step 3 — The Query Agent
This is the interactive agent. It loads the saved index and answers questions with citations.
agent.py — interactive Q&A agent with citations
import os
from dotenv import load_dotenv
from llama_index.core import StorageContext, load_index_from_storage, Settings
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import VectorIndexRetriever
from llama_index.core.response_synthesizers import get_response_synthesizer
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI

load_dotenv()

Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.getenv("OPENAI_API_KEY")
)
Settings.llm = OpenAI(
    model="gpt-4o",
    api_key=os.getenv("OPENAI_API_KEY"),
    system_prompt="""You are a helpful document assistant. Answer questions 
based ONLY on the provided context. If the answer is not in the context, 
say "I couldn't find information about this in the documents." 
Always cite the source document and section when answering.
Format citations as: [Source: filename, page X]"""
)

# Load the pre-built index from disk
print("Loading document index...")
storage_context = StorageContext.from_defaults(persist_dir="./index_storage")
index = load_index_from_storage(storage_context)

# Configure retriever — top_k=5 means retrieve 5 most relevant chunks
retriever = VectorIndexRetriever(index=index, similarity_top_k=5)

# Response synthesizer formats and cites the answer
response_synthesizer = get_response_synthesizer(
    response_mode="compact",
    verbose=False
)

query_engine = RetrieverQueryEngine(
    retriever=retriever,
    response_synthesizer=response_synthesizer,
)

def ask(question: str) -> str:
    """Query the document index and return an answer with citations."""
    response = query_engine.query(question)

    # Extract source citations
    sources = []
    if hasattr(response, 'source_nodes'):
        for node in response.source_nodes:
            filename = node.metadata.get('file_name', 'unknown')
            page     = node.metadata.get('page_label', node.metadata.get('page', '?'))
            score    = round(node.score, 3) if node.score is not None else None
            source   = f"{filename} (page {page})"
            if score is not None:
                source += f" [relevance: {score}]"
            if source not in sources:
                sources.append(source)

    answer = str(response)
    if sources:
        answer += f"\n\n📎 Sources: {' | '.join(sources)}"
    return answer

def main():
    print("\n" + "="*60)
    print("Document Q&A Agent — Ready")
    print("="*60)
    print("Ask questions about your documents.")
    print("Type 'quit' to exit.\n")

    while True:
        question = input("Question: ").strip()
        if question.lower() in ('quit', 'exit', 'q'):
            break
        if not question:
            continue

        print("\nSearching documents...")
        answer = ask(question)
        print(f"\nAnswer: {answer}\n")
        print("-" * 40 + "\n")

if __name__ == "__main__":
    main()
Step 4 — Run and Test
bash
# Step 1: Ingest documents (run once)
python ingest.py

# Step 2: Start the Q&A agent
python agent.py

Example interaction with an insurance policy document:

Sample session
Question: What is the deductible for collision coverage?

Searching documents...

Answer: According to the policy documents, the collision coverage 
deductible is $500 for standard auto policies. However, if you 
selected the "low deductible" option at enrollment, it may be 
reduced to $250.

📎 Sources: auto-policy-2024.pdf (page 12) [relevance: 0.891] | 
            coverage-summary.pdf (page 3) [relevance: 0.743]

----------------------------------------

Question: Does the policy cover rental car costs after an accident?

Searching documents...

Answer: Yes, rental reimbursement coverage is included if you 
added the "Transportation Expense" endorsement. The policy covers 
up to $30/day and $900 total per claim while your vehicle is being 
repaired.

📎 Sources: auto-policy-2024.pdf (page 18) [relevance: 0.912]

----------------------------------------

Question: What is the maximum age to purchase a life insurance policy?

Searching documents...

Answer: I couldn't find information about this in the documents. 
Your documents appear to cover auto and home insurance — this 
question may be about a product not covered in the indexed files.

Pro tip: The last example shows the agent correctly saying "I don't know" instead of hallucinating. This is the key benefit of RAG — the agent only answers from what's actually in your documents.

Extend It

Web interface (Streamlit)

Run pip install streamlit and create an app.py with a simple chat UI. Start it with streamlit run app.py and share the URL with your team. Takes about 20 lines of code.

Auto-update when documents change

Add a file watcher using watchdog that re-runs ingest.py automatically when any file in the documents/ folder is added, changed, or deleted.
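If you'd rather skip the extra dependency, a polling loop over file modification times does the same job for small folders. A rough stdlib-only sketch (watchdog's event-based API is the better fit for anything beyond a hobby setup):

```python
import os
import subprocess
import time

def snapshot(folder: str) -> dict[str, float]:
    """Map each file path under `folder` to its last-modified time."""
    return {
        os.path.join(root, name): os.path.getmtime(os.path.join(root, name))
        for root, _, files in os.walk(folder)
        for name in files
    }

def watch(folder: str = "documents", interval: float = 5.0) -> None:
    """Re-run ingest.py whenever a file is added, changed, or deleted."""
    last = snapshot(folder)
    while True:
        time.sleep(interval)
        current = snapshot(folder)
        if current != last:  # any add, edit, or delete changes the dict
            print("Change detected, rebuilding index...")
            subprocess.run(["python", "ingest.py"], check=True)
            last = current
```

Comparing the whole path-to-mtime dict catches all three cases at once: new files add keys, deletions remove keys, and edits change values.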

Multi-document filtering

Tag each document with metadata (e.g. "product: home-insurance", "year: 2024") and add filters to the retriever so users can ask "according to the 2024 policy..." and only search the relevant subset.
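Conceptually, the filter just narrows the candidate set before the similarity search runs. A toy illustration with plain dicts (in LlamaIndex itself you would attach metadata when loading documents and pass MetadataFilters to the retriever; check the current docs for the exact class names):

```python
def filter_chunks(chunks: list[dict], **criteria) -> list[dict]:
    """Keep only chunks whose metadata matches every given key=value pair."""
    return [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in criteria.items())
    ]

chunks = [
    {"text": "Deductible is $500.",  "metadata": {"product": "auto", "year": 2024}},
    {"text": "Deductible is $1000.", "metadata": {"product": "home", "year": 2024}},
    {"text": "Deductible was $400.", "metadata": {"product": "auto", "year": 2023}},
]

# "according to the 2024 auto policy..." searches only this subset
subset = filter_chunks(chunks, product="auto", year=2024)
```

Similarity search then runs over `subset` only, so a question about the 2024 auto policy can never be answered from the home policy or an outdated year.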

Conversation memory

Replace the basic query engine with LlamaIndex's CondensePlusContextChatEngine to support follow-up questions in a conversation: "What's the deductible?" → "How do I lower it?" (it remembers you were asking about deductibles).