RAG Agents: Retrieval-Augmented Answers

Another email lands in Julie's inbox: "How many vacation days do new employees get?" The answer exists — it's in Beacon & Co.'s employee handbook, a PDF sitting in a shared drive. But that's not a database Aria can write SQL against, and it's definitely not something the model knew about during training. It's a private, unstructured document.

This article teaches Aria to search exactly that kind of source. The technique is called RAG — Retrieval-Augmented Generation — and despite the intimidating name, it's built from concepts that are genuinely approachable once unpacked one at a time.

🟡 Skill level: Intermediate.

Quick Reference

When to use this: Whenever an agent needs to answer questions grounded in your own documents — PDFs, internal docs, anything not in the model's training data and not structured enough for a database.

Basic syntax:

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_core.vectorstores import InMemoryVectorStore

data = PyPDFLoader("handbook.pdf").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200).split_documents(data)

vector_store = InMemoryVectorStore(OpenAIEmbeddings(model="text-embedding-3-large"))
vector_store.add_documents(documents=chunks)

results = vector_store.similarity_search("your question here")

Common patterns:

  • Load a document → split it into smaller chunks → convert chunks to embeddings → store them → search by meaning, not exact keywords
  • The agent never reads the whole document — only the most relevant chunk(s) retrieved for a specific question
  • Wrap retrieval as a tool, the same @tool pattern from Article 2

Gotchas:

  • ⚠️ Retrieval can fail to find the right chunk, especially for vague queries — this isn't a guarantee of correctness, it's a best-effort search.
  • ⚠️ The agent only ever "sees" the specific chunks retrieved for a given question, never the entire document at once.

See also: SQL Agents: Querying Databases in Natural Language

What You Need to Know First

  • Everything from Article 2 — tool definition and the agent decision-making process
  • No new API keys are needed — we'll reuse the OpenAI key from Article 1

What We'll Cover in This Article

  • What RAG is and why it's needed for documents like a PDF handbook
  • How to load and split a document into searchable pieces
  • What an embedding is, and how it enables "search by meaning"
  • How to wrap document search as a tool Aria can use

What We'll Explain Along the Way

  • Why long documents get split into smaller chunks before searching
  • What an embedding actually is, conceptually
  • What a vector store does with those embeddings

What Is RAG, and Why Do We Need It?

You've already seen one version of this problem in Article 3: a model can't know things it was never shown. Web search fixed that for current events. But Beacon & Co.'s employee handbook has a different problem entirely — it's not that the information is too recent, it's that it's private. It was never on the public internet, so no amount of web searching will ever find it. It only exists in one PDF file.

RAG (Retrieval-Augmented Generation) is the general technique for handling exactly this: take your own documents, make them searchable, and let the agent retrieve relevant pieces before generating an answer — hence the name. Think of it like the difference between asking a new employee a policy question off the top of their head (they'll guess) versus handing them the actual employee handbook and asking them to find the relevant section first (they'll quote it accurately). RAG is the second approach, automated.

The tricky part — and the part with new vocabulary — is how a computer finds "the relevant section" without literally reading the whole document for every single question. That's what the rest of this article builds, piece by piece.
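To make the retrieve-then-generate flow concrete, here is a toy sketch in plain Python. It is not how the rest of this article does retrieval: the `retrieve` function below scores sections by shared words, purely as a stand-in for the meaning-based search built later. The section texts, `retrieve`, and `build_prompt` are all hypothetical names invented for illustration.

```python
import re

# Hypothetical handbook sections (made-up text, for illustration only).
SECTIONS = [
    "PTO policy: full-time employees accrue paid time off each year.",
    "Confidentiality: employees must sign an NDA before starting.",
    "Expenses: submit receipts within 30 days for reimbursement.",
]

def words(text: str) -> set[str]:
    """Lowercase the text and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def retrieve(question: str, sections: list[str]) -> str:
    """Stand-in retrieval: pick the section sharing the most words with the question."""
    q = words(question)
    return max(sections, key=lambda s: len(q & words(s)))

def build_prompt(question: str, context: str) -> str:
    """Augment the question with the retrieved text before it goes to a model."""
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

question = "How much paid time off do employees get?"
context = retrieve(question, SECTIONS)
prompt = build_prompt(question, context)
print(prompt)
```

The model only ever receives `prompt`, which contains one retrieved section rather than the whole handbook. That is the "retrieval-augmented" part of the name.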

Loading the Document

First, let's load the PDF itself:

uv add pypdf

# Purpose: Load a PDF document into a format LangChain can work with
# Context: The first step of the RAG pipeline — getting the raw text out of the file
# Input: A path to a PDF file
# Output: A list of "page" documents, one per page, with extracted text

from dotenv import load_dotenv
load_dotenv()

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("employee_handbook.pdf")
data = loader.load()

print(len(data), "pages loaded")

PyPDFLoader extracts the text from each page of the PDF, giving you back a list of page-level documents. That's a start, but a single page can still be a lot of text — too much to search effectively in one piece, which brings us to the next step.

Splitting Into Chunks (and Why That's Necessary)

Language models can only consider a limited amount of text at once for any given request — this limit is called a context window. Searching by handing over an entire 20-page handbook for every single question would be slow, expensive, and often less accurate, since the genuinely relevant sentence gets buried among everything else.

The fix is to split the document into smaller, overlapping chunks — small enough to search efficiently, but with enough surrounding context that meaning isn't lost at the boundaries.

# Purpose: Split loaded pages into smaller, searchable chunks
# Context: Chunks are small enough to search efficiently, with overlap to
# avoid losing meaning at chunk boundaries
# Input: The page-level documents loaded above
# Output: A larger list of smaller chunk documents

from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,       # upper limit on characters per chunk
    chunk_overlap=200,     # how many characters consecutive chunks share
    add_start_index=True,  # track where each chunk started in the original text
)

all_splits = text_splitter.split_documents(data)

print(len(all_splits), "chunks created")

The chunk_overlap=200 matters more than it might look: without overlap, a sentence that falls right at the boundary between two chunks could be cut in half, with neither chunk containing the complete thought. With overlap, the end of one chunk is repeated at the start of the next, so a boundary sentence short enough to fit in the shared region still appears intact in at least one chunk.
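The effect of overlap is easy to see with a simplified splitter. The sketch below is an assumption-laden stand-in, not how RecursiveCharacterTextSplitter works internally (the real splitter also tries to break on paragraph and sentence separators). `split_with_overlap` is a hypothetical helper that just slides a fixed-size window over the text, and the tiny sizes are chosen only to make the boundary visible.

```python
def split_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Cut text into fixed-size windows; consecutive windows share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

text = "New hires get 10 days. Managers approve requests. Unused days expire."
sentence = "Managers approve requests."

no_overlap = split_with_overlap(text, chunk_size=30, chunk_overlap=0)
with_overlap = split_with_overlap(text, chunk_size=30, chunk_overlap=10)

# Is the middle sentence intact in any single chunk?
print(any(sentence in chunk for chunk in no_overlap))    # False: cut at character 30
print(any(sentence in chunk for chunk in with_overlap))  # True: the second window holds it whole
```

Overlap is not a complete safeguard: a sentence longer than the window can still be cut, which is one reason the real splitter prefers natural break points.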

Turning Text into Searchable "Fingerprints": Embeddings

Here's the core idea of RAG: we need a way to find the chunk that's most relevant to a question, even if the question doesn't use the exact same words as the document. "How much time off do I get?" should still find a chunk about "PTO Policy," even though none of those words match exactly.

This is done with embeddings. An embeddings model converts a piece of text into a list of numbers (called a vector) that captures its meaning, and that list is the text's embedding. Texts with similar meaning end up with similar number-lists, even if the actual words are completely different. Think of it like a fingerprint for meaning: two passages about vacation time will have "fingerprints" that are close to each other in this number-space, while a passage about NDAs will have a very different one, regardless of the specific words each uses.
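"Close to each other" can be measured. One common measure is cosine similarity, which is near 1.0 for vectors pointing the same way and near 0.0 for unrelated ones. The sketch below uses made-up three-number vectors as an assumption. Real embeddings come from a model and contain hundreds or thousands of numbers, but the comparison works the same way.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up "embeddings" for illustration only; not produced by any real model.
vacation_question = [0.9, 0.1, 0.0]
pto_policy = [0.8, 0.2, 0.1]
nda_policy = [0.1, 0.1, 0.9]

close = cosine_similarity(vacation_question, pto_policy)
far = cosine_similarity(vacation_question, nda_policy)

print(round(close, 2))  # high: the two "fingerprints" point the same way
print(round(far, 2))    # low: the NDA passage points somewhere else
```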

# Purpose: Set up an embeddings model to convert text into meaning-vectors
# Context: This is what makes "search by meaning" possible
# Input: N/A — this just configures the embeddings model
# Output: An embeddings object we'll use to build a searchable store

from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

Now we need somewhere to store every chunk along with its embedding, and to search through them efficiently. That's a vector store:

# Purpose: Create a searchable store of chunks and their embeddings
# Context: InMemoryVectorStore keeps everything in memory while the program runs
# Input: The chunks created earlier
# Output: A vector store ready to be searched by meaning

from langchain_core.vectorstores import InMemoryVectorStore

vector_store = InMemoryVectorStore(embeddings)
vector_store.add_documents(documents=all_splits)

add_documents runs every chunk through the embeddings model and stores the resulting vectors alongside the original text — this is the "indexing" step, and it's why InMemoryVectorStore needs the embeddings object passed in when it's created.
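Conceptually, a vector store is a list of (vector, text) pairs plus a ranking step. The toy class below sketches that idea under loud assumptions: `ToyVectorStore` is hypothetical, and `toy_embed` just counts words, so unlike a real embeddings model it cannot match "vacation" to "PTO". It only mirrors the shape of the add-then-search workflow.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in embedding: word counts. Real embeddings capture meaning, not just words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

class ToyVectorStore:
    def __init__(self, embed):
        self.embed = embed
        self.rows = []  # (vector, original text) pairs

    def add_documents(self, texts):
        # "Indexing": embed each chunk once, up front, and keep the text beside it.
        for text in texts:
            self.rows.append((self.embed(text), text))

    def similarity_search(self, query, k=1):
        # Embed the query the same way, then rank stored chunks by similarity.
        q = self.embed(query)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[0]), reverse=True)
        return [text for _, text in ranked[:k]]

store = ToyVectorStore(toy_embed)
store.add_documents([
    "Employees accrue ten vacation days in their first year.",
    "All employees sign a confidentiality agreement.",
])
top = store.similarity_search("How many vacation days in the first year?")
print(top[0])
```

The same embedding function has to be used for both the chunks and the query. Vectors from two different models are not comparable.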

Testing Semantic Search Directly

Before wiring anything into Aria, let's confirm search-by-meaning actually works, the same way we tested tools directly in earlier articles:

# Purpose: Confirm semantic search finds relevant content, even without
# exact keyword matches
# Context: Tests the vector store directly, before involving an agent
# Input: A natural-language question
# Output: The most relevant chunk(s) found, ranked by meaning similarity

results = vector_store.similarity_search(
    "How many days of vacation does an employee get in their first year?"
)

print(results[0].page_content)

You should see a chunk containing the actual PTO policy — something like "Full-time employees accrue PTO according to the following schedule: 0–1 years of service: 10 days per year" — found by meaning, not by matching the literal word "vacation" against the document (which may say "PTO" instead).

Wrapping It as a Tool for Aria

Now let's give Aria this ability, using the exact same @tool pattern from every previous article:

# Purpose: Let Aria search the employee handbook to answer policy questions
# Context: Same @tool pattern as always — the body just does a vector search
# Input: A natural-language question
# Output: The most relevant chunk of the handbook for that question

from langchain.tools import tool

@tool
def search_handbook(query: str) -> str:
    """Search the Beacon & Co. employee handbook for policy information."""
    results = vector_store.similarity_search(query)
    return results[0].page_content

# Purpose: Let Aria answer a real policy question using the handbook
# Context: Continues the running scenario — answering a colleague's question
# Input: A plain-English HR question
# Output: An answer grounded in the real handbook content

from langchain.agents import create_agent
from langchain.messages import HumanMessage

agent = create_agent(
    model="gpt-5-nano",
    tools=[search_handbook],
    system_prompt="You are a helpful agent that can search the Beacon & Co. "
                  "employee handbook for information.",
)

question = HumanMessage(
content="How many days of vacation does an employee get in their first year?"
)

response = agent.invoke({"messages": [question]})
print(response["messages"][-1].content)

Aria should answer accurately — 10 days per year in the first year of service — grounded in the actual retrieved chunk, not a guess.

Common Misconceptions

❌ Misconception: The model "read" and memorized the entire handbook

Reality: The model only ever sees the specific chunk(s) returned by similarity_search for one particular question — never the whole document at once.

Why this matters: If a question requires combining information scattered across many unrelated sections, a single retrieved chunk may not be enough. This is a real limitation of basic RAG, which retrieves relevant passages rather than building a comprehensive understanding of the document.

❌ Misconception: Semantic search always finds the right chunk

Reality: Retrieval is a best-effort similarity search, not a guarantee. A vague or unusually phrased question can retrieve an irrelevant chunk, especially in a large or repetitive document.

Why this matters: For anything important, it's worth checking what was actually retrieved (as we did with results[0].page_content above) rather than blindly trusting the final answer.
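One lightweight way to make that check routine is to print a short, ranked summary of what came back. The sketch below is illustrative only: `Doc` is a stand-in for a retrieved document, `describe_results` is a hypothetical helper, and the `"page"` metadata key is an assumption about what the loader recorded, so adjust it to whatever your documents actually carry.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def describe_results(results, preview_chars=60):
    """Build one summary line per retrieved chunk: rank, page (if known), short preview."""
    lines = []
    for rank, doc in enumerate(results, start=1):
        page = doc.metadata.get("page", "?")  # assumed key; falls back to "?" if absent
        preview = doc.page_content[:preview_chars].replace("\n", " ")
        lines.append(f"#{rank} (page {page}): {preview}")
    return "\n".join(lines)

# Hypothetical results, shaped like what a similarity search might return.
fake_results = [
    Doc("Full-time employees accrue PTO according to the following schedule", {"page": 4}),
    Doc("Requests for time off must be approved by a manager", {"page": 5}),
]
report = describe_results(fake_results)
print(report)
```

Because the helper only reads `page_content` and `metadata`, it should also accept the objects returned by `vector_store.similarity_search`.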

Troubleshooting Common Issues

Problem: similarity_search returns an irrelevant chunk

Symptoms: The retrieved chunk doesn't actually relate to the question asked.

Common Causes:

  1. The question is too vague or uses very different terminology than the document (most common)
  2. chunk_size is too large or small for the document's structure, splitting related content awkwardly
  3. The document doesn't actually contain an answer to the question at all

Diagnostic Steps:

# Step 1: Inspect more than just the top result
query = "How many vacation days do new employees get?"  # the question being debugged
results = vector_store.similarity_search(query, k=3)
for r in results:
    print(r.page_content[:200], "\n---")

# Step 2: Try rephrasing the query closer to the document's likely wording

Solution: Try a handful of phrasings of the same question, and consider returning more than one result to the agent (k=3 instead of just the top match) so it has more context to choose from.
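Trying several phrasings by hand gets tedious, so it can help to script it. The sketch below assumes a `search(query, k)` callable that returns a list of chunk texts. Both `search_many_phrasings` and `fake_search` are hypothetical, and the canned results exist only so the example runs without a real vector store.

```python
def search_many_phrasings(search, phrasings, k=3):
    """Run each phrasing through `search` and merge the results, dropping duplicates."""
    seen = set()
    merged = []
    for phrasing in phrasings:
        for chunk in search(phrasing, k):
            if chunk not in seen:  # keep the first occurrence, preserve order
                seen.add(chunk)
                merged.append(chunk)
    return merged

def fake_search(query, k):
    """Stand-in for a real search: returns canned chunk texts for two known queries."""
    canned = {
        "vacation days": ["PTO schedule chunk", "Holiday calendar chunk"],
        "paid time off": ["PTO schedule chunk", "Leave request chunk"],
    }
    return canned.get(query, [])[:k]

merged = search_many_phrasings(fake_search, ["vacation days", "paid time off"])
print(merged)
```

With the real store, `search` could be a small wrapper such as `lambda q, k: [d.page_content for d in vector_store.similarity_search(q, k=k)]`.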

Prevention: For documents with very distinct sections, experimenting with chunk_size can meaningfully improve retrieval quality — there's no single correct value for every document.
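When experimenting, it helps to know roughly how many chunks a given setting will produce, since every chunk has to be embedded. The estimate below assumes simple fixed-size windows, so treat it as a ballpark only: a separator-aware splitter will usually produce a somewhat different count. `estimate_chunk_count` and the 40,000-character document length are hypothetical.

```python
import math

def estimate_chunk_count(total_chars: int, chunk_size: int, chunk_overlap: int) -> int:
    """Approximate number of fixed-size windows needed to cover total_chars."""
    if total_chars <= chunk_size:
        return 1
    step = chunk_size - chunk_overlap  # how far each new window advances
    return math.ceil((total_chars - chunk_overlap) / step)

# Hypothetical 40,000-character handbook, keeping the overlap at 200.
for size in (500, 1000, 2000):
    print(size, estimate_chunk_count(40_000, size, chunk_overlap=200))
```

Smaller chunks mean more, narrower search targets. Larger chunks mean fewer targets that each carry more surrounding context.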

Problem: PyPDFLoader returns empty or garbled text

Symptoms: data loads, but page_content is empty or full of nonsense characters.

Common Causes:

  1. The PDF is a scanned image rather than real selectable text — PyPDFLoader extracts existing text, it doesn't perform image recognition
  2. The PDF uses an unusual encoding or embedded fonts that don't extract cleanly

Solution: Confirm you can select and copy text from the PDF in a regular PDF viewer first — if you can't, PyPDFLoader won't be able to either, and you'd need a different tool (OCR) entirely, which is outside the scope of this article.
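A quick programmatic check can flag pages where extraction came back nearly empty, which is the typical sign of a scanned page. The sketch below works on plain strings. `find_suspect_pages` and the 20-character threshold are assumptions chosen for illustration, and the check does not detect garbled (non-empty but nonsensical) text.

```python
def find_suspect_pages(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is nearly empty."""
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if len(text.strip()) < min_chars
    ]

# Hypothetical extracted text; the second page mimics a scanned image with no text layer.
pages = [
    "Welcome to Beacon & Co. This handbook describes company policies.",
    "   \n  ",
    "PTO Policy: full-time employees accrue paid time off each year.",
]
suspects = find_suspect_pages(pages)
print(suspects)  # [2]
```

With the loaded documents from earlier, the input would be `[d.page_content for d in data]`.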

Check Your Understanding

Quick Quiz

  1. Why can't web search (Article 3) solve the "Aria needs to read the employee handbook" problem?

    Answer:

    Because the handbook is a private document that was never on the public internet — web search can only find publicly accessible information, while this is a problem of access, not recency.

  2. What does an embedding actually represent?

    Answer:

    A list of numbers representing the meaning of a piece of text, generated by an embeddings model. Text with similar meaning produces similar number-lists, even when the actual wording is very different — this is what enables search by meaning instead of exact keyword matching.

  3. Why does the document get split into overlapping chunks instead of searched as one whole document?

    Answer:

    Models can only consider a limited amount of text at once (the context window), and searching a whole long document for every question is inefficient and less accurate. Overlap between chunks prevents a relevant sentence from being awkwardly split across a chunk boundary.

Hands-On Exercise

Challenge: Modify search_handbook to return the top 2 chunks instead of just 1, joined together, so the agent has more context for questions that might span more than one relevant passage.

Solution:

from langchain.tools import tool

@tool
def search_handbook(query: str) -> str:
    """Search the Beacon & Co. employee handbook for policy information."""
    results = vector_store.similarity_search(query, k=2)
    return "\n\n---\n\n".join(r.page_content for r in results)

Explanation: Passing k=2 retrieves the top 2 most relevant chunks instead of just 1, joined with a clear separator — useful when a single chunk might not contain the complete answer.

Summary: Key Takeaways

  • RAG (Retrieval-Augmented Generation) lets an agent answer questions grounded in your own private documents, not just public training data
  • Documents get split into smaller, overlapping chunks so they can be searched efficiently
  • An embedding converts text into a list of numbers representing its meaning — similar meaning produces similar embeddings
  • A vector store holds chunks alongside their embeddings and supports search by meaning, not just exact keywords
  • The agent only ever sees the specific chunk(s) retrieved for one question — never the whole document
  • Aria can now answer real policy questions, grounded in Beacon & Co.'s actual employee handbook

Version Information

Tested with:

  • Python: >=3.10, <4.0
  • langchain: >=1.1.3 (latest stable as of writing: 1.3.4)
  • langchain-community: >=0.4.1 (provides PyPDFLoader)
  • pypdf: >=6.12.0 — required by PyPDFLoader for actual PDF parsing
  • langchain-text-splitters: >=1.0.0 (provides RecursiveCharacterTextSplitter)
  • langchain-core: >=1.3.3 (provides InMemoryVectorStore)

Known issues:

  • ⚠️ InMemoryVectorStore rebuilds its index from scratch every time your program runs — for a large document or frequent restarts, a persistent vector store would be more efficient, but that's outside the scope of this introductory article.

What's Next?

You now understand how to make an unstructured document searchable by meaning, and how to wrap that search as a tool.

The natural next step is Multi-Agent Systems: Subagents and Delegation — every agent so far has been one agent with a growing set of tools. That article covers what happens when a task is big enough to genuinely need a team of specialized agents working together.

References