Build a RAG pipeline from scratch in Python without LangChain
Most RAG tutorials dump you into LangChain abstractions before you understand what’s actually happening. That’s backwards. If you want to build a RAG pipeline in Python that you can debug, optimize, and extend, you need to understand each component. This guide walks through the entire process — chunking, embedding, storing, retrieving, and generating — using only basic libraries.
What you’re building
A Retrieval Augmented Generation system that:
- Splits documents into chunks
- Converts chunks to vector embeddings via OpenAI
- Stores vectors in ChromaDB (a local vector database)
- Retrieves relevant chunks based on a query
- Passes context + query to Claude for a grounded answer
Here’s the folder structure:
rag-project/
├── documents/
│   └── sample.txt
├── rag.py
└── requirements.txt
Step 1: Install dependencies and set up API keys
pip install openai anthropic chromadb tiktoken
Export these in your shell (or put them in a .env file and load it, as shown in the optional snippet below):
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
In rag.py, load them:
import os
from openai import OpenAI
from anthropic import Anthropic
import chromadb
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
chroma_client = chromadb.PersistentClient(path="./chroma_db")
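If you go the .env route, note that the clients above read os.environ directly, so the file has to be loaded first. One option, an extra dependency that isn't in the install line, is python-dotenv; a minimal sketch:

# Optional: load variables from a .env file into os.environ.
# Requires an extra install: pip install python-dotenv
from dotenv import load_dotenv

load_dotenv()  # call this before reading os.environ / constructing the clients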
Step 2: Chunk your documents
Chunking strategy matters more than most people realize. Overlapping chunks preserve context at boundaries. Here’s a simple character-based chunker:
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunk = text[start:end]
        chunks.append(chunk.strip())
        start = end - overlap
    return [c for c in chunks if c]  # Remove empty chunks
def load_documents(directory: str) -> list[dict]:
    """Load all .txt files and chunk them."""
    all_chunks = []
    for filename in os.listdir(directory):
        if filename.endswith(".txt"):
            filepath = os.path.join(directory, filename)
            with open(filepath, "r", encoding="utf-8") as f:
                text = f.read()
            chunks = chunk_text(text)
            for i, chunk in enumerate(chunks):
                all_chunks.append({
                    "id": f"{filename}_{i}",
                    "text": chunk,
                    "source": filename
                })
    return all_chunks
For production, consider sentence-based or semantic chunking. But character-based works fine for getting started.
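If you want to try the sentence-based route, here's a minimal sketch of one way to do it. The chunk_by_sentence name and the regex split are my assumptions, and the naive split will trip on abbreviations like "e.g.", but it keeps sentences intact at chunk boundaries:

import re

def chunk_by_sentence(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current.strip())
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current.strip())
    return chunks

A sentence longer than max_chars becomes its own oversized chunk, which is usually fine for prose.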
Step 3: Generate embeddings with OpenAI
OpenAI’s text-embedding-3-small model is cheap and effective. Here’s how to embed your chunks:
def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Generate embeddings for a list of texts."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]
def index_documents(chunks: list[dict]):
    """Embed chunks and store in ChromaDB."""
    collection = chroma_client.get_or_create_collection(
        name="documents",
        metadata={"hnsw:space": "cosine"}
    )
    # Batch to avoid rate limits
    batch_size = 100
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        ids = [c["id"] for c in batch]
        metadatas = [{"source": c["source"]} for c in batch]
        embeddings = get_embeddings(texts)
        collection.add(
            ids=ids,
            embeddings=embeddings,
            documents=texts,
            metadatas=metadatas
        )
    print(f"Indexed {len(chunks)} chunks")
Setting hnsw:space to cosine tells ChromaDB to use cosine similarity for comparisons, which works well with OpenAI embeddings.
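If you want to see what that comparison means concretely, you can compute cosine similarity yourself on a pair of embeddings. This quick check uses numpy, which ChromaDB already pulls in; the sample strings are just placeholders:

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """1.0 = same direction, ~0.0 = unrelated. ChromaDB's cosine distance is 1 minus this value."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

emb_a, emb_b = get_embeddings(["The cat sat on the mat.", "A cat lay on the rug."])
print(cosine_similarity(emb_a, emb_b))  # semantically close sentences score high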
Step 4: Retrieve relevant chunks
When a user asks a question, embed their query and find the closest chunks:
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Find the most relevant chunks for a query."""
    collection = chroma_client.get_collection("documents")
    query_embedding = get_embeddings([query])[0]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    retrieved = []
    for i in range(len(results["ids"][0])):
        retrieved.append({
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i]
        })
    return retrieved
The distance field helps you debug retrieval quality. Lower values mean closer matches. If your top results have high distances, your chunks might not contain relevant information.
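A small helper makes that debugging concrete. This isn't part of the pipeline, just one way to eyeball what retrieval returns before any generation happens:

def debug_retrieval(query: str, top_k: int = 5) -> None:
    """Print each retrieved chunk's distance, source, and a short preview."""
    for rank, item in enumerate(retrieve(query, top_k), start=1):
        preview = item["text"][:80].replace("\n", " ")
        print(f"{rank}. dist={item['distance']:.3f} [{item['source']}] {preview}...")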
Step 5: Generate answers with Claude
Now combine retrieved context with the user’s question and send to Claude:
def generate_answer(query: str, context_chunks: list[dict]) -> str:
    """Generate an answer using retrieved context."""
    context = "\n\n---\n\n".join([
        f"[Source: {c['source']}]\n{c['text']}"
        for c in context_chunks
    ])
    message = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You answer questions based only on the provided context. If the context doesn't contain enough information, say so. Cite sources when possible.",
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ]
    )
    return message.content[0].text
def ask(query: str) -> str:
    """Full RAG pipeline: retrieve then generate."""
    chunks = retrieve(query, top_k=5)
    if not chunks:
        return "No relevant documents found."
    return generate_answer(query, chunks)
Putting it all together
Here’s the complete workflow:
if __name__ == "__main__":
    # One-time indexing
    chunks = load_documents("./documents")
    index_documents(chunks)
    # Query
    question = "What are the main points about X?"
    answer = ask(question)
    print(answer)
Run it:
python rag.py
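As written, every run re-indexes the documents. One practical tweak, sketched here as an assumption rather than part of the script above, is to skip indexing when the collection already holds vectors:

if __name__ == "__main__":
    collection = chroma_client.get_or_create_collection(
        name="documents",
        metadata={"hnsw:space": "cosine"}
    )
    if collection.count() == 0:  # only embed and index on the first run
        index_documents(load_documents("./documents"))

    print(ask("What are the main points about X?"))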
Key takeaways
- Chunking affects everything — experiment with chunk sizes based on your document type. Code needs smaller chunks; prose can handle larger ones.
- Embeddings are the bottleneck — OpenAI’s API is fast, but for large document sets, consider batching and caching embeddings (see the caching sketch after this list).
- ChromaDB persists locally — the PersistentClient saves your vectors to disk, so you only embed once.
- The system prompt shapes output — telling Claude to cite sources and admit uncertainty dramatically improves answer quality.
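Here is the caching sketch referenced above: it memoizes embeddings on disk keyed by a hash of the chunk text, so unchanged text is never re-embedded. The JSON file, its name, and the get_embeddings_cached helper are my assumptions, not part of the pipeline above; swap it into index_documents if you adopt it.

import hashlib
import json

def get_embeddings_cached(texts: list[str], cache_path: str = "embedding_cache.json") -> list[list[float]]:
    """Like get_embeddings, but reuses previously computed vectors stored in a JSON file."""
    try:
        with open(cache_path, "r", encoding="utf-8") as f:
            cache = json.load(f)
    except FileNotFoundError:
        cache = {}

    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = [t for t, k in zip(texts, keys) if k not in cache]
    if missing:
        # Embed only the texts we haven't seen before, then persist the updated cache.
        for t, emb in zip(missing, get_embeddings(missing)):
            cache[hashlib.sha256(t.encode("utf-8")).hexdigest()] = emb
        with open(cache_path, "w", encoding="utf-8") as f:
            json.dump(cache, f)

    return [cache[k] for k in keys]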
You now have a working RAG pipeline with no framework magic. From here, you can swap ChromaDB for Pinecone, replace OpenAI embeddings with local models, or add reranking. But you’ll understand exactly what each piece does.