Build a RAG pipeline from scratch in Python without LangChain

Most RAG tutorials dump you into LangChain abstractions before you understand what’s actually happening. That’s backwards. If you want to build a RAG pipeline in Python that you can debug, optimize, and extend, you need to understand each component. This guide walks through the entire process — chunking, embedding, storing, retrieving, and generating — using only basic libraries.

What you’re building

A retrieval-augmented generation (RAG) system that:

Splits documents into chunks
Converts chunks to vector embeddings via OpenAI
Stores vectors in ChromaDB (a local vector database)
Retrieves relevant chunks based on a query
Passes context + query to Claude for a grounded answer

Here’s the folder structure:

rag-project/
├── documents/
│   └── sample.txt
├── rag.py
└── requirements.txt

Step 1: Install dependencies and set up API keys

pip install openai anthropic chromadb

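Export your API keys in your shell. A .env file also works, but only if you load it yourself (for example with python-dotenv), because the code below reads os.environ directly: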

export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."

In rag.py, load them:

import os
from openai import OpenAI
from anthropic import Anthropic
import chromadb

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
anthropic_client = Anthropic(api_key=os.environ["ANTHROPIC_API_KEY"])
chroma_client = chromadb.PersistentClient(path="./chroma_db")

Step 2: Chunk your documents

Chunking strategy matters more than most people realize. Overlapping chunks preserve context at boundaries. Here’s a simple character-based chunker:

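def chunk_text(text: str, chunk_size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks of at most chunk_size characters."""
    if not 0 <= overlap < chunk_size:
        # overlap >= chunk_size would stop the window from advancing
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end].strip())
        if end >= len(text):
            break  # Last window reached; avoid a redundant tail chunk
        start = end - overlap
    return [c for c in chunks if c]  # Remove empty chunks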


def load_documents(directory: str) -> list[dict]:
    """Load all .txt files and chunk them."""
    all_chunks = []
    for filename in sorted(os.listdir(directory)):  # Sorted for a stable order
        if filename.endswith(".txt"):
            filepath = os.path.join(directory, filename)
            with open(filepath, "r", encoding="utf-8") as f:
                text = f.read()
            chunks = chunk_text(text)
            for i, chunk in enumerate(chunks):
                all_chunks.append({
                    "id": f"{filename}_{i}",
                    "text": chunk,
                    "source": filename
                })
    return all_chunks

For production, consider sentence-based or semantic chunking. But character-based works fine for getting started.
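If you want to try sentence-based chunking, here is a minimal sketch. The function name chunk_by_sentences, its parameters, and the regex splitter are all assumptions made for illustration; the splitter is naive and will break on abbreviations like "e.g.", so treat it as a starting point and not a production tokenizer. A single sentence longer than max_chars becomes its own oversized chunk.

```python
import re


def chunk_by_sentences(
    text: str, max_chars: int = 500, overlap_sentences: int = 1
) -> list[str]:
    """Group whole sentences into chunks of roughly max_chars characters."""
    # Naive splitter (an assumption): break after ., !, or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    chunks: list[str] = []
    current: list[str] = []
    for sentence in sentences:
        candidate = " ".join(current + [sentence])
        if current and len(candidate) > max_chars:
            chunks.append(" ".join(current))
            # Carry the last sentence(s) forward as overlap.
            carried = current[-overlap_sentences:] if overlap_sentences > 0 else []
            # Drop the overlap if it would push the next chunk over the limit.
            if len(" ".join(carried + [sentence])) > max_chars:
                carried = []
            current = carried
        current.append(sentence)
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_by_sentences("One. Two. Three. Four.", max_chars=12))
# ['One. Two.', 'Two. Three.', 'Three. Four.']
```

It returns a list of strings just like chunk_text, so you could call it from load_documents in place of chunk_text without changing anything else.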

Step 3: Generate embeddings with OpenAI

OpenAI’s text-embedding-3-small model is cheap and effective. Here’s how to embed your chunks:

def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Generate embeddings for a list of texts."""
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=texts
    )
    return [item.embedding for item in response.data]


def index_documents(chunks: list[dict]):
    """Embed chunks and store in ChromaDB."""
    collection = chroma_client.get_or_create_collection(
        name="documents",
        metadata={"hnsw:space": "cosine"}
    )
    
    # Batch to avoid rate limits
    batch_size = 100
    for i in range(0, len(chunks), batch_size):
        batch = chunks[i:i + batch_size]
        texts = [c["text"] for c in batch]
        ids = [c["id"] for c in batch]
        metadatas = [{"source": c["source"]} for c in batch]
        
        embeddings = get_embeddings(texts)
        
        # upsert (rather than add) so re-running the script updates chunks with existing IDs
        collection.upsert(
            ids=ids,
            embeddings=embeddings,
            documents=texts,
            metadatas=metadatas
        )
    
    print(f"Indexed {len(chunks)} chunks")

The hnsw:space: cosine setting tells ChromaDB to compare vectors by cosine distance (1 minus cosine similarity) instead of its default, squared L2 distance. Cosine works well with OpenAI embeddings.
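To get a feel for the numbers Chroma reports in the cosine space, it helps to compute cosine distance by hand. This is a plain-Python sketch of the formula (1 minus cosine similarity), not Chroma's actual implementation; the function name cosine_distance is an assumption, and it does not handle zero-length vectors.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Return 1 - cosine similarity for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


print(cosine_distance([1.0, 0.0], [2.0, 0.0]))   # same direction  -> 0.0
print(cosine_distance([1.0, 0.0], [0.0, 3.0]))   # orthogonal      -> 1.0
print(cosine_distance([1.0, 0.0], [-1.0, 0.0]))  # opposite        -> 2.0
```

Vector length does not matter here, only direction: [1, 0] and [2, 0] are at distance 0.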

Step 4: Retrieve relevant chunks

When a user asks a question, embed their query and find the closest chunks:

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Find the most relevant chunks for a query."""
    collection = chroma_client.get_collection("documents")
    
    query_embedding = get_embeddings([query])[0]
    
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    
    retrieved = []
    for i in range(len(results["ids"][0])):
        retrieved.append({
            "text": results["documents"][0][i],
            "source": results["metadatas"][0][i]["source"],
            "distance": results["distances"][0][i]
        })
    
    return retrieved

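The distance field helps you debug retrieval quality. Lower values mean closer matches, and with the cosine space configured above, 0 means the two vectors point in the same direction. If even your top results have high distances, your documents probably don't cover the question, or your chunks are splitting the relevant passage badly.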
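One way to act on the distance is to drop weak matches before they reach the prompt. The sketch below assumes chunks shaped like the output of retrieve; the helper name filter_by_distance and the 0.6 cutoff are made up for illustration, so tune the threshold against your own data.

```python
def filter_by_distance(chunks: list[dict], max_distance: float = 0.6) -> list[dict]:
    """Keep only chunks whose distance is at or below max_distance."""
    return [c for c in chunks if c["distance"] <= max_distance]


# Made-up results, shaped like the dicts returned by retrieve().
hits = [
    {"text": "relevant passage", "source": "a.txt", "distance": 0.25},
    {"text": "loosely related", "source": "b.txt", "distance": 0.58},
    {"text": "off-topic", "source": "c.txt", "distance": 0.91},
]
kept = filter_by_distance(hits, max_distance=0.6)
print([c["source"] for c in kept])  # ['a.txt', 'b.txt']
```

If the filter removes everything, that is a useful signal in itself: better to say nothing relevant was found than to hand the model off-topic context.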

Step 5: Generate answers with Claude

Now combine retrieved context with the user’s question and send to Claude:

def generate_answer(query: str, context_chunks: list[dict]) -> str:
    """Generate an answer using retrieved context."""
    context = "\n\n---\n\n".join([
        f"[Source: {c['source']}]\n{c['text']}" 
        for c in context_chunks
    ])
    
    message = anthropic_client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You answer questions based only on the provided context. If the context doesn't contain enough information, say so. Cite sources when possible.",
        messages=[
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {query}"
            }
        ]
    )
    
    return message.content[0].text


def ask(query: str) -> str:
    """Full RAG pipeline: retrieve then generate."""
    chunks = retrieve(query, top_k=5)
    
    if not chunks:
        return "No relevant documents found."
    
    return generate_answer(query, chunks)
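Because the prompt is just a string, you can build it on its own and print it to see exactly what the model will receive, without spending an API call. A minimal sketch: build_prompt is a hypothetical helper that mirrors the formatting used in generate_answer, and the sample chunks are invented.

```python
def build_prompt(query: str, context_chunks: list[dict]) -> str:
    """Return the user message text built from retrieved chunks and the query."""
    context = "\n\n---\n\n".join(
        f"[Source: {c['source']}]\n{c['text']}" for c in context_chunks
    )
    return f"Context:\n{context}\n\nQuestion: {query}"


# Made-up chunks, shaped like the output of retrieve().
sample = [
    {"text": "Chroma stores vectors on disk.", "source": "a.txt", "distance": 0.21},
    {"text": "Claude generates the answer.", "source": "b.txt", "distance": 0.35},
]
print(build_prompt("Where are vectors stored?", sample))
```

When an answer looks wrong, printing this string is usually the fastest way to tell whether retrieval or generation is at fault.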

Putting it all together

Here’s the complete workflow:

if __name__ == "__main__":
    # Indexing (as written, this runs and re-embeds every document on each invocation)
    chunks = load_documents("./documents")
    index_documents(chunks)
    
    # Query
    question = "What are the main points about X?"
    answer = ask(question)
    print(answer)

Run it:

python rag.py

Key takeaways

Chunking affects everything: experiment with chunk sizes for your document type. Dense material such as code often retrieves better in smaller chunks, while prose can usually tolerate larger ones.
Embedding is usually the slowest and costliest part of indexing: for large document sets, batch your requests and cache embeddings so unchanged text is never embedded twice.
ChromaDB persists locally: the PersistentClient saves your vectors to disk, so an index built once can be queried across runs.
The system prompt shapes output: telling Claude to cite sources and admit uncertainty helps keep answers grounded in the retrieved context.
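To make the caching takeaway concrete, here is a sketch of an embedding cache keyed by a hash of each text. The name embed_with_cache and the in-memory dict are assumptions for illustration; in the pipeline above you would pass get_embeddings as embed_fn, and you would need to persist the dict yourself (for example as JSON) for it to survive between runs. The demo uses a fake embedding function so it runs without an API key.

```python
import hashlib
from typing import Callable


def embed_with_cache(
    texts: list[str],
    embed_fn: Callable[[list[str]], list[list[float]]],
    cache: dict[str, list[float]],
) -> list[list[float]]:
    """Embed texts, calling embed_fn only for texts not already in cache."""
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]

    # Collect each uncached text once, preserving order.
    missing: dict[str, str] = {}
    for key, text in zip(keys, texts):
        if key not in cache and key not in missing:
            missing[key] = text

    if missing:
        vectors = embed_fn(list(missing.values()))
        for key, vector in zip(missing.keys(), vectors):
            cache[key] = vector

    return [cache[key] for key in keys]


# Stand-in for get_embeddings: records each call and returns a toy vector.
calls: list[list[str]] = []


def fake_embed(batch: list[str]) -> list[list[float]]:
    calls.append(list(batch))
    return [[float(len(t))] for t in batch]


cache: dict[str, list[float]] = {}
embed_with_cache(["a", "bb", "a"], fake_embed, cache)
embed_with_cache(["bb", "ccc"], fake_embed, cache)
print(calls)  # [['a', 'bb'], ['ccc']]
```

The second call only embeds "ccc", because "bb" is already cached. Hashing the text (and not using the chunk ID) means a chunk is re-embedded only when its content actually changes.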

You now have a working RAG pipeline with no framework magic. From here, you can swap ChromaDB for Pinecone, replace OpenAI embeddings with local models, or add reranking. But you’ll understand exactly what each piece does.