7.1 The Knowledge Problem

Imagine asking a brilliant friend about yesterday’s news. No matter how smart they are, if they’ve been offline for a month or just too busy with other matters, they can’t help you. Large language models face a similar challenge—they’re frozen in time at the point of creation, trained on data from months or years ago. They also can’t access your private documents, your company’s database, or specialized knowledge bases.

This is where Retrieval Augmented Generation (RAG) comes in. Instead of relying solely on a model’s parametric memory (knowledge baked into its weights), we augment it with retrieval: fetching relevant information from external sources and providing it as context.

Think of RAG as giving your AI a library card and teaching it to look things up.

7.2 The Basic RAG Pipeline

A RAG system has three core components:

  1. Indexing: Convert documents into searchable embeddings

  2. Retrieval: Find relevant documents for a query

  3. Generation: Use retrieved context to answer the query

Let’s build this step by step.

7.2.1 Setting Up Our Environment

import ollama
import numpy as np
from typing import List, Dict

def get_embedding(text: str, model: str = "nomic-embed-text") -> List[float]:
    """Get embedding vector for text."""
    response = ollama.embeddings(model=model, prompt=text)
    return response['embedding']

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

class SimpleDocumentStore:
    def __init__(self):
        self.documents = []
        self.embeddings = []
    
    def add_document(self, text: str, metadata: Dict = None):
        """Add a document and compute its embedding."""
        self.documents.append({"text": text, "metadata": metadata or {}})
        embedding = get_embedding(text)
        self.embeddings.append(embedding)
    
    def search(self, query: str, top_k: int = 3) -> List[Dict]:
        """Find most relevant documents."""
        query_emb = get_embedding(query)
        scores = [cosine_similarity(query_emb, doc_emb) 
                  for doc_emb in self.embeddings]
        
        # Get top-k indices
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]

7.3 Your First RAG Application

Let’s build a simple RAG system for course syllabi—something every student wishes existed!

# Initialize document store
store = SimpleDocumentStore()

# Add some course information
store.add_document(
    "CS 101 office hours are Monday and Wednesday 2-4pm in Room 305.",
    {"course": "CS101", "type": "logistics"}
)
store.add_document(
    "The final project for CS 101 is due December 15th and worth 40% of your grade.",
    {"course": "CS101", "type": "assessment"}
)
store.add_document(
    "CS 101 covers Python basics, data structures, and algorithm fundamentals.",
    {"course": "CS101", "type": "content"}
)

def rag_query(question: str, store: SimpleDocumentStore) -> str:
    """Answer question using RAG."""
    # Retrieve relevant documents
    relevant_docs = store.search(question, top_k=2)
    
    # Build context from retrieved documents
    context = "\n\n".join([doc["text"] for doc in relevant_docs])
    
    # Create prompt with context
    prompt = f"""Based on the following information:

{context}

Question: {question}

Answer the question using only the information provided above."""

    # Generate response
    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
    return response['response']

# Test it
print(rag_query("When is the final project due?", store))
The final project for CS 101 is due December 15th.

Notice how the model only uses retrieved information—it doesn’t hallucinate dates or make up policies.

7.4 Chunking: Breaking Down Documents

Real documents are too long to embed as single units. We need to break them into meaningful chunks. This is trickier than it sounds!

7.4.1 Naive Chunking

def chunk_by_sentences(text: str, sentences_per_chunk: int = 3) -> List[str]:
    """Simple sentence-based chunking."""
    sentences = text.replace('!', '.').replace('?', '.').split('.')
    sentences = [s.strip() for s in sentences if s.strip()]
    
    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = '. '.join(sentences[i:i+sentences_per_chunk]) + '.'
        chunks.append(chunk)
    return chunks
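
A quick check with a made-up paragraph shows how the chunks come out (note that this naive splitter will happily mangle abbreviations like "e.g." or decimal numbers):

text = ("RAG systems retrieve documents. They pass them to an LLM. "
        "The LLM grounds its answer in the retrieved text. "
        "This reduces hallucination. It also enables citations.")

for chunk in chunk_by_sentences(text, sentences_per_chunk=2):
    print(repr(chunk))
# 'RAG systems retrieve documents. They pass them to an LLM.'
# 'The LLM grounds its answer in the retrieved text. This reduces hallucination.'
# 'It also enables citations.'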

7.4.2 Overlapping Chunks

Better results often come from overlapping chunks—this preserves context across boundaries.

def chunk_with_overlap(text: str, chunk_size: int = 200, 
                       overlap: int = 50) -> List[str]:
    """Create overlapping chunks by character count."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += (chunk_size - overlap)
    return chunks
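
A quick check on a synthetic 500-character "document" confirms that consecutive chunks share exactly 50 characters across each boundary:

text = "".join(str(i % 10) for i in range(500))  # synthetic 500-character string
chunks = chunk_with_overlap(text, chunk_size=200, overlap=50)
print([len(c) for c in chunks])           # [200, 200, 200, 50]
print(chunks[0][-50:] == chunks[1][:50])  # True: 50 characters shared at the boundary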

7.4.3 Semantic Chunking

The most sophisticated approach: chunk at natural semantic boundaries.

def semantic_chunk(text: str, similarity_threshold: float = 0.7) -> List[str]:
    """Chunk text at semantic boundaries."""
    sentences = [s.strip().rstrip('.') for s in text.split('. ') if s.strip()]
    if len(sentences) < 2:
        return [text]

    chunks = []
    current_chunk = [sentences[0]]
    
    for i in range(1, len(sentences)):
        # Compare similarity between current and next sentence
        prev_emb = get_embedding(sentences[i-1])
        curr_emb = get_embedding(sentences[i])
        similarity = cosine_similarity(prev_emb, curr_emb)
        
        if similarity < similarity_threshold:
            # Start new chunk at semantic boundary
            chunks.append('. '.join(current_chunk) + '.')
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
    
    if current_chunk:
        chunks.append('. '.join(current_chunk) + '.')
    
    return chunks
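
This is the most expensive of the three strategies: it makes an embedding call per sentence (and, as written, embeds each interior sentence twice, once as the current sentence and again as the previous one on the next pass), and the resulting splits depend on the embedding model and the threshold. A small illustrative run, assuming a local Ollama server with nomic-embed-text pulled:

text = ("Python lists are ordered and mutable. Tuples are ordered but immutable. "
        "The weather turned cold this week. Snow is expected by Friday.")

# Where the boundary lands depends on the model and threshold; with a suitable
# threshold you would expect a split between the Python and weather sentences.
for chunk in semantic_chunk(text, similarity_threshold=0.7):
    print(repr(chunk))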

7.5 Advanced Retrieval Strategies

Simple cosine similarity is just the beginning. Let’s explore more sophisticated retrieval methods.

7.5.1 Hybrid Search: Combining Dense and Sparse Retrieval

Dense retrieval (embeddings) captures semantic meaning. Sparse retrieval (keyword matching) captures exact terms. Combining them gives us the best of both worlds.

class HybridSearchStore(SimpleDocumentStore):
    def keyword_score(self, query: str, document: str) -> float:
        """Simple keyword matching score (BM25-like)."""
        query_terms = set(query.lower().split())
        doc_terms = document.lower().split()
        
        # Term frequency
        matches = sum(1 for term in doc_terms if term in query_terms)
        return matches / len(doc_terms) if doc_terms else 0
    
    def hybrid_search(self, query: str, top_k: int = 3, 
                      alpha: float = 0.5) -> List[Dict]:
        """Combine semantic and keyword search."""
        # Semantic scores
        query_emb = get_embedding(query)
        semantic_scores = [cosine_similarity(query_emb, emb) 
                          for emb in self.embeddings]
        
        # Keyword scores
        keyword_scores = [self.keyword_score(query, doc["text"]) 
                         for doc in self.documents]
        
        # Combine with alpha weighting
        combined_scores = [
            alpha * sem + (1 - alpha) * key
            for sem, key in zip(semantic_scores, keyword_scores)
        ]
        
        top_indices = np.argsort(combined_scores)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
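
One caveat: cosine similarities and these keyword scores live on quite different ranges, so alpha = 0.5 is not a true 50/50 blend; a real system would normalize both score lists (for example with min-max scaling) before mixing them. A small usage sketch, reusing a couple of the CS 101 snippets from earlier:

hybrid_store = HybridSearchStore()
hybrid_store.add_document(
    "CS 101 office hours are Monday and Wednesday 2-4pm in Room 305.",
    {"course": "CS101", "type": "logistics"}
)
hybrid_store.add_document(
    "The final project for CS 101 is due December 15th and worth 40% of your grade.",
    {"course": "CS101", "type": "assessment"}
)

# The keyword component helps when the query contains exact terms like "Room 305"
for doc in hybrid_store.hybrid_search("Room 305", top_k=1, alpha=0.5):
    print(doc["text"])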

7.5.2 Metadata Filtering

Sometimes we want to constrain our search by metadata—like searching only within a specific course or time period.

def filtered_search(store: SimpleDocumentStore, query: str, 
                   filters: Dict, top_k: int = 3) -> List[Dict]:
    """Search with metadata filtering."""
    # First, filter by metadata
    filtered_indices = []
    for i, doc in enumerate(store.documents):
        match = all(doc["metadata"].get(k) == v 
                   for k, v in filters.items())
        if match:
            filtered_indices.append(i)
    
    if not filtered_indices:
        return []
    
    # Then, semantic search within filtered set
    query_emb = get_embedding(query)
    scores = [(i, cosine_similarity(query_emb, store.embeddings[i])) 
              for i in filtered_indices]
    scores.sort(key=lambda x: x[1], reverse=True)
    
    return [store.documents[i] for i, _ in scores[:top_k]]
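
With the CS 101 store from Section 7.3, restricting the search to assessment documents looks like this:

results = filtered_search(
    store,
    "When is the project due?",
    filters={"course": "CS101", "type": "assessment"},
    top_k=2,
)
for doc in results:
    print(doc["text"])
# Only documents whose metadata matches every filter are scored at all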

7.6 Query Transformation Techniques

Users don’t always ask questions in the optimal way for retrieval. Query transformation helps bridge this gap.

7.6.1 Query Expansion with LLM

def expand_query(original_query: str) -> List[str]:
    """Generate alternative phrasings of a query."""
    prompt = f"""Generate 3 alternative ways to ask this question,
each focusing on different aspects or using different terminology:

Original: {original_query}

Alternatives (one per line):"""
    
    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
    alternatives = [original_query]  # Include original
    alternatives.extend(response['response'].strip().split('\n'))
    return [q.strip('- ').strip() for q in alternatives if q.strip()]

# Example usage
query = "What are the prerequisites for the AI course?"
expanded = expand_query(query)
print(expanded)
# Search with all variations and combine results
['What are the prerequisites for the AI course?', '1. Could you please list the requirements for the AI course?', '2. What essential knowledge or skills are needed before enrolling in the AI course?', '3. What should students know or do to be prepared for the AI course?']
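
The comment above points at the natural next step: run the search once for every variant and merge the results. A minimal sketch that keeps each document's best score across variants (one common merging strategy; reciprocal rank fusion is another):

def multi_query_search(query: str, store: SimpleDocumentStore,
                       top_k: int = 3) -> List[Dict]:
    """Search with every query variant and keep each document's best score."""
    best_scores: Dict[int, float] = {}  # document index -> best similarity seen
    for variant in expand_query(query):
        variant_emb = get_embedding(variant)
        for i, doc_emb in enumerate(store.embeddings):
            score = cosine_similarity(variant_emb, doc_emb)
            best_scores[i] = max(score, best_scores.get(i, score))

    ranked = sorted(best_scores, key=best_scores.get, reverse=True)
    return [store.documents[i] for i in ranked[:top_k]]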

7.6.2 Hypothetical Document Embeddings (HyDE)

Instead of embedding the query directly, generate a hypothetical answer and embed that. This often better matches document embeddings.

def hyde_retrieval(query: str, store: SimpleDocumentStore, 
                   top_k: int = 3) -> List[Dict]:
    """Retrieve using hypothetical document embeddings."""
    # Generate hypothetical answer
    prompt = f"Write a brief, factual answer to: {query}"
    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
    hypothetical_doc = response['response']
    
    # Embed and search with hypothetical answer
    hyp_emb = get_embedding(hypothetical_doc)
    scores = [cosine_similarity(hyp_emb, doc_emb) 
              for doc_emb in store.embeddings]
    
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [store.documents[i] for i in top_indices]
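
It plugs straight into the CS 101 store; the result depends on what the model writes as its hypothetical answer:

docs = hyde_retrieval("What topics does CS 101 cover?", store, top_k=1)
print(docs[0]["text"])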

7.7 Context Management and Prompt Engineering

Retrieving documents is only half the battle. We need to present them effectively to the LLM.

7.7.1 Reranking Retrieved Documents

Not all retrieved documents are equally relevant. Reranking refines our initial retrieval.

def rerank_with_llm(query: str, documents: List[Dict]) -> List[Dict]:
    """Use LLM to rerank documents by relevance."""
    scores = []
    
    for doc in documents:
        prompt = f"""On a scale of 0-10, how relevant is this document to the query?

Query: {query}
Document: {doc['text']}

Respond with only a number 0-10:"""
        
        response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
        try:
            score = float(response['response'].strip())
        except ValueError:
            score = 5.0  # Default if parsing fails
        scores.append(score)
    
    # Sort by reranked scores
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked]
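
Keep in mind that this makes one LLM call per candidate, so it is usually applied to a short list produced by a cheaper first-pass retrieval. For example, with the CS 101 store:

question = "When is the final project due?"
candidates = store.search(question, top_k=3)      # cheap first pass
reranked = rerank_with_llm(question, candidates)  # one LLM call per candidate
print(reranked[0]["text"])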

7.7.2 Citation and Source Attribution

Good RAG systems cite their sources—crucial for trust and verification.

def rag_with_citations(question: str, store: SimpleDocumentStore) -> str:
    """RAG with source citations."""
    relevant_docs = store.search(question, top_k=3)
    
    # Number each document
    context_parts = []
    for i, doc in enumerate(relevant_docs, 1):
        context_parts.append(f"[{i}] {doc['text']}")
    context = "\n\n".join(context_parts)
    
    prompt = f"""Answer the question using the provided sources.
Cite sources using [1], [2], etc.

Sources:
{context}

Question: {question}

Answer with citations:"""
    
    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
    
    # Return answer with source texts
    answer = response['response']
    sources = "\n\n".join([f"[{i}] {doc['text']}" 
                          for i, doc in enumerate(relevant_docs, 1)])
    return f"{answer}\n\n---\nSources:\n{sources}"

7.8 Handling Multi-Turn Conversations

RAG becomes more complex with conversation history. We need to maintain context across turns.

class ConversationalRAG:
    def __init__(self, store: SimpleDocumentStore):
        self.store = store
        self.history = []
    
    def query(self, user_message: str) -> str:
        """Handle conversational query with history."""
        # Build conversation context
        history_context = "\n".join([
            f"User: {h['user']}\nAssistant: {h['assistant']}"
            for h in self.history[-3:]  # Last 3 turns
        ])
        
        # Retrieve relevant documents
        relevant_docs = self.store.search(user_message, top_k=2)
        doc_context = "\n\n".join([doc["text"] for doc in relevant_docs])
        
        prompt = f"""Previous conversation:
{history_context}

Relevant information:
{doc_context}

User: {user_message}
Assistant:"""
        
        response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
        answer = response['response']
        
        # Update history
        self.history.append({"user": user_message, "assistant": answer})
        return answer

# Example usage
conv_rag = ConversationalRAG(store)
print(conv_rag.query("When are office hours?"))
print(conv_rag.query("Where exactly?"))  # Follows up on previous question
Office hours for CS 101 are on Mondays and Wednesdays from 2-4pm in Room 305.
Office hours for CS 101 are on Mondays and Wednesdays from 2-4pm in Room 305.
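
One subtlety: the retrieval step above sees only the raw follow-up ("Where exactly?"), which may not embed well on its own. A common remedy, building on the query transformations from Section 7.6, is to first condense the follow-up and recent history into a standalone question. A minimal sketch (the condense_question helper is an illustration, not part of the class above):

def condense_question(history: List[Dict], follow_up: str,
                      model: str = "qwen2.5-coder:latest") -> str:
    """Rewrite a follow-up into a standalone question using recent history."""
    history_text = "\n".join(
        f"User: {h['user']}\nAssistant: {h['assistant']}" for h in history[-3:]
    )
    prompt = f"""Rewrite the follow-up as a single self-contained question.

Conversation:
{history_text}

Follow-up: {follow_up}

Standalone question:"""
    response = ollama.generate(model=model, prompt=prompt)
    return response['response'].strip()

# Inside ConversationalRAG.query you could then retrieve with:
#   standalone = condense_question(self.history, user_message)
#   relevant_docs = self.store.search(standalone, top_k=2)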

7.12 Building a Complete RAG Application

Let’s bring everything together into a “production-ready” system.

class ProductionRAGSystem:
    def __init__(self, model: str = "qwen2.5-coder:latest"):
        self.store = HybridSearchStore()
        self.model = model
        self.query_cache = {}
    
    def ingest_documents(self, documents: List[str], 
                        metadata: List[Dict] = None):
        """Ingest and chunk documents."""
        for i, doc in enumerate(documents):
            chunks = chunk_with_overlap(doc, chunk_size=300, overlap=50)
            doc_metadata = metadata[i] if metadata else {}
            
            for j, chunk in enumerate(chunks):
                chunk_meta = {**doc_metadata, "chunk_id": j}
                self.store.add_document(chunk, chunk_meta)
    
    def query(self, question: str, use_cache: bool = True, 
              top_k: int = 3) -> Dict:
        """Main query interface with caching."""
        # Check cache
        if use_cache and question in self.query_cache:
            return self.query_cache[question]
        
        # Retrieve relevant documents
        relevant_docs = self.store.hybrid_search(question, top_k=top_k)
        
        # Rerank
        relevant_docs = rerank_with_llm(question, relevant_docs)
        
        # Build prompt with context
        context = "\n\n".join([
            f"[{i+1}] {doc['text']}" 
            for i, doc in enumerate(relevant_docs)
        ])
        
        prompt = f"""Answer using the provided context. Cite sources with [1], [2], etc.

Context:
{context}

Question: {question}

Answer:"""
        
        response = ollama.generate(model=self.model, prompt=prompt)
        
        result = {
            "answer": response['response'],
            "sources": relevant_docs,
            "context": context
        }
        
        # Cache result
        if use_cache:
            self.query_cache[question] = result
        
        return result

# Example usage
rag = ProductionRAGSystem()
rag.ingest_documents([
    "Python was created by Guido van Rossum in 1991. It emphasizes code readability.",
    "Python supports multiple programming paradigms including procedural, OOP, and functional.",
    "Python's standard library is extensive, covering file I/O, networking, and more."
])

result = rag.query("Who created Python?")
print(result["answer"])
print("\nSources:")
for i, source in enumerate(result["sources"], 1):
    print(f"[{i}] {source['text'][:100]}...")
Python was created by Guido van Rossum in 1991. [1]

Sources:
[1] Python was created by Guido van Rossum in 1991. It emphasizes code readability....
[2] Python's standard library is extensive, covering file I/O, networking, and more....
[3] Python supports multiple programming paradigms including procedural, OOP, and functional....