7.1 The Knowledge Problem
Imagine asking a brilliant friend about yesterday’s news. No matter how smart they are, if they’ve been offline for a month or just too busy with other matters, they can’t help you. Large language models face a similar challenge—they’re frozen in time at the point of creation, trained on data from months or years ago. They also can’t access your private documents, your company’s database, or specialized knowledge bases.
This is where Retrieval Augmented Generation (RAG) comes in. Instead of relying solely on a model’s parametric memory (knowledge baked into its weights), we augment it with retrieval: fetching relevant information from external sources and providing it as context.
Think of RAG as giving your AI a library card and teaching it to look things up.
7.2 The Basic RAG Pipeline
A RAG system has three core components:
Indexing: Convert documents into searchable embeddings
Retrieval: Find relevant documents for a query
Generation: Use retrieved context to answer the query
Let’s build this step by step.
7.2.1 Setting Up Our Environment
import ollama
import numpy as np
from typing import List, Dict

def get_embedding(text: str, model: str = "nomic-embed-text") -> List[float]:
    """Get embedding vector for text."""
    response = ollama.embeddings(model=model, prompt=text)
    return response['embedding']

def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

class SimpleDocumentStore:
    def __init__(self):
        self.documents = []
        self.embeddings = []

    def add_document(self, text: str, metadata: Dict = None):
        """Add a document and compute its embedding."""
        self.documents.append({"text": text, "metadata": metadata or {}})
        embedding = get_embedding(text)
        self.embeddings.append(embedding)

    def search(self, query: str, top_k: int = 3) -> List[Dict]:
        """Find most relevant documents."""
        query_emb = get_embedding(query)
        scores = [cosine_similarity(query_emb, doc_emb)
                  for doc_emb in self.embeddings]
        # Indices of the top-k highest scores, best first
        top_indices = np.argsort(scores)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]

7.3 Your First RAG Application
Let’s build a simple RAG system for course syllabi—something every student wishes existed!
# Initialize document store
store = SimpleDocumentStore()

# Add some course information
store.add_document(
    "CS 101 office hours are Monday and Wednesday 2-4pm in Room 305.",
    {"course": "CS101", "type": "logistics"}
)
store.add_document(
    "The final project for CS 101 is due December 15th and worth 40% of your grade.",
    {"course": "CS101", "type": "assessment"}
)
store.add_document(
    "CS 101 covers Python basics, data structures, and algorithm fundamentals.",
    {"course": "CS101", "type": "content"}
)

def rag_query(question: str, store: SimpleDocumentStore) -> str:
    """Answer question using RAG."""
    # Retrieve relevant documents
    relevant_docs = store.search(question, top_k=2)

    # Build context from retrieved documents
    context = "\n\n".join([doc["text"] for doc in relevant_docs])

    # Create prompt with context
    prompt = f"""Based on the following information:

{context}

Question: {question}

Answer the question using only the information provided above."""

    # Generate response
    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
    return response['response']

# Test it
print(rag_query("When is the final project due?", store))

The final project for CS 101 is due December 15th.
Notice how the answer is grounded in the retrieved information: because the prompt restricts the model to the provided context, it has little room to hallucinate dates or invent policies.
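To make that behavior explicit, it helps to tell the model what to do when the context does not contain the answer. Below is a minimal variant of rag_query with a fallback instruction; the exact wording of that instruction is an assumption, not something prescribed by the chapter.

def rag_query_with_fallback(question: str, store: SimpleDocumentStore) -> str:
    """Like rag_query, but tells the model to admit when the context is insufficient."""
    relevant_docs = store.search(question, top_k=2)
    context = "\n\n".join([doc["text"] for doc in relevant_docs])

    prompt = f"""Based on the following information:

{context}

Question: {question}

Answer using only the information above. If the answer is not contained in the
information, say that you don't know."""

    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
    return response['response']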
7.4 Chunking: Breaking Down Documents
Real documents are usually too long to embed as single units: embedding models have input limits, and one vector for an entire document dilutes its meaning. We need to break documents into meaningful chunks. This is trickier than it sounds!
7.4.1 Naive Chunking
def chunk_by_sentences(text: str, sentences_per_chunk: int = 3) -> List[str]:
    """Simple sentence-based chunking."""
    sentences = text.replace('!', '.').replace('?', '.').split('.')
    sentences = [s.strip() for s in sentences if s.strip()]

    chunks = []
    for i in range(0, len(sentences), sentences_per_chunk):
        chunk = '. '.join(sentences[i:i+sentences_per_chunk]) + '.'
        chunks.append(chunk)
    return chunks

7.4.2 Overlapping Chunks
Better results often come from overlapping chunks—this preserves context across boundaries.
def chunk_with_overlap(text: str, chunk_size: int = 200,
                       overlap: int = 50) -> List[str]:
    """Create overlapping chunks by character count."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        start += (chunk_size - overlap)
    return chunks
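A quick sanity check of the overlap behavior (the sample text and sizes are illustrative, not from the chapter):

sample = "RAG systems retrieve documents, then generate answers grounded in them."
for chunk in chunk_with_overlap(sample, chunk_size=40, overlap=10):
    print(repr(chunk))
# Each chunk starts 30 characters after the previous one, so consecutive
# chunks share a 10-character boundary region instead of cutting cleanly.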
7.4.3 Semantic Chunking
The most sophisticated approach: chunk at natural semantic boundaries.
def semantic_chunk(text: str, similarity_threshold: float = 0.7) -> List[str]:
    """Chunk text at semantic boundaries."""
    sentences = text.split('. ')
    if len(sentences) < 2:
        return [text]

    chunks = []
    current_chunk = [sentences[0]]
    prev_emb = get_embedding(sentences[0])  # cache so each sentence is embedded once

    for i in range(1, len(sentences)):
        # Compare similarity between the previous and current sentence
        curr_emb = get_embedding(sentences[i])
        similarity = cosine_similarity(prev_emb, curr_emb)

        if similarity < similarity_threshold:
            # Start a new chunk at the semantic boundary
            chunks.append('. '.join(current_chunk) + '.')
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])
        prev_emb = curr_emb

    if current_chunk:
        chunks.append('. '.join(current_chunk) + '.')
    return chunks

7.5 Advanced Retrieval Strategies
Simple cosine similarity is just the beginning. Let’s explore more sophisticated retrieval methods.
7.5.1 Hybrid Search: Combining Dense and Sparse Retrieval
Dense retrieval (embeddings) captures semantic meaning. Sparse retrieval (keyword matching) captures exact terms. Combining them gives us the best of both worlds.
class HybridSearchStore(SimpleDocumentStore):
    def keyword_score(self, query: str, document: str) -> float:
        """Simple keyword-overlap score (a rough stand-in for BM25)."""
        query_terms = set(query.lower().split())
        doc_terms = document.lower().split()
        # Fraction of document terms that appear in the query
        matches = sum(1 for term in doc_terms if term in query_terms)
        return matches / len(doc_terms) if doc_terms else 0

    def hybrid_search(self, query: str, top_k: int = 3,
                      alpha: float = 0.5) -> List[Dict]:
        """Combine semantic and keyword search."""
        # Semantic scores
        query_emb = get_embedding(query)
        semantic_scores = [cosine_similarity(query_emb, emb)
                           for emb in self.embeddings]

        # Keyword scores
        keyword_scores = [self.keyword_score(query, doc["text"])
                          for doc in self.documents]

        # Combine with alpha weighting
        combined_scores = [
            alpha * sem + (1 - alpha) * key
            for sem, key in zip(semantic_scores, keyword_scores)
        ]

        top_indices = np.argsort(combined_scores)[-top_k:][::-1]
        return [self.documents[i] for i in top_indices]
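The keyword_score above is only a rough stand-in for a real sparse retriever. If you want actual BM25 scoring, the classic formula is short enough to implement directly. A minimal sketch (the helper name and the conventional defaults k1=1.5, b=0.75 are not part of the chapter's code):

import math

def bm25_scores(query: str, documents: List[str],
                k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Score each document against the query with the BM25 formula."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n_docs

    def idf(term: str) -> float:
        # Number of documents containing the term
        df = sum(1 for toks in tokenized if term in toks)
        return math.log((n_docs - df + 0.5) / (df + 0.5) + 1)

    scores = []
    for toks in tokenized:
        score = 0.0
        for term in set(query.lower().split()):
            tf = toks.count(term)
            denom = tf + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf(term) * (tf * (k1 + 1)) / denom
        scores.append(score)
    return scores

These scores could replace keyword_scores inside hybrid_search, but since BM25 scores are unbounded while cosine similarity stays in [-1, 1], it is worth normalizing both score lists before mixing them with alpha.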
7.5.2 Metadata Filtering
Sometimes we want to constrain our search by metadata—like searching only within a specific course or time period.
def filtered_search(store: SimpleDocumentStore, query: str,
                    filters: Dict, top_k: int = 3) -> List[Dict]:
    """Search with metadata filtering."""
    # First, filter by metadata
    filtered_indices = []
    for i, doc in enumerate(store.documents):
        match = all(doc["metadata"].get(k) == v
                    for k, v in filters.items())
        if match:
            filtered_indices.append(i)

    if not filtered_indices:
        return []

    # Then, semantic search within the filtered set
    query_emb = get_embedding(query)
    scores = [(i, cosine_similarity(query_emb, store.embeddings[i]))
              for i in filtered_indices]
    scores.sort(key=lambda x: x[1], reverse=True)
    return [store.documents[i] for i, _ in scores[:top_k]]
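For example, restricting the syllabus store from Section 7.3 to logistics entries only (an illustrative call; the query string is arbitrary):

results = filtered_search(store, "When can I get help from the instructor?",
                          filters={"course": "CS101", "type": "logistics"},
                          top_k=2)
for doc in results:
    print(doc["text"])
# Only documents tagged course=CS101 and type=logistics are candidates, so the
# assessment and content entries are never retrieved, however similar they are.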
7.6 Query Transformation Techniques
Users don’t always ask questions in the optimal way for retrieval. Query transformation helps bridge this gap.
7.6.1 Query Expansion with an LLM
def expand_query(original_query: str) -> List[str]:
    """Generate alternative phrasings of a query."""
    prompt = f"""Generate 3 alternative ways to ask this question,
each focusing on different aspects or using different terminology:

Original: {original_query}

Alternatives (one per line):"""

    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
    alternatives = [original_query]  # Include original
    alternatives.extend(response['response'].strip().split('\n'))
    return [q.strip('- ').strip() for q in alternatives if q.strip()]

# Example usage
query = "What are the prerequisites for the AI course?"
expanded = expand_query(query)
print(expanded)
# Search with all variations and combine results

['What are the prerequisites for the AI course?', '1. Could you please list the requirements for the AI course?', '2. What essential knowledge or skills are needed before enrolling in the AI course?', '3. What should students know or do to be prepared for the AI course?']
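The comment above leaves the combination step open. One simple approach is to search with every phrasing and keep each document's best score; the merge strategy below (max score per document) is just one reasonable choice, with reciprocal rank fusion being another common one.

def multi_query_search(store: SimpleDocumentStore, queries: List[str],
                       top_k: int = 3) -> List[Dict]:
    """Search with several query phrasings and merge results by best score."""
    best_scores = {}  # document index -> best similarity across all phrasings
    for q in queries:
        q_emb = get_embedding(q)
        for i, doc_emb in enumerate(store.embeddings):
            score = cosine_similarity(q_emb, doc_emb)
            if score > best_scores.get(i, float('-inf')):
                best_scores[i] = score
    ranked = sorted(best_scores, key=best_scores.get, reverse=True)
    return [store.documents[i] for i in ranked[:top_k]]

# results = multi_query_search(store, expanded, top_k=3)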
7.6.2 Hypothetical Document Embeddings (HyDE)
Instead of embedding the query directly, generate a hypothetical answer and embed that. Answers tend to look more like the documents we are searching than questions do, so the hypothetical answer's embedding often lands closer to the relevant passages.
def hyde_retrieval(query: str, store: SimpleDocumentStore,
                   top_k: int = 3) -> List[Dict]:
    """Retrieve using hypothetical document embeddings."""
    # Generate a hypothetical answer
    prompt = f"Write a brief, factual answer to: {query}"
    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
    hypothetical_doc = response['response']

    # Embed and search with the hypothetical answer
    hyp_emb = get_embedding(hypothetical_doc)
    scores = [cosine_similarity(hyp_emb, doc_emb)
              for doc_emb in store.embeddings]
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [store.documents[i] for i in top_indices]

7.7 Context Management and Prompt Engineering
Retrieving documents is only half the battle. We need to present them effectively to the LLM.
7.7.1 Reranking Retrieved Documents
Not all retrieved documents are equally relevant. Reranking refines our initial retrieval.
def rerank_with_llm(query: str, documents: List[Dict]) -> List[Dict]:
    """Use LLM to rerank documents by relevance."""
    scores = []
    for doc in documents:
        prompt = f"""On a scale of 0-10, how relevant is this document to the query?

Query: {query}
Document: {doc['text']}

Respond with only a number 0-10:"""

        response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
        try:
            score = float(response['response'].strip())
        except ValueError:
            score = 5.0  # Default if parsing fails
        scores.append(score)

    # Sort by reranked scores
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in ranked]

7.7.2 Citation and Source Attribution
Good RAG systems cite their sources—crucial for trust and verification.
def rag_with_citations(question: str, store: SimpleDocumentStore) -> str:
    """RAG with source citations."""
    relevant_docs = store.search(question, top_k=3)

    # Number each document
    context_parts = []
    for i, doc in enumerate(relevant_docs, 1):
        context_parts.append(f"[{i}] {doc['text']}")
    context = "\n\n".join(context_parts)

    prompt = f"""Answer the question using the provided sources.
Cite sources using [1], [2], etc.

Sources:
{context}

Question: {question}

Answer with citations:"""

    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)

    # Return answer with source texts
    answer = response['response']
    sources = "\n\n".join([f"[{i}] {doc['text']}"
                           for i, doc in enumerate(relevant_docs, 1)])
    return f"{answer}\n\n---\nSources:\n{sources}"

7.8 Handling Multi-Turn Conversations
RAG becomes more complex with conversation history. We need to maintain context across turns.
class ConversationalRAG:
    def __init__(self, store: SimpleDocumentStore):
        self.store = store
        self.history = []

    def query(self, user_message: str) -> str:
        """Handle conversational query with history."""
        # Build conversation context
        history_context = "\n".join([
            f"User: {h['user']}\nAssistant: {h['assistant']}"
            for h in self.history[-3:]  # Last 3 turns
        ])

        # Retrieve relevant documents
        relevant_docs = self.store.search(user_message, top_k=2)
        doc_context = "\n\n".join([doc["text"] for doc in relevant_docs])

        prompt = f"""Previous conversation:
{history_context}

Relevant information:
{doc_context}

User: {user_message}
Assistant:"""

        response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
        answer = response['response']

        # Update history
        self.history.append({"user": user_message, "assistant": answer})
        return answer

# Example usage
conv_rag = ConversationalRAG(store)
print(conv_rag.query("When are office hours?"))
print(conv_rag.query("Where exactly?"))  # Follows up on previous question

Office hours for CS 101 are on Mondays and Wednesdays from 2-4pm in Room 305.
Office hours for CS 101 are on Mondays and Wednesdays from 2-4pm in Room 305.
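Notice that the follow-up "Where exactly?" is used verbatim for retrieval, which works here only because the store is tiny. A common refinement is to first rewrite the follow-up into a standalone question using the conversation history. A minimal sketch (the rewrite prompt is an assumption, not part of the chapter's code):

def rewrite_followup(history: List[Dict], user_message: str) -> str:
    """Turn a follow-up like "Where exactly?" into a standalone query for retrieval."""
    if not history:
        return user_message
    history_context = "\n".join(
        f"User: {h['user']}\nAssistant: {h['assistant']}" for h in history[-3:]
    )
    prompt = f"""Rewrite the final user message as a standalone question,
using the conversation for context. Reply with the question only.

{history_context}
User: {user_message}

Standalone question:"""
    response = ollama.generate(model="qwen2.5-coder:latest", prompt=prompt)
    return response['response'].strip()

Inside ConversationalRAG.query, the retrieval call would then use the rewritten question, while the prompt and the stored history keep the user's original wording.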
7.12 Building a Complete RAG Application
Let’s bring everything together into a “production-ready” system.
class ProductionRAGSystem:
    def __init__(self, model: str = "qwen2.5-coder:latest"):
        self.store = HybridSearchStore()
        self.model = model
        self.query_cache = {}

    def ingest_documents(self, documents: List[str],
                         metadata: List[Dict] = None):
        """Ingest and chunk documents."""
        for i, doc in enumerate(documents):
            chunks = chunk_with_overlap(doc, chunk_size=300, overlap=50)
            doc_metadata = metadata[i] if metadata else {}
            for j, chunk in enumerate(chunks):
                chunk_meta = {**doc_metadata, "chunk_id": j}
                self.store.add_document(chunk, chunk_meta)

    def query(self, question: str, use_cache: bool = True,
              top_k: int = 3) -> Dict:
        """Main query interface with caching."""
        # Check cache
        if use_cache and question in self.query_cache:
            return self.query_cache[question]

        # Retrieve relevant documents
        relevant_docs = self.store.hybrid_search(question, top_k=top_k)

        # Rerank
        relevant_docs = rerank_with_llm(question, relevant_docs)

        # Build prompt with context
        context = "\n\n".join([
            f"[{i+1}] {doc['text']}"
            for i, doc in enumerate(relevant_docs)
        ])

        prompt = f"""Answer using the provided context. Cite sources with [1], [2], etc.

Context:
{context}

Question: {question}

Answer:"""

        response = ollama.generate(model=self.model, prompt=prompt)

        result = {
            "answer": response['response'],
            "sources": relevant_docs,
            "context": context
        }

        # Cache result
        if use_cache:
            self.query_cache[question] = result
        return result

# Example usage
rag = ProductionRAGSystem()
rag.ingest_documents([
    "Python was created by Guido van Rossum in 1991. It emphasizes code readability.",
    "Python supports multiple programming paradigms including procedural, OOP, and functional.",
    "Python's standard library is extensive, covering file I/O, networking, and more."
])

result = rag.query("Who created Python?")
print(result["answer"])
print("\nSources:")
for i, source in enumerate(result["sources"], 1):
    print(f"[{i}] {source['text'][:100]}...")

Python was created by Guido van Rossum in 1991. [1]

Sources:
[1] Python was created by Guido van Rossum in 1991. It emphasizes code readability....
[2] Python's standard library is extensive, covering file I/O, networking, and more....
[3] Python supports multiple programming paradigms including procedural, OOP, and functional....