Complex systems have a habit of reinventing fundamental abstractions, and modern AI applications are no exception. Every sufficiently complex chatbot ends up recreating a loosely defined operating system of its own. Our job as designers is to build that operating system intentionally and well.
In previous chapters, we’ve explored the individual components: prompting strategies, retrieval augmentation, tool use, and agentic workflows. Now we face the central challenge of modern AI engineering: how do we compose these pieces into coherent, reliable, and maintainable systems?
This chapter is about architecture in the truest sense—not just code structure, but the art of making principled design decisions that balance competing concerns: accuracy vs. latency, flexibility vs. reliability, simplicity vs. capability.
The Landscape of AI Application Patterns¶
Before we dive into design principles, let’s map the terrain. Modern AI applications exist on a spectrum of complexity:
The Complexity Spectrum¶
Level 0: Direct Prompting
The simplest pattern—send a prompt, get a response. Suitable for well-defined tasks with clear success criteria.
import ollama

def summarize(text: str) -> str:
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': f'Summarize: {text}'}
    ])
    return response['message']['content']

Level 1: Prompt Chaining
Multiple LLM calls in sequence, each building on previous outputs. Think of it as a pipeline.
def analyze_sentiment_with_reasoning(review: str) -> dict:
    # Step 1: Extract key aspects
    aspects = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content':
            f'List the main aspects discussed in this review: {review}'}
    ])['message']['content']

    # Step 2: Analyze sentiment per aspect
    sentiment = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content':
            f'For each aspect, rate sentiment (positive/negative/neutral):\n{aspects}'}
    ])['message']['content']

    return {'aspects': aspects, 'sentiment': sentiment}

Level 2: Retrieval-Augmented Generation (RAG)
Augment prompts with retrieved context from external knowledge bases.
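A minimal sketch of this level, assuming a hypothetical retrieve_context helper that returns relevant passages (a full RAG implementation is developed later in this chapter):

def answer_with_context(question: str, retrieve_context) -> str:
    # retrieve_context is a hypothetical helper: query -> relevant passages
    context = retrieve_context(question)
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer using only the context."
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': prompt}
    ])
    return response['message']['content']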
Level 3: Tool-Using Agents
LLMs that can call functions, query APIs, and interact with external systems.
Level 4: Multi-Agent Systems
Multiple specialized agents collaborating, debating, or working in parallel.
Each level adds power—and complexity. The designer’s challenge is choosing the minimum complexity that solves the problem reliably.
Core Design Principles¶
Principle 1: Decompose by Capability, Not Sequence¶
A common mistake is designing AI systems as linear pipelines when they should be capability graphs. Consider a research assistant:
Poor Design (Sequential):
Query → Retrieve → Summarize → Answer

Better Design (Capability-Based):

Query → [Route] → Retrieve OR Calculate OR Web Search
                         ↓
                   Synthesize → Answer

The key insight: not every query needs every capability. Routing early saves computation and reduces error accumulation.
def route_query(query: str) -> str:
    """Determine what capability is needed."""
    routing_prompt = f"""Classify this query into ONE category:
- FACTUAL: Needs document retrieval
- COMPUTATIONAL: Needs calculation
- CURRENT: Needs web search
- CREATIVE: Needs generation only

Query: {query}
Category:"""
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': routing_prompt}
    ])
    return response['message']['content'].strip()

# Usage
query = "What's the population density of Tokyo?"
route = route_query(query)  # Returns: COMPUTATIONAL

Principle 2: Separate Planning from Execution¶
The “ReAct” pattern (Reasoning + Acting) teaches us to make the LLM’s decision-making process explicit:
def react_loop(task: str, max_steps: int = 5):
    """ReAct pattern: Thought → Action → Observation."""
    context = [{'role': 'user', 'content': task}]

    for step in range(max_steps):
        # Thought: What should I do next?
        thought_prompt = """You can use these actions:
- SEARCH(query): Search knowledge base
- CALCULATE(expression): Evaluate math
- FINISH(answer): Complete task

What's your next step? Format: THOUGHT: ... ACTION: ..."""
        context.append({'role': 'user', 'content': thought_prompt})

        response = ollama.chat(model='llama3.2', messages=context)
        output = response['message']['content']
        context.append({'role': 'assistant', 'content': output})

        # Parse action
        if 'FINISH' in output:
            return output.split('FINISH(')[1].split(')')[0]
        elif 'SEARCH' in output:
            query = output.split('SEARCH(')[1].split(')')[0]
            result = search_kb(query)  # Your search function
            context.append({'role': 'user',
                            'content': f'OBSERVATION: {result}'})
        # ... handle other actions

    return "Max steps reached"

This separation makes the system debuggable: you can inspect the reasoning chain and identify where things went wrong.
Principle 3: Design for Graceful Degradation¶
AI systems fail. Components become unavailable. Models hallucinate. Design for it:
from typing import Optional
import logging

class ResilientRAG:
    def __init__(self, primary_model='llama3.2', fallback_model='llama3.1'):
        self.primary = primary_model
        self.fallback = fallback_model

    def retrieve_with_fallback(self, query: str) -> Optional[str]:
        """Try multiple retrieval strategies."""
        try:
            # Primary: Vector search (implemented elsewhere)
            results = self.vector_search(query)
            if results:
                return results
        except Exception as e:
            logging.warning(f"Vector search failed: {e}")

        try:
            # Fallback: Keyword search (implemented elsewhere)
            return self.keyword_search(query)
        except Exception as e:
            logging.error(f"All retrieval failed: {e}")
            return None

    def generate(self, query: str, context: Optional[str]) -> str:
        """Generate with or without context."""
        if context:
            prompt = f"Context: {context}\n\nQuestion: {query}"
        else:
            prompt = f"Question: {query}\n(Note: No context available)"

        try:
            response = ollama.chat(model=self.primary,
                                   messages=[{'role': 'user', 'content': prompt}])
        except Exception:
            # Fallback model
            response = ollama.chat(model=self.fallback,
                                   messages=[{'role': 'user', 'content': prompt}])
        return response['message']['content']

Pattern: Query Understanding and Decomposition¶
Complex queries often need decomposition before processing. Let’s build a query analyzer:
import json

def decompose_query(query: str) -> dict:
    """Break complex queries into sub-questions."""
    prompt = f"""Analyze this query and break it into atomic sub-questions.
Format as JSON: {{"sub_questions": [...], "requires_synthesis": bool}}

Query: {query}
Analysis:"""
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'system', 'content': 'You output only valid JSON.'},
        {'role': 'user', 'content': prompt}
    ])
    return json.loads(response['message']['content'])

# Example: Interdisciplinary query
query = "How did the printing press affect Renaissance art and what parallels exist with AI's impact on modern design?"
decomposed = decompose_query(query)
# Result:
# {
#   "sub_questions": [
#     "What was the impact of the printing press on Renaissance art?",
#     "How is AI impacting modern design?",
#     "What are the structural similarities between these two technological shifts?"
#   ],
#   "requires_synthesis": true
# }

This pattern is powerful for research assistants, educational tools, and any application dealing with multi-faceted questions.
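Here is a sketch of how the decomposition feeds downstream answering, assuming the caller supplies an answer_question function (for example, the RAG query() method developed below):

def answer_decomposed(query: str, answer_question) -> str:
    """Answer each sub-question, then optionally synthesize (sketch)."""
    plan = decompose_query(query)
    partial_answers = [answer_question(q) for q in plan['sub_questions']]
    if not plan.get('requires_synthesis'):
        return '\n\n'.join(partial_answers)

    # Ask the model to weave the partial answers into one response
    synthesis_prompt = (
        "Combine these partial answers into one coherent response.\n\n"
        + "\n\n".join(partial_answers)
        + f"\n\nOriginal question: {query}\nAnswer:"
    )
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': synthesis_prompt}
    ])
    return response['message']['content']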
Pattern: Retrieval-Augmented Generation (RAG) Design¶
RAG is the workhorse of modern AI applications. But naive RAG often fails. Let’s build a robust version:
The RAG Triangle¶
Every RAG system must balance three concerns:
Retrieval Quality: Getting the right documents
Context Management: Fitting information within token limits
Generation Fidelity: Staying grounded in retrieved content
import numpy as np
from typing import List, Tuple

class AdvancedRAG:
    def __init__(self, model='llama3.2', chunk_size=500):
        self.model = model
        self.chunk_size = chunk_size
        self.knowledge_base = []  # List of (text, embedding) tuples

    def add_document(self, doc: str):
        """Chunk and embed document."""
        chunks = self._chunk_text(doc, self.chunk_size)
        for chunk in chunks:
            embedding = self._embed(chunk)
            self.knowledge_base.append((chunk, embedding))

    def _chunk_text(self, text: str, size: int) -> List[str]:
        """Fixed-size word chunking (simplified)."""
        words = text.split()
        return [' '.join(words[i:i+size])
                for i in range(0, len(words), size)]

    def _embed(self, text: str) -> np.ndarray:
        """Get embedding from Ollama."""
        response = ollama.embeddings(model='nomic-embed-text', prompt=text)
        return np.array(response['embedding'])

    def retrieve(self, query: str, top_k: int = 3) -> List[str]:
        """Retrieve most relevant chunks."""
        query_emb = self._embed(query)

        # Compute cosine similarities
        similarities = []
        for chunk, chunk_emb in self.knowledge_base:
            sim = np.dot(query_emb, chunk_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb)
            )
            similarities.append((chunk, sim))

        # Sort and return top-k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [chunk for chunk, _ in similarities[:top_k]]

    def query(self, question: str) -> str:
        """RAG query with source attribution."""
        # Retrieve
        context_chunks = self.retrieve(question, top_k=3)
        context = "\n\n".join([f"[{i+1}] {chunk}"
                               for i, chunk in enumerate(context_chunks)])

        # Generate
        prompt = f"""Answer the question using ONLY the provided context.
Cite sources using [1], [2], [3].
If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}
Answer:"""
        response = ollama.chat(model=self.model, messages=[
            {'role': 'user', 'content': prompt}
        ])
        return response['message']['content']

# Usage example
rag = AdvancedRAG()
rag.add_document("""
The Bauhaus school, founded in 1919 in Weimar, Germany,
revolutionized design education by integrating art, craft, and technology.
Its influence on modern design is immeasurable, establishing principles
of form following function and geometric simplicity.
""")

answer = rag.query("When was Bauhaus founded?")
print(answer)  # Should cite [1] and give 1919

Improving RAG: Hypothetical Document Embeddings (HyDE)¶
A clever technique: instead of embedding the query directly, first generate a hypothetical answer, then embed that:
def hyde_retrieve(self, query: str, top_k: int = 3) -> List[str]:
    """HyDE: Generate hypothetical answer, then retrieve. (Method of AdvancedRAG.)"""
    # Step 1: Generate hypothetical answer
    hyp_prompt = f"Write a detailed answer to: {query}"
    hyp_response = ollama.chat(model=self.model, messages=[
        {'role': 'user', 'content': hyp_prompt}
    ])
    hypothetical_answer = hyp_response['message']['content']

    # Step 2: Embed hypothetical answer
    hyp_emb = self._embed(hypothetical_answer)

    # Step 3: Retrieve using hypothetical embedding
    similarities = []
    for chunk, chunk_emb in self.knowledge_base:
        sim = np.dot(hyp_emb, chunk_emb) / (
            np.linalg.norm(hyp_emb) * np.linalg.norm(chunk_emb)
        )
        similarities.append((chunk, sim))

    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in similarities[:top_k]]

Why does this work? Because hypothetical answers often use vocabulary and phrasing closer to actual documents than queries do.
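As a quick sketch of the difference (hypothetical usage, assuming hyde_retrieve is attached to the AdvancedRAG class above and the Bauhaus document has already been added):

# Attach the method, then compare retrieval on a query whose wording
# differs from the source text.
AdvancedRAG.hyde_retrieve = hyde_retrieve

question = "Who shaped modern design education?"
plain_hits = rag.retrieve(question, top_k=2)
hyde_hits = rag.hyde_retrieve(question, top_k=2)
# The hypothetical answer tends to mention terms like "school", "founded",
# and "design education", so its embedding often lands closer to the
# relevant chunks than the terse question does.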
Pattern: Tool Use and Function Calling¶
Modern LLMs can use tools—but only if we design the interface carefully.
The Tool Contract¶
Every tool needs a clear specification:
from typing import Callable
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str
    parameters: dict
    function: Callable

# Define tools
def calculate(expression: str) -> float:
    """Safely evaluate mathematical expressions."""
    import ast
    import operator as op

    # Supported operators
    operators = {
        ast.Add: op.add, ast.Sub: op.sub,
        ast.Mult: op.mul, ast.Div: op.truediv,
        ast.Pow: op.pow
    }

    def eval_expr(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        elif isinstance(node, ast.BinOp):
            return operators[type(node.op)](
                eval_expr(node.left), eval_expr(node.right)
            )
        raise ValueError("Unsupported operation")

    return eval_expr(ast.parse(expression, mode='eval').body)

tools = [
    Tool(
        name="calculate",
        description="Evaluate mathematical expressions. Use for any computation.",
        parameters={"expression": "str (e.g., '2 + 2' or '3.14 * 5**2')"},
        function=calculate
    )
]

Tool-Using Agent¶
def agent_with_tools(task: str, tools: List[Tool], max_iter: int = 5) -> str:
    """Agent that can use tools to solve tasks."""
    # Build tool descriptions
    tool_desc = "\n".join([
        f"- {t.name}: {t.description}\n  Parameters: {t.parameters}"
        for t in tools
    ])

    conversation = [{
        'role': 'system',
        'content': f"""You are a helpful assistant with access to tools.
When you need to use a tool, output:
USE_TOOL: tool_name(arg1, arg2, ...)

Available tools:
{tool_desc}

Think step by step. Use tools when needed."""
    }]
    conversation.append({'role': 'user', 'content': task})

    for iteration in range(max_iter):
        # Get LLM response
        response = ollama.chat(model='llama3.2', messages=conversation)
        output = response['message']['content']
        conversation.append({'role': 'assistant', 'content': output})

        # Check if tool use is requested
        if 'USE_TOOL:' in output:
            # Parse tool call
            tool_line = [l for l in output.split('\n') if 'USE_TOOL:' in l][0]
            tool_call = tool_line.split('USE_TOOL:')[1].strip()
            tool_name = tool_call.split('(')[0]
            tool_args = tool_call.split('(')[1].split(')')[0]

            # Execute tool
            tool = next(t for t in tools if t.name == tool_name)
            result = tool.function(tool_args)

            # Feed result back
            conversation.append({
                'role': 'user',
                'content': f"Tool result: {result}"
            })
        else:
            # No tool use, we're done
            return output

    return "Max iterations reached"

# Example
task = "What's 15% of 340 plus 22?"
result = agent_with_tools(task, tools)
print(result)

Pattern: Multi-Step Reasoning with Chain-of-Thought¶
For complex reasoning, make the thinking process explicit:
def chain_of_thought(problem: str) -> str:
    """Solve problems with explicit reasoning steps."""
    prompt = f"""Solve this problem step by step.
Format:
Step 1: [describe what you're doing]
Step 2: [next step]
...
Final Answer: [your answer]

Problem: {problem}
Solution:"""
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': prompt}
    ])
    return response['message']['content']

# Example: Logic puzzle
problem = """
If all bloops are razzies and all razzies are lazzies,
and no lazzies are kazzies, can a bloop be a kazzie?
"""
solution = chain_of_thought(problem)

Tree-of-Thoughts: Exploring Multiple Reasoning Paths¶
For even harder problems, explore multiple reasoning paths:
def tree_of_thoughts(problem: str, branches: int = 3) -> str:
    """Explore multiple solution paths, select best."""
    paths = []
    for i in range(branches):
        prompt = f"""Solve this problem. Try approach #{i+1}.
Think creatively. Show your reasoning.

Problem: {problem}
Solution:"""
        response = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 'content': prompt}
        ])
        paths.append(response['message']['content'])

    # Evaluate paths
    attempts = "\n".join([f"Attempt {i+1}:\n{p}\n" for i, p in enumerate(paths)])
    eval_prompt = f"""You saw {branches} solution attempts:
{attempts}
Which solution is most correct? Explain why and provide the final answer.
Evaluation:"""
    evaluation = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': eval_prompt}
    ])
    return evaluation['message']['content']

This pattern is powerful for creative problems, mathematical proofs, and strategic planning.
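As with the other patterns, a quick usage sketch (the selected solution will vary from run to run):

# Usage (reusing the logic puzzle defined above)
best = tree_of_thoughts(problem, branches=3)
print(best)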
Pattern: Iterative Refinement¶
Sometimes the first output isn’t quite right. Build in self-correction:
def iterative_refinement(task: str, max_refinements: int = 2) -> str:
    """Generate, critique, refine."""
    # Initial generation
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': task}
    ])
    output = response['message']['content']

    for i in range(max_refinements):
        # Self-critique
        critique_prompt = f"""Review this output for the task.
Identify 2-3 specific issues or improvements needed.

Task: {task}
Output: {output}
Critique:"""
        critique = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 'content': critique_prompt}
        ])['message']['content']

        # Refine
        refine_prompt = f"""Improve this output based on the critique.

Original task: {task}
Previous output: {output}
Critique: {critique}

Improved output:"""
        response = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 'content': refine_prompt}
        ])
        output = response['message']['content']

    return output

# Usage
task = "Write a haiku about neural networks"
refined = iterative_refinement(task)

Design Pattern: Semantic Routing¶
Not all queries need the same processing. Route intelligently:
from enum import Enum

class QueryType(Enum):
    SIMPLE_FACT = "simple_fact"
    COMPLEX_REASONING = "complex_reasoning"
    CREATIVE = "creative"
    COMPUTATIONAL = "computational"

class SemanticRouter:
    def __init__(self, model='llama3.2'):
        self.model = model

    def classify(self, query: str) -> QueryType:
        """Classify query type."""
        prompt = f"""Classify this query into ONE category:
- SIMPLE_FACT: Direct factual question
- COMPLEX_REASONING: Requires multi-step logic
- CREATIVE: Creative writing or brainstorming
- COMPUTATIONAL: Math or data analysis

Query: {query}
Category:"""
        response = ollama.chat(model=self.model, messages=[
            {'role': 'user', 'content': prompt}
        ])
        result = response['message']['content'].strip().upper()

        if 'SIMPLE_FACT' in result:
            return QueryType.SIMPLE_FACT
        elif 'COMPLEX' in result:
            return QueryType.COMPLEX_REASONING
        elif 'CREATIVE' in result:
            return QueryType.CREATIVE
        else:
            return QueryType.COMPUTATIONAL

    def route(self, query: str) -> str:
        """Route to appropriate handler."""
        query_type = self.classify(query)

        if query_type == QueryType.SIMPLE_FACT:
            return self.handle_simple(query)
        elif query_type == QueryType.COMPLEX_REASONING:
            return chain_of_thought(query)
        elif query_type == QueryType.CREATIVE:
            return self.handle_creative(query)
        else:
            return agent_with_tools(query, tools)

    def handle_simple(self, query: str) -> str:
        """Fast path for simple queries."""
        response = ollama.chat(model=self.model, messages=[
            {'role': 'user', 'content': f"Answer briefly: {query}"}
        ])
        return response['message']['content']

    def handle_creative(self, query: str) -> str:
        """Creative generation with higher temperature."""
        response = ollama.chat(
            model=self.model,
            messages=[{'role': 'user', 'content': query}],
            options={'temperature': 0.8}
        )
        return response['message']['content']

# Usage
router = SemanticRouter()
print(router.route("What's the capital of France?"))       # Simple
print(router.route("Design a sustainable city for 2050"))  # Creative

Pattern: Context Management and Memory¶
AI applications need memory across turns. Let’s build a conversation manager:
class ConversationMemory:
    def __init__(self, max_tokens: int = 2000):
        self.messages = []
        self.max_tokens = max_tokens

    def add(self, role: str, content: str):
        """Add message to history."""
        self.messages.append({'role': role, 'content': content})
        self._trim_if_needed()

    def _trim_if_needed(self):
        """Keep only recent messages within token limit."""
        # Rough token estimation: 1 token ≈ 4 chars
        total_chars = sum(len(m['content']) for m in self.messages)
        estimated_tokens = total_chars / 4

        while estimated_tokens > self.max_tokens and len(self.messages) > 2:
            # Remove oldest non-system message
            for i, msg in enumerate(self.messages):
                if msg['role'] != 'system':
                    self.messages.pop(i)
                    break
            total_chars = sum(len(m['content']) for m in self.messages)
            estimated_tokens = total_chars / 4

    def get_messages(self):
        return self.messages

    def summarize_and_compress(self):
        """Compress old messages into summary."""
        if len(self.messages) < 5:
            return

        # Take old messages (except most recent 2)
        old_messages = self.messages[:-2]
        recent_messages = self.messages[-2:]

        # Summarize
        conversation_text = "\n".join([
            f"{m['role']}: {m['content']}" for m in old_messages
        ])
        summary_response = ollama.chat(model='llama3.2', messages=[
            {'role': 'user',
             'content': f"Summarize this conversation:\n{conversation_text}"}
        ])
        summary = summary_response['message']['content']

        # Replace with summary
        self.messages = [
            {'role': 'system', 'content': f"Previous conversation summary: {summary}"}
        ] + recent_messages

# Usage
memory = ConversationMemory()
memory.add('system', 'You are a helpful assistant.')
memory.add('user', 'What is machine learning?')
memory.add('assistant', 'Machine learning is...')
memory.add('user', 'Can you give an example?')

response = ollama.chat(model='llama3.2', messages=memory.get_messages())
memory.add('assistant', response['message']['content'])

Semantic Memory: Remember What Matters¶
Instead of keeping chronological history, index by semantic meaning:
import time

class SemanticMemory:
    def __init__(self):
        self.memories = []  # List of (content, embedding, timestamp)

    def remember(self, content: str):
        """Store with semantic embedding."""
        embedding = self._embed(content)
        self.memories.append((content, embedding, time.time()))

    def recall(self, query: str, top_k: int = 3) -> List[str]:
        """Retrieve semantically similar memories."""
        query_emb = self._embed(query)

        similarities = []
        for content, emb, timestamp in self.memories:
            sim = np.dot(query_emb, emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(emb)
            )
            # Decay older memories slightly
            age_hours = (time.time() - timestamp) / 3600
            decayed_sim = sim * (0.99 ** age_hours)
            similarities.append((content, decayed_sim))

        similarities.sort(key=lambda x: x[1], reverse=True)
        return [content for content, _ in similarities[:top_k]]

    def _embed(self, text: str) -> np.ndarray:
        response = ollama.embeddings(model='nomic-embed-text', prompt=text)
        return np.array(response['embedding'])

Pattern: Multi-Agent Collaboration¶
When tasks are complex, divide and conquer with specialized agents:
from typing import List

class Agent:
    def __init__(self, name: str, role: str, model='llama3.2'):
        self.name = name
        self.role = role
        self.model = model

    def respond(self, context: str, task: str) -> str:
        """Generate response based on role."""
        prompt = f"""You are {self.name}, {self.role}.

Context: {context}
Task: {task}

Your response:"""
        response = ollama.chat(model=self.model, messages=[
            {'role': 'user', 'content': prompt}
        ])
        return response['message']['content']

class MultiAgentSystem:
    def __init__(self, agents: List[Agent]):
        self.agents = agents
        self.conversation_history = []

    def collaborate(self, task: str, rounds: int = 2) -> str:
        """Agents take turns addressing the task."""
        context = f"Initial task: {task}"

        for round_num in range(rounds):
            for agent in self.agents:
                response = agent.respond(context, task)
                self.conversation_history.append(
                    f"{agent.name}: {response}"
                )
                context += f"\n{agent.name}: {response}"

        # Final synthesis
        synthesis_prompt = f"""Review this multi-agent discussion and provide
a final synthesized answer.

Discussion:
{context}

Synthesized answer:"""
        final = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 'content': synthesis_prompt}
        ])
        return final['message']['content']

# Example: Interdisciplinary analysis
agents = [
    Agent("Historian", "an expert in Renaissance history"),
    Agent("Technologist", "an expert in printing technology"),
    Agent("Art Critic", "an expert in art history and criticism")
]

system = MultiAgentSystem(agents)
result = system.collaborate(
    "How did printing technology influence Renaissance art?"
)

This pattern shines for problems requiring multiple perspectives: design reviews, research synthesis, strategic planning.
Error Handling and Reliability Patterns¶
Production AI systems must handle failures gracefully:
from typing import Optional, Callable
import logging
import time

class ReliableAI:
    def __init__(self, model='llama3.2', max_retries: int = 3):
        self.model = model
        self.max_retries = max_retries

    def call_with_retry(
        self,
        messages: list,
        validator: Optional[Callable] = None
    ) -> str:
        """Call LLM with exponential backoff retry."""
        for attempt in range(self.max_retries):
            try:
                response = ollama.chat(model=self.model, messages=messages)
                result = response['message']['content']

                # Validate if validator provided
                if validator and not validator(result):
                    raise ValueError("Output failed validation")

                return result
            except Exception as e:
                if attempt == self.max_retries - 1:
                    logging.error(f"Failed after {self.max_retries} attempts: {e}")
                    raise
                wait_time = 2 ** attempt  # Exponential backoff
                logging.warning(f"Attempt {attempt + 1} failed, retrying in {wait_time}s")
                time.sleep(wait_time)

    def call_with_fallback(
        self,
        messages: