Complex systems have a habit of reinventing fundamental abstractions, and modern AI applications are no exception. Every sufficiently complex chatbot ends up recreating a loosely defined operating system of its own. Our job as designers is to build that operating system intentionally and well.

In previous chapters, we’ve explored the individual components: prompting strategies, retrieval augmentation, tool use, and agentic workflows. Now we face the central challenge of modern AI engineering: how do we compose these pieces into coherent, reliable, and maintainable systems?

This chapter is about architecture in the truest sense—not just code structure, but the art of making principled design decisions that balance competing concerns: accuracy vs. latency, flexibility vs. reliability, simplicity vs. capability.

The Landscape of AI Application Patterns

Before we dive into design principles, let’s map the terrain. Modern AI applications exist on a spectrum of complexity:

The Complexity Spectrum

Level 0: Direct Prompting
The simplest pattern—send a prompt, get a response. Suitable for well-defined tasks with clear success criteria.

import ollama

def summarize(text: str) -> str:
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': f'Summarize: {text}'}
    ])
    return response['message']['content']

Level 1: Prompt Chaining
Multiple LLM calls in sequence, each building on previous outputs. Think of it as a pipeline.

def analyze_sentiment_with_reasoning(review: str) -> dict:
    # Step 1: Extract key aspects
    aspects = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': 
         f'List the main aspects discussed in this review: {review}'}
    ])['message']['content']
    
    # Step 2: Analyze sentiment per aspect
    sentiment = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': 
         f'For each aspect, rate sentiment (positive/negative/neutral):\n{aspects}'}
    ])['message']['content']
    
    return {'aspects': aspects, 'sentiment': sentiment}

Level 2: Retrieval-Augmented Generation (RAG)
Augment prompts with retrieved context from external knowledge bases.
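In miniature, a RAG call just prepends retrieved text to the prompt. The sketch below assumes a hypothetical retrieve_context helper; a full implementation appears later in this chapter.

import ollama

def rag_answer(question: str, retrieve_context) -> str:
    """Minimal RAG: fetch context, then answer grounded in it."""
    context = retrieve_context(question)  # hypothetical retrieval helper
    prompt = f"Context:\n{context}\n\nAnswer using only the context: {question}"
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': prompt}
    ])
    return response['message']['content']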

Level 3: Tool-Using Agents
LLMs that can call functions, query APIs, and interact with external systems.

Level 4: Multi-Agent Systems
Multiple specialized agents collaborating, debating, or working in parallel.

Each level adds power—and complexity. The designer’s challenge is choosing the minimum complexity that solves the problem reliably.

Core Design Principles

Principle 1: Decompose by Capability, Not Sequence

A common mistake is designing AI systems as linear pipelines when they should be capability graphs. Consider a research assistant:

Poor Design (Sequential):

Query → Retrieve → Summarize → Answer

Better Design (Capability-Based):

Query → [Route] → Retrieve OR Calculate OR Web Search
                ↓
           Synthesize → Answer

The key insight: not every query needs every capability. Routing early saves computation and reduces error accumulation.

def route_query(query: str) -> str:
    """Determine what capability is needed."""
    routing_prompt = f"""Classify this query into ONE category:
- FACTUAL: Needs document retrieval
- COMPUTATIONAL: Needs calculation
- CURRENT: Needs web search
- CREATIVE: Needs generation only

Query: {query}
Category:"""
    
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': routing_prompt}
    ])
    return response['message']['content'].strip()

# Usage
query = "What's the population density of Tokyo?"
route = route_query(query)  # e.g. 'COMPUTATIONAL' (density = population / area); exact label is model-dependent

Principle 2: Separate Planning from Execution

The “ReAct” pattern (Reasoning + Acting) teaches us to make the LLM’s decision-making process explicit:

def react_loop(task: str, max_steps: int = 5):
    """ReAct pattern: Thought → Action → Observation."""
    context = [{'role': 'user', 'content': task}]
    
    for step in range(max_steps):
        # Thought: What should I do next?
        thought_prompt = """You can use these actions:
- SEARCH(query): Search knowledge base
- CALCULATE(expression): Evaluate math
- FINISH(answer): Complete task

What's your next step? Format: THOUGHT: ... ACTION: ..."""
        
        context.append({'role': 'user', 'content': thought_prompt})
        response = ollama.chat(model='llama3.2', messages=context)
        output = response['message']['content']
        context.append({'role': 'assistant', 'content': output})
        
        # Parse action
        if 'FINISH(' in output:
            return output.split('FINISH(')[1].split(')')[0]
        elif 'SEARCH(' in output:
            query = output.split('SEARCH(')[1].split(')')[0]
            result = search_kb(query)  # Your search function
            context.append({'role': 'user', 
                          'content': f'OBSERVATION: {result}'})
        # ... handle other actions
    
    return "Max steps reached"

This separation makes the system debuggable—you can inspect the reasoning chain and identify where things went wrong.
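One lightweight way to get that visibility is to dump the accumulated message list after a run. The helper below is a sketch, not part of the pattern itself; the loop above would need to return or expose its context list for you to pass in.

def print_trace(messages: list) -> None:
    """Print a ReAct conversation so each thought, action, and observation is visible."""
    for i, msg in enumerate(messages):
        label = msg['role'].upper()
        if msg['content'].startswith('OBSERVATION:'):
            label = 'OBSERVATION'  # observations come back as user messages
        print(f"--- step {i} [{label}] ---")
        print(msg['content'][:500])  # truncate long entries for readability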

Principle 3: Design for Graceful Degradation

AI systems fail. Components become unavailable. Models hallucinate. Design for it:

from typing import Optional
import logging

class ResilientRAG:
    def __init__(self, primary_model='llama3.2', fallback_model='llama3.1'):
        self.primary = primary_model
        self.fallback = fallback_model
    
    def retrieve_with_fallback(self, query: str) -> Optional[str]:
        """Try multiple retrieval strategies."""
        try:
            # Primary: Vector search
            results = self.vector_search(query)
            if results:
                return results
        except Exception as e:
            logging.warning(f"Vector search failed: {e}")
        
        try:
            # Fallback: Keyword search
            return self.keyword_search(query)
        except Exception as e:
            logging.error(f"All retrieval failed: {e}")
            return None
    
    def generate(self, query: str, context: Optional[str]) -> str:
        """Generate with or without context."""
        if context:
            prompt = f"Context: {context}\n\nQuestion: {query}"
        else:
            prompt = f"Question: {query}\n(Note: No context available)"
        
        try:
            response = ollama.chat(model=self.primary,
                                   messages=[{'role': 'user', 'content': prompt}])
        except Exception:
            # Fall back to the secondary model
            response = ollama.chat(model=self.fallback,
                                   messages=[{'role': 'user', 'content': prompt}])
        return response['message']['content']
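A usage sketch, assuming vector_search and keyword_search are implemented on the class (they are left abstract above):

rag = ResilientRAG()
question = "What did the Bauhaus school teach?"
context = rag.retrieve_with_fallback(question)
if context is None:
    logging.warning("No context retrieved; answering from the model alone")
print(rag.generate(question, context))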

Pattern: Query Understanding and Decomposition

Complex queries often need decomposition before processing. Let’s build a query analyzer:

def decompose_query(query: str) -> dict:
    """Break complex queries into sub-questions."""
    prompt = f"""Analyze this query and break it into atomic sub-questions.
Format as JSON: {{"sub_questions": [...], "requires_synthesis": bool}}

Query: {query}
Analysis:"""
    
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'system', 'content': 'You output only valid JSON.'},
        {'role': 'user', 'content': prompt}
    ])
    
    import json
    # Note: json.loads will raise if the model wraps the JSON in extra text;
    # in production, extract the JSON substring or retry on parse failure.
    return json.loads(response['message']['content'])

# Example: Interdisciplinary query
query = "How did the printing press affect Renaissance art and what parallels exist with AI's impact on modern design?"

decomposed = decompose_query(query)
# Example result (exact wording varies by model):
# {
#   "sub_questions": [
#     "What was the impact of the printing press on Renaissance art?",
#     "How is AI impacting modern design?",
#     "What are the structural similarities between these two technological shifts?"
#   ],
#   "requires_synthesis": true
# }

This pattern is powerful for research assistants, educational tools, and any application dealing with multi-faceted questions.
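Decomposition only pays off if you then answer each sub-question and stitch the results back together. A minimal follow-up sketch, reusing decompose_query from above (the prompt wording is illustrative):

def answer_decomposed(query: str) -> str:
    """Answer each sub-question, then synthesize if the analysis calls for it."""
    analysis = decompose_query(query)
    partial_answers = []
    for sub_q in analysis['sub_questions']:
        response = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 'content': sub_q}
        ])
        partial_answers.append(f"Q: {sub_q}\nA: {response['message']['content']}")
    
    if not analysis.get('requires_synthesis', False):
        return "\n\n".join(partial_answers)
    
    # Synthesize the partial answers into one coherent response
    synthesis_prompt = ("Synthesize these partial answers into one coherent response.\n\n"
                        + "\n\n".join(partial_answers)
                        + f"\n\nOriginal question: {query}")
    synthesis = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': synthesis_prompt}
    ])
    return synthesis['message']['content']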

Pattern: Retrieval-Augmented Generation (RAG) Design

RAG is the workhorse of modern AI applications. But naive RAG often fails. Let’s build a robust version:

The RAG Triangle

Every RAG system must balance three concerns:

  1. Retrieval Quality: Getting the right documents

  2. Context Management: Fitting information within token limits

  3. Generation Fidelity: Staying grounded in retrieved content

import numpy as np
from typing import List, Tuple

class AdvancedRAG:
    def __init__(self, model='llama3.2', chunk_size=500):
        self.model = model
        self.chunk_size = chunk_size  # chunk size measured in words, not tokens
        self.knowledge_base = []  # List of (text, embedding) tuples
    
    def add_document(self, doc: str):
        """Chunk and embed document."""
        chunks = self._chunk_text(doc, self.chunk_size)
        for chunk in chunks:
            embedding = self._embed(chunk)
            self.knowledge_base.append((chunk, embedding))
    
    def _chunk_text(self, text: str, size: int) -> List[str]:
        """Semantic chunking (simplified)."""
        words = text.split()
        return [' '.join(words[i:i+size]) 
                for i in range(0, len(words), size)]
    
    def _embed(self, text: str) -> np.ndarray:
        """Get embedding from Ollama."""
        response = ollama.embeddings(model='nomic-embed-text', prompt=text)
        return np.array(response['embedding'])
    
    def retrieve(self, query: str, top_k: int = 3) -> List[str]:
        """Retrieve most relevant chunks."""
        query_emb = self._embed(query)
        
        # Compute similarities
        similarities = []
        for chunk, chunk_emb in self.knowledge_base:
            sim = np.dot(query_emb, chunk_emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(chunk_emb)
            )
            similarities.append((chunk, sim))
        
        # Sort and return top-k
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [chunk for chunk, _ in similarities[:top_k]]
    
    def query(self, question: str) -> str:
        """RAG query with source attribution."""
        # Retrieve
        context_chunks = self.retrieve(question, top_k=3)
        context = "\n\n".join([f"[{i+1}] {chunk}" 
                               for i, chunk in enumerate(context_chunks)])
        
        # Generate
        prompt = f"""Answer the question using ONLY the provided context.
Cite sources using [1], [2], [3].
If the context doesn't contain the answer, say so.

Context:
{context}

Question: {question}

Answer:"""
        
        response = ollama.chat(model=self.model, messages=[
            {'role': 'user', 'content': prompt}
        ])
        
        return response['message']['content']

# Usage example
rag = AdvancedRAG()
rag.add_document("""
The Bauhaus school, founded in 1919 in Weimar, Germany, 
revolutionized design education by integrating art, craft, and technology.
Its influence on modern design is immeasurable, establishing principles
of form following function and geometric simplicity.
""")

answer = rag.query("When was Bauhaus founded?")
print(answer)  # Should cite [1] and give 1919

Improving RAG: Hypothetical Document Embeddings (HyDE)

A clever technique: instead of embedding the query directly, first generate a hypothetical answer, then embed that:

def hyde_retrieve(self, query: str, top_k: int = 3) -> List[str]:
    """HyDE: generate a hypothetical answer, then retrieve with its embedding.

    Written as a method of AdvancedRAG (note the self parameter); add it to
    the class body above or attach it to the class.
    """
    # Step 1: Generate hypothetical answer
    hyp_prompt = f"Write a detailed answer to: {query}"
    hyp_response = ollama.chat(model=self.model, messages=[
        {'role': 'user', 'content': hyp_prompt}
    ])
    hypothetical_answer = hyp_response['message']['content']
    
    # Step 2: Embed hypothetical answer
    hyp_emb = self._embed(hypothetical_answer)
    
    # Step 3: Retrieve using hypothetical embedding
    similarities = []
    for chunk, chunk_emb in self.knowledge_base:
        sim = np.dot(hyp_emb, chunk_emb) / (
            np.linalg.norm(hyp_emb) * np.linalg.norm(chunk_emb)
        )
        similarities.append((chunk, sim))
    
    similarities.sort(key=lambda x: x[1], reverse=True)
    return [chunk for chunk, _ in similarities[:top_k]]

Why does this work? Because hypothetical answers often use vocabulary and phrasing closer to actual documents than queries do.
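Because hyde_retrieve is written as a method, one quick way to experiment without editing the class is to attach it dynamically and compare it against plain retrieval; a sketch:

# Attach the HyDE variant to the existing class for a quick comparison
AdvancedRAG.hyde_retrieve = hyde_retrieve

rag = AdvancedRAG()
rag.add_document("The Bauhaus school, founded in 1919 in Weimar, Germany, ...")
question = "What design principles did the Bauhaus establish?"
plain_chunks = rag.retrieve(question, top_k=3)
hyde_chunks = rag.hyde_retrieve(question, top_k=3)  # often matches document phrasing better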

Pattern: Tool Use and Function Calling

Modern LLMs can use tools—but only if we design the interface carefully.

The Tool Contract

Every tool needs a clear specification:

from typing import Callable, Any
from dataclasses import dataclass

@dataclass
class Tool:
    name: str
    description: str
    parameters: dict
    function: Callable

# Define tools
def calculate(expression: str) -> float:
    """Safely evaluate mathematical expressions."""
    import ast
    import operator as op
    
    # Supported operators
    operators = {
        ast.Add: op.add, ast.Sub: op.sub,
        ast.Mult: op.mul, ast.Div: op.truediv,
        ast.Pow: op.pow
    }
    
    def eval_expr(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        elif isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -eval_expr(node.operand)  # handles negative numbers like -3
        elif isinstance(node, ast.BinOp):
            return operators[type(node.op)](
                eval_expr(node.left), eval_expr(node.right)
            )
        raise ValueError("Unsupported operation")
    
    return eval_expr(ast.parse(expression, mode='eval').body)

tools = [
    Tool(
        name="calculate",
        description="Evaluate mathematical expressions. Use for any computation.",
        parameters={"expression": "str (e.g., '2 + 2' or '3.14 * 5**2')"},
        function=calculate
    )
]

Tool-Using Agent

def agent_with_tools(task: str, tools: List[Tool], max_iter: int = 5) -> str:
    """Agent that can use tools to solve tasks."""
    # Build tool descriptions
    tool_desc = "\n".join([
        f"- {t.name}: {t.description}\n  Parameters: {t.parameters}"
        for t in tools
    ])
    
    conversation = [{
        'role': 'system',
        'content': f"""You are a helpful assistant with access to tools.
When you need to use a tool, output:
USE_TOOL: tool_name(arg1, arg2, ...)

Available tools:
{tool_desc}

Think step by step. Use tools when needed."""
    }]
    
    conversation.append({'role': 'user', 'content': task})
    
    for iteration in range(max_iter):
        # Get LLM response
        response = ollama.chat(model='llama3.2', messages=conversation)
        output = response['message']['content']
        conversation.append({'role': 'assistant', 'content': output})
        
        # Check if tool use is requested
        if 'USE_TOOL:' in output:
            # Parse tool call
            tool_line = [l for l in output.split('\n') if 'USE_TOOL:' in l][0]
            tool_call = tool_line.split('USE_TOOL:')[1].strip()
            tool_name = tool_call.split('(')[0]
            # Strip surrounding quotes the model may put around the argument
            tool_args = tool_call.split('(')[1].split(')')[0].strip().strip("'\"")
            
            # Execute tool
            tool = next(t for t in tools if t.name == tool_name)
            result = tool.function(tool_args)
            
            # Feed result back
            conversation.append({
                'role': 'user',
                'content': f"Tool result: {result}"
            })
        else:
            # No tool use, we're done
            return output
    
    return "Max iterations reached"

# Example
task = "What's 15% of 340 plus 22?"
result = agent_with_tools(task, tools)
print(result)

Pattern: Multi-Step Reasoning with Chain-of-Thought

For complex reasoning, make the thinking process explicit:

def chain_of_thought(problem: str) -> str:
    """Solve problems with explicit reasoning steps."""
    prompt = f"""Solve this problem step by step.
Format:
Step 1: [describe what you're doing]
Step 2: [next step]
...
Final Answer: [your answer]

Problem: {problem}

Solution:"""
    
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': prompt}
    ])
    return response['message']['content']

# Example: Logic puzzle
problem = """
If all bloops are razzies and all razzies are lazzies,
and no lazzies are kazzies, can a bloop be a kazzie?
"""

solution = chain_of_thought(problem)

Tree-of-Thoughts: Exploring Multiple Reasoning Paths

For even harder problems, explore multiple reasoning paths:

def tree_of_thoughts(problem: str, branches: int = 3) -> str:
    """Explore multiple solution paths, select best."""
    paths = []
    
    for i in range(branches):
        prompt = f"""Solve this problem. Try approach #{i+1}.
Think creatively. Show your reasoning.

Problem: {problem}

Solution:"""
        
        response = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 'content': prompt}
        ])
        paths.append(response['message']['content'])
    
    # Evaluate paths. Build the attempts text first: backslashes are not
    # allowed inside f-string expressions before Python 3.12.
    attempts_text = "\n\n".join(
        f"Attempt {i+1}:\n{p}" for i, p in enumerate(paths)
    )
    eval_prompt = f"""You saw {branches} solution attempts:

{attempts_text}

Which solution is most correct? Explain why and provide the final answer.

Evaluation:"""
    
    evaluation = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': eval_prompt}
    ])
    
    return evaluation['message']['content']
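A usage sketch in the same spirit as the chain-of-thought example (the puzzle is illustrative; expect one generation call per branch plus one evaluation call):

# Example: a planning puzzle with several plausible approaches
problem = """
You have a 3-litre jug, a 5-litre jug, and unlimited water.
How can you measure exactly 4 litres?
"""

best_solution = tree_of_thoughts(problem, branches=3)
print(best_solution)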

This pattern is powerful for creative problems, mathematical proofs, and strategic planning.

Pattern: Iterative Refinement

Sometimes the first output isn’t quite right. Build in self-correction:

def iterative_refinement(task: str, max_refinements: int = 2) -> str:
    """Generate, critique, refine."""
    # Initial generation
    response = ollama.chat(model='llama3.2', messages=[
        {'role': 'user', 'content': task}
    ])
    output = response['message']['content']
    
    for i in range(max_refinements):
        # Self-critique
        critique_prompt = f"""Review this output for the task.
Identify 2-3 specific issues or improvements needed.

Task: {task}
Output: {output}

Critique:"""
        
        critique = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 'content': critique_prompt}
        ])['message']['content']
        
        # Refine
        refine_prompt = f"""Improve this output based on the critique.

Original task: {task}
Previous output: {output}
Critique: {critique}

Improved output:"""
        
        response = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 'content': refine_prompt}
        ])
        output = response['message']['content']
    
    return output

# Usage
task = "Write a haiku about neural networks"
refined = iterative_refinement(task)

Design Pattern: Semantic Routing

Not all queries need the same processing. Route intelligently:

from enum import Enum

class QueryType(Enum):
    SIMPLE_FACT = "simple_fact"
    COMPLEX_REASONING = "complex_reasoning"
    CREATIVE = "creative"
    COMPUTATIONAL = "computational"

class SemanticRouter:
    def __init__(self, model='llama3.2'):
        self.model = model
    
    def classify(self, query: str) -> QueryType:
        """Classify query type."""
        prompt = f"""Classify this query into ONE category:
- SIMPLE_FACT: Direct factual question
- COMPLEX_REASONING: Requires multi-step logic
- CREATIVE: Creative writing or brainstorming
- COMPUTATIONAL: Math or data analysis

Query: {query}
Category:"""
        
        response = ollama.chat(model=self.model, messages=[
            {'role': 'user', 'content': prompt}
        ])
        result = response['message']['content'].strip().upper()
        
        if 'SIMPLE_FACT' in result:
            return QueryType.SIMPLE_FACT
        elif 'COMPLEX' in result:
            return QueryType.COMPLEX_REASONING
        elif 'CREATIVE' in result:
            return QueryType.CREATIVE
        else:
            return QueryType.COMPUTATIONAL
    
    def route(self, query: str) -> str:
        """Route to appropriate handler."""
        query_type = self.classify(query)
        
        if query_type == QueryType.SIMPLE_FACT:
            return self.handle_simple(query)
        elif query_type == QueryType.COMPLEX_REASONING:
            return chain_of_thought(query)
        elif query_type == QueryType.CREATIVE:
            return self.handle_creative(query)
        else:
            return agent_with_tools(query, tools)
    
    def handle_simple(self, query: str) -> str:
        """Fast path for simple queries."""
        response = ollama.chat(model=self.model, messages=[
            {'role': 'user', 'content': f"Answer briefly: {query}"}
        ])
        return response['message']['content']
    
    def handle_creative(self, query: str) -> str:
        """Creative generation with higher temperature."""
        response = ollama.chat(
            model=self.model,
            messages=[{'role': 'user', 'content': query}],
            options={'temperature': 0.8}
        )
        return response['message']['content']

# Usage
router = SemanticRouter()
print(router.route("What's the capital of France?"))  # Simple
print(router.route("Design a sustainable city for 2050"))  # Creative

Pattern: Context Management and Memory

AI applications need memory across turns. Let’s build a conversation manager:

class ConversationMemory:
    def __init__(self, max_tokens: int = 2000):
        self.messages = []
        self.max_tokens = max_tokens
    
    def add(self, role: str, content: str):
        """Add message to history."""
        self.messages.append({'role': role, 'content': content})
        self._trim_if_needed()
    
    def _trim_if_needed(self):
        """Keep only recent messages within token limit."""
        # Rough token estimation: 1 token ≈ 4 chars
        total_chars = sum(len(m['content']) for m in self.messages)
        estimated_tokens = total_chars / 4
        
        while estimated_tokens > self.max_tokens and len(self.messages) > 2:
            # Remove the oldest non-system message
            removed = False
            for i, msg in enumerate(self.messages):
                if msg['role'] != 'system':
                    self.messages.pop(i)
                    removed = True
                    break
            if not removed:
                break  # only system messages remain; nothing left to trim
            total_chars = sum(len(m['content']) for m in self.messages)
            estimated_tokens = total_chars / 4
    
    def get_messages(self):
        return self.messages
    
    def summarize_and_compress(self):
        """Compress old messages into summary."""
        if len(self.messages) < 5:
            return
        
        # Take old messages (except most recent 2)
        old_messages = self.messages[:-2]
        recent_messages = self.messages[-2:]
        
        # Summarize
        conversation_text = "\n".join([
            f"{m['role']}: {m['content']}" for m in old_messages
        ])
        
        summary_response = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 
             'content': f"Summarize this conversation:\n{conversation_text}"}
        ])
        
        summary = summary_response['message']['content']
        
        # Replace with summary
        self.messages = [
            {'role': 'system', 'content': f"Previous conversation summary: {summary}"}
        ] + recent_messages

# Usage
memory = ConversationMemory()
memory.add('system', 'You are a helpful assistant.')
memory.add('user', 'What is machine learning?')
memory.add('assistant', 'Machine learning is...')
memory.add('user', 'Can you give an example?')

response = ollama.chat(model='llama3.2', messages=memory.get_messages())
memory.add('assistant', response['message']['content'])
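The summarize_and_compress method isn't exercised above. In practice you would call it once the history grows past some threshold, for example (the threshold of 10 messages is arbitrary):

# Periodically fold older turns into a summary to keep the context small
if len(memory.get_messages()) > 10:
    memory.summarize_and_compress()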

Semantic Memory: Remember What Matters

Instead of keeping chronological history, index by semantic meaning:

import time

class SemanticMemory:
    def __init__(self):
        self.memories = []  # List of (content, embedding, timestamp) tuples
    
    def remember(self, content: str):
        """Store content with its semantic embedding and a timestamp."""
        embedding = self._embed(content)
        self.memories.append((content, embedding, time.time()))
    
    def recall(self, query: str, top_k: int = 3) -> List[str]:
        """Retrieve semantically similar memories."""
        query_emb = self._embed(query)
        
        similarities = []
        for content, emb, timestamp in self.memories:
            sim = np.dot(query_emb, emb) / (
                np.linalg.norm(query_emb) * np.linalg.norm(emb)
            )
            # Decay older memories slightly (~1% per hour)
            age_hours = (time.time() - timestamp) / 3600
            decayed_sim = sim * (0.99 ** age_hours)
            similarities.append((content, decayed_sim))
        
        similarities.sort(key=lambda x: x[1], reverse=True)
        return [content for content, _ in similarities[:top_k]]
    
    def _embed(self, text: str) -> np.ndarray:
        response = ollama.embeddings(model='nomic-embed-text', prompt=text)
        return np.array(response['embedding'])
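A usage sketch in the style of the earlier examples (the stored facts are illustrative):

# Usage
memory = SemanticMemory()
memory.remember("The user prefers concise, bullet-point answers.")
memory.remember("The user is designing a museum exhibit about the Bauhaus.")
memory.remember("The user's deadline is the end of March.")

relevant = memory.recall("What do I know about the user's project?", top_k=2)
print(relevant)  # Likely surfaces the exhibit and deadline memories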

Pattern: Multi-Agent Collaboration

When tasks are complex, divide and conquer with specialized agents:

from typing import List

class Agent:
    def __init__(self, name: str, role: str, model='llama3.2'):
        self.name = name
        self.role = role
        self.model = model
    
    def respond(self, context: str, task: str) -> str:
        """Generate response based on role."""
        prompt = f"""You are {self.name}, {self.role}.

Context: {context}

Task: {task}

Your response:"""
        
        response = ollama.chat(model=self.model, messages=[
            {'role': 'user', 'content': prompt}
        ])
        return response['message']['content']

class MultiAgentSystem:
    def __init__(self, agents: List[Agent]):
        self.agents = agents
        self.conversation_history = []
    
    def collaborate(self, task: str, rounds: int = 2) -> str:
        """Agents take turns addressing the task."""
        context = f"Initial task: {task}"
        
        for round_num in range(rounds):
            for agent in self.agents:
                response = agent.respond(context, task)
                self.conversation_history.append(
                    f"{agent.name}: {response}"
                )
                context += f"\n{agent.name}: {response}"
        
        # Final synthesis
        synthesis_prompt = f"""Review this multi-agent discussion and provide
a final synthesized answer.

Discussion:
{context}

Synthesized answer:"""
        
        final = ollama.chat(model='llama3.2', messages=[
            {'role': 'user', 'content': synthesis_prompt}
        ])
        
        return final['message']['content']

# Example: Interdisciplinary analysis
agents = [
    Agent("Historian", "an expert in Renaissance history"),
    Agent("Technologist", "an expert in printing technology"),
    Agent("Art Critic", "an expert in art history and criticism")
]

system = MultiAgentSystem(agents)
result = system.collaborate(
    "How did printing technology influence Renaissance art?"
)

This pattern shines for problems requiring multiple perspectives: design reviews, research synthesis, strategic planning.

Error Handling and Reliability Patterns

Production AI systems must handle failures gracefully:

from typing import Optional, Callable
import logging

class ReliableAI:
    def __init__(self, model='llama3.2', max_retries: int = 3):
        self.model = model
        self.max_retries = max_retries
    
    def call_with_retry(
        self, 
        messages: list, 
        validator: Optional[Callable] = None
    ) -> str:
        """Call LLM with exponential backoff retry."""
        import time
        
        for attempt in range(self.max_retries):
            try:
                response = ollama.chat(model=self.model, messages=messages)
                result = response['message']['content']
                
                # Validate if validator provided
                if validator and not validator(result):
                    raise ValueError("Output failed validation")
                
                return result
                
            except Exception as e:
                if attempt == self.max_retries - 1:
                    logging.error(f"Failed after {self.max_retries} attempts: {e}")
                    raise
                
                wait_time = 2 ** attempt  # Exponential backoff
                logging.warning(f"Attempt {attempt + 1} failed, retrying in {wait_time}s")
                time.sleep(wait_time)
    
    def call_with_fallback(
        self,
        messages: