Prompting - Programming GenAI

Learning Objectives

After reading this chapter, you will understand:

Prompt engineering and why the structure and wording of prompts fundamentally shapes LLM output quality, drawing on the distinction between how humans understand language and how LLMs perform statistical next-token prediction.
How to configure critical model parameters (temperature, top-k, top-p) to control output randomness and creativity, and justify parameter choices based on specific use cases such as deterministic code generation versus exploratory brainstorming.
The relationship between model parameters and software development lifecycle (SDLC) phases, selecting appropriate temperature and sampling settings for requirements gathering, implementation, testing, and deployment tasks.
Fundamental limitations of LLM reasoning, including susceptibility to misguided attention and pattern-matching errors, and understand the need for verification strategies to catch confident but incorrect outputs.

Conversations with Machines¶

When you first encounter a large language model, the experience can feel almost magical. You type a question, and back comes an answer that sounds remarkably human. But as you use these systems more, you begin to notice something curious: the way you ask the question matters just as much as what you’re asking.

This is the domain of prompting, commonly referred to as prompt engineering—the craft of designing inputs that guide language models to produce exactly the outputs you need.

Let’s start our journey by understanding who we’re actually talking to here.

Understanding How LLMs Really Work¶

Large language models are, at their core, sophisticated prediction engines. They don’t “understand” language in the way you (the reader, assuming you are human) and I do. Instead, they’ve been trained on vast amounts of text to predict what word (or more precisely, what token) should come next in a sequence.

Here’s a simple way to think about it: imagine you’ve read millions of books, articles, and conversations. Someone starts a sentence with “The weather today is...” and your brain immediately starts generating possibilities: “nice,” “terrible,” “unpredictable,” “perfect for a walk.” Your brain is doing something similar to what an LLM does—drawing on patterns it has seen before to predict what comes next.

But there’s a big difference. When you predict what comes next, you’re drawing on genuine understanding of concepts, context, and causality. When an LLM does it, it’s performing an extraordinarily sophisticated pattern matching operation based on statistical regularities it learned during training.

This distinction matters because it shapes how we should interact with these systems. A prompt is not just a question—it’s a way of setting up the model’s prediction machinery to generate a specific kind of continuation.

Let’s see this in action with a simple example:

import ollama

client = ollama.Client(host='http://localhost:11434')

def call_ollama(prompt, model="llama3.2", **options):
    """
    Send a prompt to Ollama and get a response.
    
    Args:
        prompt: The text prompt to send
        model: Which model to use
        **options: Additional parameters (temperature, top_k, etc.)
    
    Returns:
        The model's response as a string
    """
    response = client.generate(
        model=model,
        prompt=prompt,
        options=options
    )
    return response['response']

# Let's see how the model completes different prompts
prompts = [
    "The weather today is",
    "In my professional meteorological opinion, the weather today is",
    "WEATHER ALERT: Today's conditions are"
]

for prompt in prompts:
    response = call_ollama(prompt, temperature=0.7, num_predict=20)
    print(f"Prompt: {prompt}")
    print(f"Completion: {response}\n")

Prompt: The weather today is
Completion: I don't have real-time access to current weather conditions. However, I can suggest ways for you

Prompt: In my professional meteorological opinion, the weather today is
Completion: I'd love to hear your professional meteorological assessment of the current weather conditions. Please go ahead and

Prompt: WEATHER ALERT: Today's conditions are
Completion: ...EXTREME! Unfortunately, I don't have the most up-to-date information on today's

Notice how each prompt “sets up” the model differently. The first is neutral. The second implies we want a formal, expert opinion. The third suggests urgency and official information. The model responds to these cues because its training has taught it that certain language patterns typically follow others.

This is your first lesson in prompt engineering: the prompt is context. You’re not just asking a question—you’re creating a linguistic environment that shapes what the model predicts should come next.

The Control Panel: Model Parameters¶

Before we dive deeper into prompt design, we need to understand the knobs and dials we can turn to control the model’s behavior. Think of these as the difference between asking someone to “suggest a restaurant” versus “list every restaurant in town alphabetically.” The question is similar, but you want very different kinds of responses.

Temperature: Creativity vs. Consistency¶

Temperature controls how random or deterministic the model’s outputs are. It’s measured on a scale from 0.0 to 2.0 (though you’ll rarely use values above 1.5).

Temperature = 0.0: The model always picks the single most likely next token. Completely deterministic.
Temperature = 0.5: Balanced between likely choices and occasional surprises.
Temperature = 1.0: Full probability distribution—creative but coherent.
Temperature = 2.0: Nearly random selection—very creative but often nonsensical.

Here’s why this matters:

def temperature_experiment():
    """
    Demonstrate how temperature affects output consistency and creativity.
    """
    prompt = "Write a creative opening line for a sci-fi story:"
    
    temperatures = [0.0, 0.7, 1.5]
    
    for temp in temperatures:
        print(f"\n{'='*60}")
        print(f"Temperature: {temp}")
        print(f"{'='*60}")
        
        # Generate 3 responses at this temperature
        for i in range(3):
            response = call_ollama(
                prompt, 
                temperature=temp,
                num_predict=30
            )
            print(f"Attempt {i+1}: {response}")

temperature_experiment()


============================================================
Temperature: 0.0
============================================================
Attempt 1: "As the last star in the universe died, Captain Lyra Blackwood gazed out at the endless expanse of darkness, her ship's AI whisper
Attempt 2: "As the last star in the universe died, Captain Lyra Blackwood gazed out at the endless expanse of darkness, her ship's AI whisper
Attempt 3: "As the last star in the universe died, Captain Lyra Blackwood gazed out at the endless expanse of darkness, her ship's AI whisper

============================================================
Temperature: 0.7
============================================================
Attempt 1: "As the last remnants of sunlight faded from the ravaged horizon, Captain Jaxon's comms device crackled to life with an eerie transmission from the
Attempt 2: "As the last remnants of sunlight faded from the ravaged horizon, Captain Lyra Blackwood gazed out at the cosmos with eyes that had witnessed more
Attempt 3: As the last remnants of sunlight faded from the crimson horizon, Captain Lyra Blackwood's eyes locked onto the lone starship that had been drifting through

============================================================
Temperature: 1.5
============================================================
Attempt 1: "As the last star in the galaxy sputtered to life, Dr. Sophia Patel felt an eerie sense of déjà vu, as if she had lived
Attempt 2: As the last stars in the galaxy flickered out like embers from a dying fire, Captain Lyra Blackwood stood at the edge of the abyss
Attempt 3: "As the last remnants of sunlight faded from the horizon, a lone astronaut on Mars gazed out upon a world that was no longer home - and yet

When you run this, you’ll notice something fascinating:

At temperature 0.0, all three attempts produce identical output. The model is completely deterministic.
At 0.7, you get variety, but the responses feel coherent and reasonable.
At 1.5, you might get wild creativity—or occasionally, nonsense.

When to use different temperatures:

0.0 - 0.3: Code generation, factual answers, anything where consistency matters
0.5 - 0.8: General conversation, balanced creativity
0.9 - 1.5: Creative writing, brainstorming, exploring possibilities

Top-K and Top-P: Narrowing the Field¶

Temperature alone doesn’t give us complete control. We also need ways to limit which tokens the model considers at each step.

Top-K sampling limits the model to choosing from only the K most likely tokens:

def demonstrate_topk():
    """
    Show how top-k constrains token selection.
    """
    prompt = "The secret to great coffee is"
    
    # Very restrictive: only top 5 tokens considered
    response_narrow = call_ollama(
        prompt,
        temperature=0.8,
        top_k=5,
        num_predict=30
    )
    
    # More exploratory: top 50 tokens
    response_wide = call_ollama(
        prompt,
        temperature=0.8,
        top_k=50,
        num_predict=30
    )
    
    print("Top-K = 5 (Focused):")
    print(response_narrow)
    print("\nTop-K = 50 (Exploratory):")
    print(response_wide)

demonstrate_topk()

Top-K = 5 (Focused):
...a combination of several factors! Here are some secrets that can help you brew the perfect cup of coffee:

1. **High-quality beans**: Fresh

Top-K = 50 (Exploratory):
A topic of much debate! The secret to great coffee can vary depending on personal taste preferences, but here are some common factors that are often considered essential

Top-P sampling (also called nucleus sampling) is more sophisticated. Instead of a fixed number of tokens, it selects from the smallest set of tokens whose cumulative probability exceeds P:

def demonstrate_topp():
    """
    Show how top-p creates dynamic token sets.
    """
    prompt = "In conclusion, the most important factor is"
    
    # Conservative: only most likely tokens (50% probability mass)
    response_conservative = call_ollama(
        prompt,
        temperature=0.8,
        top_p=0.5,
        num_predict=30
    )
    
    # Exploratory: include less likely tokens (95% probability mass)
    response_exploratory = call_ollama(
        prompt,
        temperature=0.8,
        top_p=0.95,
        num_predict=30
    )
    
    print("Top-P = 0.5 (Conservative):")
    print(response_conservative)
    print("\nTop-P = 0.95 (Exploratory):")
    print(response_exploratory)

demonstrate_topp()

Top-P = 0.5 (Conservative):
...the ability to learn and adapt quickly in a rapidly changing environment. This skill is essential for success in today's fast-paced world, where technological advancements

Top-P = 0.95 (Exploratory):
...the ability to learn and adapt. In today's fast-paced and rapidly changing world, being able to quickly absorb new information, adjust to new situations

The key insight: Top-K gives you a fixed-size pool of options at each step. Top-P adapts the pool size based on how confident the model is. When the model is very sure (like completing “The capital of France is...”), top-p might only consider 2-3 tokens. When it’s less certain, it considers more options.

Combining Parameters: The Recipe for Success¶

These parameters interact in interesting ways. Ollama applies them in sequence:

Top-K filters down to the K most likely tokens
Top-P further filters based on cumulative probability
Temperature is applied to the remaining tokens to determine final selection

Here’s a practical guide for common scenarios:

def parameter_recipes():
    """
    Demonstrate parameter combinations for different use cases.
    """
    test_prompt = "Explain quantum entanglement"
    
    scenarios = {
        "Factual (code, documentation)": {
            "temperature": 0.1,
            "top_k": 20,
            "top_p": 0.5
        },
        "Balanced (general chat)": {
            "temperature": 0.7,
            "top_k": 40,
            "top_p": 0.9
        },
        "Creative (brainstorming)": {
            "temperature": 0.9,
            "top_k": 50,
            "top_p": 0.95
        },
        "Deterministic (testing)": {
            "temperature": 0.0,
            "top_k": 1,
            "top_p": 1.0
        }
    }
    
    for scenario, params in scenarios.items():
        print(f"\n{'='*60}")
        print(f"Scenario: {scenario}")
        print(f"Parameters: {params}")
        print(f"{'='*60}")
        
        response = call_ollama(
            test_prompt,
            num_predict=50,
            **params
        )
        print(response)

parameter_recipes()


============================================================
Scenario: Factual (code, documentation)
Parameters: {'temperature': 0.1, 'top_k': 20, 'top_p': 0.5}
============================================================
Quantum entanglement is a fundamental concept in quantum mechanics that describes the interconnectedness of two or more particles in such a way that their properties are correlated, regardless of the distance between them.

In classical physics, when two objects interact with each other

============================================================
Scenario: Balanced (general chat)
Parameters: {'temperature': 0.7, 'top_k': 40, 'top_p': 0.9}
============================================================
Quantum entanglement is a fundamental concept in quantum mechanics that describes the interconnectedness of particles at a subatomic level. It's a phenomenon where two or more particles become "entangled" in such a way that their properties, such as spin

============================================================
Scenario: Creative (brainstorming)
Parameters: {'temperature': 0.9, 'top_k': 50, 'top_p': 0.95}
============================================================
Quantum entanglement is a fundamental concept in quantum mechanics that describes the interconnectedness of subatomic particles. It's a phenomenon where two or more particles become correlated in such a way that the state of one particle is instantly affected by the state of

============================================================
Scenario: Deterministic (testing)
Parameters: {'temperature': 0.0, 'top_k': 1, 'top_p': 1.0}
============================================================
Quantum entanglement is a fundamental concept in quantum mechanics that describes the interconnectedness of two or more particles in such a way that their properties are correlated, regardless of the distance between them.

**What happens when particles become entangled?**

When

The Software Development Lifecycle: A Parameter Perspective¶

One of the most practical applications of understanding these parameters is knowing when to use which settings during software development. Different phases of the development lifecycle call for different levels of creativity and consistency.

Let me share a story. Last semester, one of my students—let’s call her Maya—was using an LLM to help build a Dart + Flutter mobile app. She was frustrated because the code the model generated during implementation kept changing every time she ran it. Meanwhile, when she asked it to brainstorm features, the responses felt stale and repetitive.

The problem? She was using the same parameters for everything: temperature 0.8, which is perfectly normal for general chat but suboptimal for specialized tasks.

Here’s a way to think about parameters across the development lifecycle:

def sdlc_parameter_guide():
    """
    Demonstrate optimal parameters for each SDLC phase.
    """
    phases = {
        "Requirements & Ideation": {
            "description": "Exploring possibilities, gathering creative solutions",
            "parameters": {"temperature": 0.9, "top_p": 0.95, "top_k": 50},
            "prompt": "Brainstorm 5 innovative features for a task management app"
        },
        "System Design": {
            "description": "Balance creativity with technical soundness",
            "parameters": {"temperature": 0.6, "top_p": 0.85, "top_k": 30},
            "prompt": "Suggest database schemas for a multi-tenant SaaS application"
        },
        "Implementation": {
            "description": "Precise, deterministic code generation",
            "parameters": {"temperature": 0.2, "top_p": 0.7, "top_k": 15},
            "prompt": "Write a Python function to validate email addresses with regex"
        },
        "Testing & QA": {
            "description": "Explore edge cases creatively",
            "parameters": {"temperature": 0.8, "top_p": 0.9, "top_k": 40},
            "prompt": "Generate 10 edge cases for testing a login function"
        },
        "Deployment": {
            "description": "Reliable, repeatable automation",
            "parameters": {"temperature": 0.1, "top_p": 0.6, "top_k": 10},
            "prompt": "Write a CI/CD pipeline configuration for GitHub Actions"
        }
    }
    
    for phase, config in phases.items():
        print(f"\n{'='*70}")
        print(f"PHASE: {phase}")
        print(f"Purpose: {config['description']}")
        print(f"Parameters: {config['parameters']}")
        print(f"{'='*70}")
        
        response = call_ollama(
            config['prompt'],
            num_predict=100,
            **config['parameters']
        )
        print(f"\nExample Output:\n{response}\n")

sdlc_parameter_guide()


======================================================================
PHASE: Requirements & Ideation
Purpose: Exploring possibilities, gathering creative solutions
Parameters: {'temperature': 0.9, 'top_p': 0.95, 'top_k': 50}
======================================================================

Example Output:
Here are five innovative feature ideas for a task management app:

1. **AI-Powered Task Prioritization**: Implement an AI-powered algorithm that analyzes the user's task list and suggests priority levels based on factors such as:
	* Deadline dates
	* Project deadlines
	* Task dependencies
	* User's personal productivity patterns
	* Real-time weather conditions (to consider potential traffic or commute impacts)
	* Sentiment analysis of emails, messages, or other communication related to the tasks


======================================================================
PHASE: System Design
Purpose: Balance creativity with technical soundness
Parameters: {'temperature': 0.6, 'top_p': 0.85, 'top_k': 30}
======================================================================

Example Output:
**Database Schema for Multi-Tenant SaaS Application**
=====================================================

A multi-tenant SaaS application requires a database schema that can accommodate multiple tenants, each with its own set of data. The schema should be designed to ensure data isolation and security.

**Table of Contents**
-----------------

1. [Overview](#overview)
2. [Database Schema](#database-schema)
3. [Tenant Table](#tenant-table)
4. [User Table](#user-table)
5. [Data


======================================================================
PHASE: Implementation
Purpose: Precise, deterministic code generation
Parameters: {'temperature': 0.2, 'top_p': 0.7, 'top_k': 15}
======================================================================

Example Output:
import re

def validate_email(email):
    """
    Validate an email address using regular expression.

    Args:
        email (str): The email address to be validated.

    Returns:
        bool: True if the email is valid, False otherwise.
    """

    # Regular expression pattern for validating email addresses
    pattern = r"^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$"

    # Check if


======================================================================
PHASE: Testing & QA
Purpose: Explore edge cases creatively
Parameters: {'temperature': 0.8, 'top_p': 0.9, 'top_k': 40}
======================================================================

Example Output:
Here are 10 edge cases that can be used to test a login function:

1. **Empty username or password**: Test the login function with an empty username and/or password to ensure it returns an error.

2. **Invalid email address**: Test the login function with an invalid email address, such as one without a "@" symbol or containing non-alphanumeric characters.

3. **Password too short or too long**: Test the login function with passwords that are either too short (less than 8 characters


======================================================================
PHASE: Deployment
Purpose: Reliable, repeatable automation
Parameters: {'temperature': 0.1, 'top_p': 0.6, 'top_k': 10}
======================================================================

Example Output:
Here's an example of a CI/CD pipeline configuration for GitHub Actions:

```yml
name: Build and Deploy

on:
  push:
    branches:
      - main

jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v2

      - name: Set up Node.js
        uses: actions/setup-node@v2
        with:
          node-version: '

The pattern should be clear: increase temperature and sampling diversity when you want exploration; decrease them when you want consistency.

This isn’t just academic—it has real implications for your work. Maya eventually adjusted her approach: high temperature (0.9) during brainstorming sessions, low temperature (0.1-0.2) during code generation, and back to high temperature (0.8) when generating test cases. Her productivity improved dramatically, and more importantly, she stopped fighting the tool.

The Prompt Itself: Zero-Shot, One-Shot, and Few-Shot Learning¶

Now that we understand how to control the way the model generates responses, let’s focus on controlling what it generates. This is where prompt engineering becomes truly powerful.

Zero-Shot Prompting: The Direct Approach¶

Zero-shot prompting means asking the model to perform a task without providing any examples. You rely entirely on the model’s training to understand what you want:

def zero_shot_classification():
    """
    Classify text using only instructions, no examples.
    """
    def classify_sentiment(review):
        prompt = f"""Classify this movie review as POSITIVE, NEGATIVE, or NEUTRAL.
Return only the classification label.

Review: {review}

Classification:"""
        
        return call_ollama(
            prompt,
            temperature=0.1,
            num_predict=10
        ).strip()
    
    # Test reviews
    reviews = [
        "This movie was absolutely amazing! Best film of the year!",
        "Terrible waste of time. I want my money back.",
        "It was okay. Nothing special but not bad either.",
        "A masterpiece of cinema.",
        "I fell asleep halfway through."
    ]
    
    print("Zero-Shot Sentiment Classification\n" + "="*50)
    for review in reviews:
        sentiment = classify_sentiment(review)
        print(f"\nReview: {review}")
        print(f"Sentiment: {sentiment}")

zero_shot_classification()

Zero-Shot Sentiment Classification
==================================================

Review: This movie was absolutely amazing! Best film of the year!
Sentiment: POSITIVE

Review: Terrible waste of time. I want my money back.
Sentiment: NEGATIVE

Review: It was okay. Nothing special but not bad either.
Sentiment: NEUTRAL

Review: A masterpiece of cinema.
Sentiment: POSITIVE

Review: I fell asleep halfway through.
Sentiment: NEGATIVE

Zero-shot works surprisingly well for many tasks because modern LLMs have been trained on such diverse data. But it has limitations:

Complex reasoning: Multi-step problems often fail
Specific formats: Getting exact JSON or structured output is unreliable
Domain knowledge: Specialized terminology may be misunderstood
Ambiguity: Unclear instructions lead to unpredictable results

One-Shot and Few-Shot: Teaching by Example¶

When zero-shot fails, we add examples. This is called few-shot learning—not because the model is learning in the traditional sense (its weights don’t change), but because it’s adapting its behavior based on the pattern you establish.

Here’s a one-shot example:

def one_shot_extraction():
    """
    Extract structured data using one example.
    """
    def extract_order(text):
        prompt = f"""Parse pizza orders into JSON format.

EXAMPLE:
Input: I want a small pizza with cheese and pepperoni.
Output: {{"size": "small", "toppings": ["cheese", "pepperoni"]}}

Now parse this:
Input: {text}
Output:"""
        
        return call_ollama(prompt, temperature=0.1, num_predict=100)
    
    orders = [
        "Large pizza with mushrooms and olives",
        "I'd like a medium with just cheese please",
        "Extra large with everything"
    ]
    
    print("One-Shot Order Parsing\n" + "="*50)
    for order in orders:
        result = extract_order(order)
        print(f"\nOrder: {order}")
        print(f"Parsed: {result}")

one_shot_extraction()

One-Shot Order Parsing
==================================================

Order: Large pizza with mushrooms and olives
Parsed: Here's how you can do it in Python:

```python
def parse_pizza_order(order):
    # Split the order into words
    words = order.split()

    # Initialize variables to store size, toppings, and sauce
    size = None
    toppings = []
    has_sauce = False

    # Iterate over each word in the order
    for i, word in enumerate(words):
        # Check if it's a size
        if word.lower() == "large"

Order: I'd like a medium with just cheese please
Parsed: Here is the Python code to parse the pizza orders into JSON format:

```python
import json

def parse_pizza_order(order):
    # Remove leading/trailing whitespace and convert to lowercase
    order = order.strip().lower()
    
    # Split the order into words
    words = order.split()
    
    # Initialize variables for size, toppings, and special instructions
    size = None
    toppings = []
    special_instructions = None
    
    # Iterate over each word in the order

Order: Extra large with everything
Parsed: Here's how you can parse the input into JSON format using Python:

```python
import re

def parse_pizza_order(order):
    # Define a dictionary to map pizza sizes and toppings
    size_map = {
        "small": ["cheese", "pepperoni"],
        "medium": ["mushrooms", "onions", "green peppers"],
        "large": ["sauces", "meatballs", "extra cheese"],
        "x-large": ["anchovies

And here’s a more powerful few-shot version:

def few_shot_classification():
    """
    Classify emails using multiple examples to establish pattern.
    """
    def classify_email(email_body):
        prompt = f"""Classify emails as SPAM, IMPORTANT, or NORMAL.

Example 1:
Email: "Congratulations! You've won $1,000,000! Click here now!"
Classification: SPAM

Example 2:
Email: "Meeting with CEO rescheduled to tomorrow 9am. Please confirm."
Classification: IMPORTANT

Example 3:
Email: "Weekly newsletter: Here are this week's top articles."
Classification: NORMAL

Example 4:
Email: "Your account will be closed unless you verify within 24 hours!"
Classification: SPAM

Example 5:
Email: "Board meeting agenda attached. Review before Friday."
Classification: IMPORTANT

Now classify:
Email: {email_body}
Classification:"""
        
        return call_ollama(prompt, temperature=0.1, num_predict=10).strip()
    
    test_emails = [
        "URGENT: Limited time offer! Buy now!",
        "Q4 financial results ready for your review. Call me.",
        "Thanks for subscribing to our blog updates.",
        "Your package has been shipped and will arrive Tuesday.",
        "You are a winner! Claim your free iPhone now!"
    ]
    
    print("Few-Shot Email Classification\n" + "="*50)
    for email in test_emails:
        classification = classify_email(email)
        print(f"\nEmail: {email[:60]}...")
        print(f"Classification: {classification}")

few_shot_classification()

Few-Shot Email Classification
==================================================

Email: URGENT: Limited time offer! Buy now!...
Classification: SPAM

Email: Q4 financial results ready for your review. Call me....
Classification: Based on the content of the email, I would

Email: Thanks for subscribing to our blog updates....
Classification: Based on the content of the email, I would

Email: Your package has been shipped and will arrive Tuesday....
Classification: Based on the content of the email, I would

Email: You are a winner! Claim your free iPhone now!...
Classification: SPAM

Key principles for few-shot prompting:

3-6 examples is the sweet spot: Too few and the pattern isn’t clear; too many and you waste context window space.
Diversity matters: Your examples should cover the range of inputs you expect. Don’t use five examples that are all basically the same.
Quality over quantity: One excellent, clear example is worth three mediocre ones.
Order can matter: Some models are sensitive to example order, though this varies.

Role Playing and System Prompts¶

One of the most powerful techniques in prompt engineering is giving the model a role or persona. This isn’t just theatrical—it’s a way of activating different patterns in the model’s training data.

The Power of System Prompts¶

In most modern LLM APIs, prompts are structured as conversations with different roles:

system: Sets overall behavior and constraints
user: Represents the human’s input
assistant: The model’s previous responses (for context)

The system prompt is particularly powerful because it establishes the framing for the entire conversation:

def chat_ollama(messages, model="llama3.2", **options):
    """
    Send a chat-formatted conversation to Ollama.
    """
    response = client.chat(
        model=model,
        messages=messages,
        options=options
    )
    return response['message']['content']

def demonstrate_system_prompts():
    """
    Show how system prompts change model behavior.
    """
    question = "Explain how neural networks learn."
    
    scenarios = [
        {
            "name": "No System Prompt",
            "system": None,
            "description": "Baseline response"
        },
        {
            "name": "Concise Expert",
            "system": "You are a concise technical expert. Maximum 3 sentences.",
            "description": "Brief, dense explanation"
        },
        {
            "name": "Patient Teacher",
            "system": "You are a patient teacher explaining to someone new to the field. Use analogies and simple language.",
            "description": "Accessible explanation"
        },
        {
            "name": "Socratic Questioner",
            "system": "You answer questions by asking clarifying questions to help the user think through the problem themselves.",
            "description": "Guides rather than tells"
        }
    ]
    
    for scenario in scenarios:
        print(f"\n{'='*70}")
        print(f"Scenario: {scenario['name']}")
        print(f"Goal: {scenario['description']}")
        print(f"{'='*70}\n")
        
        messages = []
        
        if scenario['system']:
            messages.append({
                'role': 'system',
                'content': scenario['system']
            })
        
        messages.append({
            'role': 'user',
            'content': question
        })
        
        response = chat_ollama(messages, temperature=0.7)
        print(response)

demonstrate_system_prompts()


======================================================================
Scenario: No System Prompt
Goal: Baseline response
======================================================================

Neural networks are a type of machine learning model that is inspired by the structure and function of the human brain. They consist of layers of interconnected nodes or "neurons" that process and transmit information.

Here's a simplified explanation of how neural networks learn:

**Step 1: Input Data**

The network receives input data, which can be images, text, audio, or any other type of data that can be represented as a numerical vector. The data is typically pre-processed to prepare it for use in the network.

**Step 2: Forward Propagation**

The input data flows through the network, layer by layer, with each node performing a computation on the incoming data. This process is called forward propagation. Each node applies an activation function to its output, which determines whether the node fires (produces an output) or remains dormant.

**Step 3: Error Calculation**

Once the output from the final layer has been computed, the network calculates the error between its predicted output and the actual output. This is done using a loss function, such as mean squared error or cross-entropy.

**Step 4: Backward Propagation**

The error is then propagated backward through the network, layer by layer, with each node adjusting its weights (connections to other nodes) based on the error gradient. The goal is to minimize the overall loss function.

**Step 5: Weight Update**

Each weight is updated based on the error gradient and a learning rate, which determines how quickly the weights are adjusted. The update rule for each weight involves multiplying it by the learning rate and adding a correction term that depends on the error gradient.

**Step 6: Optimization**

The process of updating weights is repeated multiple times, with each iteration involving forward propagation, error calculation, backward propagation, and weight updates. This process is called an optimization loop.

**Key Concepts**

Several key concepts are crucial to understanding how neural networks learn:

1. **Activation Functions**: These functions introduce non-linearity into the network, allowing it to learn complex relationships between inputs and outputs.
2. **Backpropagation**: This is the algorithm used to update weights based on error gradients.
3. **Optimization Algorithms**: Such as stochastic gradient descent (SGD) or Adam, which control the learning rate and optimization process.

**Learning Process**

The learning process can be described as follows:

1. **Initialization**: The network is initialized with random weights and biases.
2. **Forward Pass**: The input data flows through the network, computing outputs at each layer.
3. **Error Calculation**: The error between predicted output and actual output is calculated using a loss function.
4. **Backward Pass**: The error gradient is computed and propagated backward through the network to update weights.
5. **Weight Update**: Weights are updated based on the error gradient and learning rate.

This process continues until convergence, where the network has learned to minimize the loss function and make accurate predictions on new, unseen data.

======================================================================
Scenario: Concise Expert
Goal: Brief, dense explanation
======================================================================

Neural networks learn through a process called backpropagation, where the network is trained on a dataset to minimize the error between predicted outputs and actual targets. During training, the network adjusts its weights and biases to optimize the prediction, with errors being propagated backwards through the network to update each layer's parameters. This iterative process converges on a set of optimal weights that enable the network to accurately predict future inputs.

======================================================================
Scenario: Patient Teacher
Goal: Accessible explanation
======================================================================

Let's talk about neural networks in a way that's easy to understand.

Imagine you're trying to recognize different types of animals, like dogs and cats. You want your system to be able to look at a picture of a dog and say "that's a dog!" or a cat and say "that's a cat!".

A traditional approach would be to show the system a huge list of pictures, one by one, and tell it which category each picture belongs to (dog or cat). But this can get tedious and expensive.

Neural networks are like super-smart, pattern-recognition monkeys. They learn from examples just like you do in school, but instead of memorizing facts, they figure out how to recognize patterns on their own.

Here's how it works:

1. **Training data**: We show the neural network a huge collection of pictures, each labeled as either "dog" or "cat". This is called the training data.
2. **Network building blocks**: Inside the neural network, there are many tiny units called neurons (like little brain cells). Each neuron looks at a small part of an image, like just one feature (like "is this picture showing a dog's face?").
3. **Connection-making**: These tiny neurons connect to other neurons in the network, forming a web-like structure. This connection is crucial – it allows the network to share information and learn together.
4. **Error correction**: When we show the neural network an image labeled as "dog", but it makes a mistake and says "cat" instead, that's called an error. The neural network uses this error to adjust its connections between neurons, making them stronger if they got it right, or weaker if they made a mistake.
5. **Learning loop**: As we repeat the process of showing the neural network many examples (training data), it refines its connections and becomes better at recognizing patterns.

Think of it like playing a game where you try to find matching cards. Initially, your brain might not be able to distinguish between similar-looking cards. But with practice and repetition, you develop patterns and learn to recognize the differences – that's what neural networks do too!

Over time, as more examples are added (training data), the neural network becomes incredibly skilled at recognizing different animals based on their features. This is how they "learn" to make predictions about new images it sees.

Does this explanation help clarify how neural networks learn?

======================================================================
Scenario: Socratic Questioner
Goal: Guides rather than tells
======================================================================

To explain how neural networks learn, let's break it down step by step.

Neural networks are inspired by the human brain and are composed of multiple layers of interconnected nodes (or "neurons") that process inputs and produce outputs. So, what do you think drives this learning process? Is it something related to the data itself or perhaps the way we're training these models?

In other words, what kind of changes occur in the neural network as it learns from its environment? Are they more like adjustments to individual weights and biases, or is there some sort of global reorganization happening at a higher level?

To help answer this question, let's consider how neural networks are typically trained. They're often trained using an optimization algorithm that minimizes the difference between predicted outputs and actual outputs (e.g., through mean squared error). But what exactly do we want these networks to minimize? Is it just the raw error or is there a way to incorporate some sort of feedback loop or regularization mechanism?

By exploring these questions, I hope you can help shed light on how neural networks learn. What are your thoughts on this topic?

Role-Based Prompting for Domain Expertise¶

You can use roles to activate domain-specific knowledge. This technique is incredibly useful when you need specialized perspectives on a problem. The model has been trained on text from many different domains, and role prompts help surface relevant patterns.

def role_based_consultation():
    """
    Get perspectives from different professional roles.
    """
    def consult_expert(question, role, expertise):
        messages = [
            {
                'role': 'system',
                'content': f"You are a {role} with deep expertise in {expertise}. Provide advice from your professional perspective."
            },
            {
                'role': 'user',
                'content': question
            }
        ]
        return chat_ollama(messages, temperature=0.6)
    
    problem = "Our web application is slow. How should we diagnose and fix it?"
    
    experts = [
        ("Database Administrator", "query optimization and indexing"),
        ("Frontend Developer", "client-side performance and rendering"),
        ("DevOps Engineer", "infrastructure and scaling")
    ]
    
    print(f"Problem: {problem}\n")
    print("="*70)
    
    for role, expertise in experts:
        print(f"\nConsulting: {role} ({expertise})")
        print("-"*70)
        advice = consult_expert(problem, role, expertise)
        print(advice)
        print()

role_based_consultation()

Problem: Our web application is slow. How should we diagnose and fix it?

======================================================================

Consulting: Database Administrator (query optimization and indexing)
----------------------------------------------------------------------
Diagnosing and fixing slow performance issues in a web application can be a challenging task, but I'll provide a step-by-step approach to help you identify the root cause and implement effective solutions.

**Phase 1: Gathering Information**

1. **Collect metrics**: Set up monitoring tools like New Relic, Datadog, or Prometheus to collect data on CPU usage, memory usage, disk I/O, network traffic, and response times.
2. **Identify slow queries**: Use the database's built-in query analysis tool (e.g., SQL Server's Query Store) or a third-party tool like Query Profiler or ExPLAIN to analyze slow-running queries.
3. **Gather user feedback**: Ask users about their experiences with your application, including any specific issues they've encountered.

**Phase 2: Diagnosing the Root Cause**

1. **Analyze query performance**: Use the data collected in Phase 1 to identify slow-running queries. Look for queries that are:
	* High-impact (i.e., affecting a large number of users).
	* Resource-intensive (e.g., high CPU or memory usage).
2. **Examine database statistics**: Check database statistics like index coverage, table fragmentation, and row count.
3. **Investigate indexing issues**: Verify that indexes are correctly created and maintained for frequently accessed data.

**Phase 3: Optimizing Queries**

1. **Optimize queries using SQL techniques**:
	* Use efficient query structures (e.g., JOINs instead of subqueries).
	* Apply indexing, caching, or replication as needed.
	* Avoid correlated subqueries.
2. **Leverage database features**: Utilize database features like:
	* Materialized views for pre-computed results.
	* Common Table Expressions (CTEs) for complex queries.
3. **Consider query rewriting**:
	* Use query rewriting tools or languages like SQL Server's Query Rewrite feature.

**Phase 4: Enhancing System Performance**

1. **Optimize database configuration**: Adjust database settings, such as buffer pool size, log file size, and connection timeouts.
2. **Upgrade hardware**: Consider upgrading server resources (e.g., CPU, RAM, storage) to improve overall performance.
3. **Implement caching mechanisms**:
	* Use caching libraries or frameworks like Redis or Memcached.
	* Implement cache invalidation strategies.

**Phase 5: Testing and Refining**

1. **Test optimized queries**: Verify that the optimizations have improved query performance.
2. **Refine and iterate**: Continuously monitor performance and refine your approach as needed.

Additional tips:

* Regularly review and maintain database statistics to ensure optimal indexing and query performance.
* Use profiling tools to identify slow-running code paths in your application.
* Consider implementing a content delivery network (CDN) or edge caching to reduce latency for static resources.

By following these steps, you'll be well on your way to diagnosing and fixing slow performance issues in your web application.


Consulting: Frontend Developer (client-side performance and rendering)
----------------------------------------------------------------------
Diagnosing and fixing a slow web application requires a structured approach to identify the root cause of the issue. Here's a step-by-step guide to help you diagnose and fix performance-related issues:

**Diagnosis**

1. **Gather metrics**: Collect data on your application's performance using tools like:
	* Google Analytics (e.g., page load time, bounce rate)
	* New Relic or similar monitoring tools
	* Browser developer tools (e.g., Chrome DevTools, Firefox Developer Edition)
2. **Identify slow components**: Analyze the metrics to pinpoint specific areas of the application that are causing delays.
3. **Use browser performance profiling**:
	* Record a performance profile using Chrome DevTools or Firefox Developer Edition
	* Run the profile in "Harmony" mode (for Chrome) or "Bottleneck" mode (for Firefox)
	* Identify the most time-consuming components, such as JavaScript files, images, or network requests
4. **Inspect the DOM**: Use browser developer tools to inspect the Document Object Model (DOM) and identify potential bottlenecks:
	* Look for slow CSS animations, transitions, or layout calculations
	* Check for excessive DOM mutations or unnecessary DOM updates

**Fixing Performance Issues**

1. **Optimize images**:
	* Compress images using tools like ImageOptim or ShortPixel
	* Use image formats that are suitable for web use (e.g., WebP, PNG)
2. **Minify and compress JavaScript and CSS files**:
	* Use tools like Gzip, Brotli, or a minifier to reduce file size
	* Remove unnecessary code, comments, or whitespace
3. **Use caching and lazy loading**:
	* Implement caching mechanisms for frequently accessed resources (e.g., images, stylesheets)
	* Use lazy loading to load non-essential content only when needed
4. **Improve network performance**:
	* Optimize server-side rendering and API responses
	* Reduce the number of HTTP requests by combining files or using a CDN
5. **Use JavaScript optimizations**:
	* Minimize DOM mutations by using a virtual DOM or a library like React or Vue.js
	* Use async/await, promises, or Web Workers to offload computationally expensive tasks
6. **Profile and optimize critical sections of code**:
	* Identify performance-critical areas using browser profiling tools
	* Optimize these areas using techniques like memoization, caching, or parallel processing

**Additional Tips**

1. **Test on different devices and browsers**: Ensure your application works well across various devices and browsers.
2. **Monitor performance regularly**: Continuously monitor your application's performance to identify ongoing issues.
3. **Use a Content Delivery Network (CDN)**: Consider using a CDN to distribute static assets and reduce latency.

By following this structured approach, you should be able to diagnose and fix performance-related issues in your web application.


Consulting: DevOps Engineer (infrastructure and scaling)
----------------------------------------------------------------------
Diagnosing and fixing performance issues like slow web applications requires a structured approach. Here's a step-by-step guide to help you get started:

**Diagnosis**

1. **Gather metrics**: Collect relevant performance metrics, such as:
	* Response time (e.g., average response time, 50th percentile response time)
	* Throughput (e.g., requests per second, users per minute)
	* Error rates
	* Resource utilization (e.g., CPU, memory, disk usage)
2. **Identify bottlenecks**: Use tools like:
	* New Relic
	* Datadog
	* Prometheus
	* Grafana
	* Performance monitoring dashboards (e.g., AWS CloudWatch, Google Cloud Logging)
3. **Analyze user feedback**: Collect and analyze user feedback through:
	* User surveys
	* Crash reports
	* Error logs
4. **Run load testing**: Perform load testing using tools like:
	* Apache JMeter
	* Gatling
	* Locust
5. **Monitor system performance**: Regularly check system performance metrics, such as:
	* CPU usage
	* Memory usage
	* Disk space

**Fixing**

1. **Optimize database queries**: Review and optimize database queries to reduce the load on your application.
2. **Improve caching mechanisms**: Implement effective caching strategies, such as:
	* Cache invalidation
	* Cache expiration
3. **Leverage content delivery networks (CDNs)**: Use CDNs to distribute static assets and reduce latency.
4. **Optimize server configuration**: Review server configuration settings, such as:
	* CPU frequency scaling
	* Memory allocation
5. **Upgrade infrastructure**: Consider upgrading your infrastructure to better handle increased traffic, such as:
	* Upgrading servers or adding more servers
	* Implementing load balancers
6. **Implement serverless architecture**: Consider migrating to a serverless architecture to reduce costs and improve scalability.
7. **Optimize image compression**: Optimize images for faster loading times using tools like:
	* ImageOptim
8. **Use a reverse proxy**: Use a reverse proxy (e.g., NGINX, Apache) to distribute traffic and improve performance.

**Scaling**

1. **Design for scalability**: Plan your application's architecture to scale horizontally.
2. **Use auto-scaling**: Implement auto-scaling using tools like:
	* AWS Auto Scaling
	* Google Cloud Autoscaling
3. **Implement load balancing**: Use load balancers to distribute traffic across multiple servers.
4. **Monitor performance metrics**: Continuously monitor performance metrics and adjust scaling strategies as needed.

**Best Practices**

1. **Use A/B testing**: Regularly test different configurations to find the best approach for your application.
2. **Continuously integrate and deploy**: Implement continuous integration and deployment (CI/CD) pipelines to ensure smooth deployments and rapid feedback loops.
3. **Monitor performance regularly**: Schedule regular performance monitoring sessions to catch issues before they become major problems.

By following these steps, you'll be well on your way to diagnosing and fixing performance issues in your web application.

Chain of Thought: Teaching the Model to Reason¶

Here’s where things get really interesting. One of the most significant discoveries in prompt engineering is that you can dramatically improve model performance on reasoning tasks by asking it to “show its work.”

This technique is called Chain of Thought (CoT) prompting.

Zero-Shot Chain of Thought¶

The simplest version is almost embarrassingly effective. Just add “Let’s think step by step” to your prompt:

def zero_shot_cot():
    """
    Demonstrate zero-shot Chain of Thought reasoning.
    """
    problem = """When I was 6 years old, my sister was half my age.
Now I'm 70 years old. How old is my sister?"""
    
    # Without CoT
    print("WITHOUT Chain of Thought:")
    print("-"*60)
    response = call_ollama(problem, temperature=0.0, num_predict=100)
    print(response)
    
    # With CoT
    print("\n\nWITH Chain of Thought:")
    print("-"*60)
    cot_prompt = f"{problem}\n\nLet's think step by step:"
    response = call_ollama(cot_prompt, temperature=0.0, num_predict=150)
    print(response)

zero_shot_cot()

WITHOUT Chain of Thought:
------------------------------------------------------------
Let's break it down step by step:

1. When you were 6 years old, your sister was half your age, which means she was 3 years old (since 6 / 2 = 3).
2. Now, you are 70 years old.
3. Since the age difference between you and your sister remains constant over time, we can set up an equation to find her current age.

Let x be your sister's current age. Then, the age difference between you


WITH Chain of Thought:
------------------------------------------------------------
To solve this problem, let's break it down step by step.

Step 1: When you were 6 years old, your sister was half your age. This means that when you were 6, your sister was 6 / 2 = 3 years old.

Step 2: Now, we need to find out how many years have passed since you were 6 years old. Since you are now 70 years old, the number of years that have passed is 70 - 6 = 64 years.

Step 3: Since your sister was 3 years old when you were 6, and 64 years have passed since then, we need to add these 64 years to your sister's age at that time

Why does this work? When you ask the model to think step-by-step, you’re actually asking it to generate intermediate reasoning tokens before the final answer. This changes the computational path the model takes through the problem. Without CoT, the model tries to jump directly from question to answer—which works for simple problems but fails for complex reasoning. With CoT, it generates a sequence of reasoning steps, and each step provides context that helps generate the next step.

You’ll also have noticed that the model does the same thing for this particular problem: even without being prompted to reason “step by step” it almost prompts itself to apply COT for a certain class of problems. This is actually the case for many modern high-quality models.

When Chain of Thought Fails: The Limits of Reasoning¶

It’s crucial to understand that CoT isn’t magic. LLMs still don’t “think” or “reason” in the way humans do. They’re generating text that looks like reasoning, which often leads to correct answers, but can also lead to confident-sounding nonsense.

Let me show you a famous failure case:

def cot_failure_example():
    """
    Demonstrate where CoT reasoning can fail.
    """
    # Modified river crossing problem
    problem = """A farmer is on one side of a river with a wolf, a goat, and a cabbage.
When crossing in a boat, he can only take one item at a time.
The wolf will eat the goat if left alone together.
The goat will eat the cabbage if left alone together.

How can the farmer transport the goat across the river without it being eaten?

Let's think through this step by step:"""
    
    response = call_ollama(problem, temperature=0.5, num_predict=300)
    print("CoT Response to Modified Problem:")
    print("="*60)
    print(response)
    print("\n" + "="*60)
    print("Analysis: The problem only asks about transporting the GOAT,")
    print("not all three items. The correct answer is simply:")
    print("'Put the goat in the boat and transport it across.'")
    print("\nBut the model may provide an unnecessarily complex solution")
    print("because it pattern-matches to the classic river-crossing puzzle.")

cot_failure_example()

CoT Response to Modified Problem:
============================================================
To solve this problem, we'll go through the steps:

1. The farmer takes the goat across the river first and leaves it on the other side.
2. The farmer goes back to the original side and picks up the cabbage.
3. The farmer takes the cabbage across the river but leaves it on the other side with the goat. This is safe because the wolf can't be left alone with the goat, so the goat won't eat the cabbage.

Now we have the goat and the cabbage on the other side of the river:

4. The farmer goes back to the original side and picks up the goat.
5. He takes the goat back to the original side (because now he has the wolf with him) and leaves it there, but this time, he also leaves the goat alone with the wolf is not an option, so he must leave the goat on the other side of the river.
6. The farmer picks up the cabbage from the other side and takes it back to the original side. 
7. He then leaves the cabbage on the original side and goes back to pick up the goat again.

Now we have:

* The farmer on the original side with the wolf
* The goat on the other side of the river

8. The farmer picks up the goat from the other side, takes it across the river, but this time leaves it on the other side and then goes back to pick up the cabbage.
9. He then leaves the cabbage on the other side

============================================================
Analysis: The problem only asks about transporting the GOAT,
not all three items. The correct answer is simply:
'Put the goat in the boat and transport it across.'

But the model may provide an unnecessarily complex solution
because it pattern-matches to the classic river-crossing puzzle.

This example illustrates a fundamental limitation: LLMs can be led astray by misguided attention. They pattern-match to familiar problems even when the actual problem is different. The model has seen many river-crossing puzzles in its training data, so it activates those patterns even though the question is much simpler.

This is why careful prompt engineering matters, and why you should never trust an LLM’s output without verification, especially for critical applications.

Conclusion¶

Prompt Engineering is a new field where creativity and technical skill combine in fascinating ways.

The prompts you write today might seem crude in a year. But they’re part of learning a new way to work with computers—not through rigid programming languages, but through natural language that guides intelligent systems.