Going Beyond Words¶
Picture yourself sitting in a design review, and your colleague pulls up a whiteboard sketch of a new architecture. She draws boxes and arrows, explains data flows, points at components. Nobody asks her to write it all out in a detailed text document first—the drawing is the explanation. That’s because humans are fundamentally multimodal creatures. We think in pictures, diagrams, sounds, and gestures as much as we think in words.
For decades, though, our programming tools have been stubbornly unimodal. We type text into editors, read text in documentation, and describe bugs in lengthy paragraphs. But what if our AI assistants could work the way we actually work? What if you could show them a screenshot of a broken layout, a photo of a whiteboard diagram, or a Figma mockup—and they’d just... understand?
Welcome to multimodal prompting: the practice of giving AI systems more than just text. In this chapter, we’ll explore how combining images, diagrams, and other visual context with our prompts transforms AI from a clever autocomplete tool into a genuine engineering collaborator.
What Makes Prompting “Multimodal”?¶
Multimodal AI processes multiple types of input together—typically text and images, though audio and video are becoming increasingly common. In software engineering, this matters because:
Design is visual - UI mockups, architecture diagrams, and wireframes are the language of intent
Debugging is visual - Stack traces, error screenshots, and console outputs tell stories
Systems are visual - We diagram our architectures because boxes and arrows communicate structure better than prose
Context is visual - A screenshot of a bug beats a thousand-word description
Let’s start with something practical. Here’s how you’d set up a basic multimodal interaction with Ollama using Python:
import requests
from ollama import Client
# Connect to your Ollama server
client = Client(host='localhost:11434')
def analyze_image(image_url, prompt, model="gemma3:latest", temperature=0.3):
"""
Analyze an image with a text prompt using Ollama's vision capabilities.
Args:
image_url: URL of the image to analyze
prompt: Your question or instruction about the image
model: Vision-capable model (gemma3:latest supports vision)
temperature: Creativity level (0.0 = focused, 1.0 = creative)
"""
# Download the image
headers = {'User-Agent': 'Mozilla/5.0'}
response = requests.get(image_url, headers=headers)
response.raise_for_status()
# Send to Ollama with both text and image
result = client.chat(
model=model,
messages=[{
'role': 'user',
'content': prompt,
'images': [response.content] # Image as bytes
}],
options={'temperature': temperature}
)
    return result['message']['content']

This function is deceptively simple, but it’s doing something profound: it’s giving the AI eyes. Let’s put it to work.
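One practical aside before we do: the helper above fetches its image over HTTP, but in day-to-day work your screenshot is often a local file. The Ollama client’s images list also accepts file paths (or raw bytes), so a local-file variant might look like the sketch below (the function name is ours, not part of any library):

def analyze_image_file(image_path, prompt, model="gemma3:latest", temperature=0.3):
    """Analyze a local image file instead of downloading from a URL."""
    result = client.chat(
        model=model,
        messages=[{
            'role': 'user',
            'content': prompt,
            'images': [image_path]  # A file path; raw bytes also work
        }],
        options={'temperature': temperature}
    )
    return result['message']['content']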
Use Case 1: From Mockup to Code in Seconds¶
Frontend development has always had a translation problem. A designer creates a beautiful mockup in Figma. A developer looks at it and mentally translates it into HTML, CSS, and JavaScript. That translation introduces errors, misunderstandings, and delays.
Multimodal prompting short-circuits this process.
Example: Building a Login Form from a Design¶
Imagine you have a login-form mockup hosted at a URL. Here’s how you’d generate the code:
def mockup_to_code(image_url, framework="react"):
"""Convert a UI mockup into working code."""
prompt = f"""Analyze this UI mockup and generate {framework} code.
Include:
- Semantic HTML structure
- Tailwind CSS for styling
- Proper form validation
- Accessible labels and ARIA attributes
Keep the code clean and production-ready."""
return analyze_image(image_url, prompt, temperature=0.2)
# Usage
mockup_url = "https://example.com/login-mockup.png"
code = mockup_to_code(mockup_url, framework="react")
print(code)

Output (abbreviated):
export default function LoginForm() {
return (
<div className="min-h-screen flex items-center justify-center bg-gray-50">
<div className="max-w-md w-full space-y-8 p-8 bg-white rounded-lg shadow">
<h2 className="text-3xl font-bold text-center text-gray-900">
Sign In
</h2>
<form className="space-y-6">
<div>
<label htmlFor="email" className="block text-sm font-medium">
Email address
</label>
<input
id="email"
type="email"
required
className="mt-1 block w-full px-3 py-2 border rounded-md"
/>
</div>
{/* ... more fields ... */}
</form>
</div>
</div>
);
}

Notice what happened: the model looked at the visual layout, identified components (heading, form fields, button), understood the spacing and color scheme, and translated it all into working code. No detailed text description needed.
Refining with Visual Feedback¶
But here’s where it gets really powerful. Say you render that code and notice the spacing is off. Instead of describing the problem in text, you can screenshot the result and ask:
def compare_design_to_implementation(mockup_url, screenshot_url):
"""Find differences between design and implementation."""
prompt = """Compare these two images:
Image 1: Original design mockup
Image 2: Current implementation
List specific differences:
- Layout and spacing issues
- Color mismatches
- Typography problems
- Alignment issues
Be precise and actionable."""
# Note: This would require sending both images
# For now, we'll analyze them separately
mockup_analysis = analyze_image(mockup_url, "Describe layout in detail")
impl_analysis = analyze_image(screenshot_url, "Describe layout in detail")
return f"Mockup: {mockup_analysis}\n\nImplementation: {impl_analysis}"This creates a tight feedback loop: design → code → screenshot → refinement. It’s how professionals actually work, now accelerated by AI.
Use Case 2: Debugging with Visual Evidence¶
Here’s a truth about debugging: showing beats telling. A Stack Overflow question that includes a screenshot of the error tends to get answered faster than one with only a prose description. Vision-capable AI brings the same principle to your development workflow.
Diagnosing a Runtime Error¶
def debug_from_screenshot(error_screenshot_url, code_snippet):
"""Analyze an error screenshot to diagnose the issue."""
prompt = f"""This screenshot shows a runtime error.
Context code:
```
{code_snippet}
```
Please:
1. Identify the error type and message
2. Explain the likely root cause
3. Suggest specific fixes
4. Mention any environment or dependency issues
Be precise and solution-oriented."""
return analyze_image(error_screenshot_url, prompt, temperature=0.1)
# Example usage
error_url = "https://example.com/traceback-screenshot.png"
code = """
def process_data(items):
total = 0
for item in items:
total += item['value']
return total
"""
diagnosis = debug_from_screenshot(error_url, code)
print(diagnosis)

Typical output:
Error: KeyError: 'value'
Root cause: The code assumes every item in the list has a 'value' key,
but at least one dictionary is missing this key.
Fixes:
1. Add validation: if 'value' in item
2. Use .get() with default: item.get('value', 0)
3. Add try-except handling
Best solution for robustness:
```python
def process_data(items):
total = 0
for item in items:
total += item.get('value', 0)
    return total
```

This prevents the error while treating missing values as zero.
The model read the stack trace from the image, connected it to your code, and provided contextualized fixes. This is especially valuable for:
Build failures - Screenshot the CI pipeline error
CSS layout bugs - Show what's rendering wrong
Mobile responsiveness issues - Screenshots across devices
Configuration errors - Visual evidence of environment problems
Finding UI/UX Issues¶
Sometimes the bug isn't a crash—it's just "something looks wrong."
def analyze_ui_issues(screenshot_url):
"""Identify potential UI/UX problems."""
prompt = """Analyze this UI screenshot for potential issues:
1. Accessibility concerns (contrast, text size, labels)
2. Layout problems (alignment, spacing, responsiveness)
3. Visual hierarchy issues
4. Interactive element visibility
5. Error state handling
Rate severity (Low/Medium/High) and suggest fixes."""
    return analyze_image(screenshot_url, prompt, temperature=0.3)

This turns subjective “it doesn’t feel right” into actionable feedback.
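To push further from feedback toward automation, you can ask for a structured response and filter it in code. Here is a sketch that assumes the model returns valid JSON, which you should verify rather than trust:

import json

def high_severity_ui_issues(screenshot_url):
    """Request UI issues as JSON and keep only the high-severity ones."""
    prompt = """Analyze this UI screenshot for accessibility, layout, and visual hierarchy issues.
Respond ONLY with a JSON array of objects: {"issue": str, "severity": "Low"|"Medium"|"High", "fix": str}."""
    raw = analyze_image(screenshot_url, prompt, temperature=0.1)
    try:
        issues = json.loads(raw)
    except json.JSONDecodeError:
        return raw  # Model didn't return clean JSON; fall back to the raw text
    return [i for i in issues if i.get('severity') == 'High']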
Use Case 3: Architecture and System Design¶
Software architecture is fundamentally visual. We draw boxes for services, arrows for data flow, clouds for external systems. These diagrams capture complexity that would take pages of text to describe.
From Diagram to Infrastructure¶
def diagram_to_terraform(architecture_diagram_url):
"""Convert an architecture diagram into Infrastructure-as-Code."""
prompt = """Analyze this system architecture diagram.
Generate Terraform configuration that includes:
- All services/components shown
- Network topology and connectivity
- Data flow paths
- Security groups and IAM roles
- Appropriate region and availability zone setup
Use AWS as the cloud provider.
Comment each resource with its purpose."""
return analyze_image(architecture_diagram_url, prompt, temperature=0.2)
# Example
diagram_url = "https://example.com/architecture-v2.png"
terraform = diagram_to_terraform(diagram_url)
# Save to file
with open('infrastructure.tf', 'w') as f:
    f.write(terraform)

Sample output:
# VPC for the application environment
resource "aws_vpc" "main" {
cidr_block = "10.0.0.0/16"
tags = {
Name = "main-vpc"
}
}
# Public subnet for load balancer
resource "aws_subnet" "public" {
vpc_id = aws_vpc.main.id
cidr_block = "10.0.1.0/24"
tags = {
Name = "public-subnet"
}
}
# Application load balancer (shown as entry point in diagram)
resource "aws_lb" "app" {
name = "app-lb"
internal = false
load_balancer_type = "application"
subnets = [aws_subnet.public.id]
}
# ... more resources based on diagram ...

Reviewing Architecture for Problems¶
But multimodal prompting isn’t just for generation—it’s excellent for analysis:
def review_architecture(diagram_url, concerns=None):
"""Analyze architecture diagram for potential issues."""
focus = concerns or [
"single points of failure",
"security vulnerabilities",
"scalability bottlenecks",
"cost optimization opportunities"
]
prompt = f"""Review this architecture diagram for:
{chr(10).join(f'- {c}' for c in focus)}
For each issue found:
1. Describe the problem
2. Explain the impact
3. Suggest specific improvements
Be thorough but practical."""
    return analyze_image(diagram_url, prompt, temperature=0.4)

This is like having a senior architect review your design—but instantly available.
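Because concerns is just a list of strings, you can point the review at whatever worries you most for a given system. A usage sketch (the URL is a placeholder):

report = review_architecture(
    "https://example.com/architecture-v2.png",
    concerns=[
        "data residency and compliance boundaries",
        "blast radius of a single service failure",
        "observability gaps (logging, metrics, tracing)"
    ]
)
print(report)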
Use Case 4: Documentation from Visual Artifacts¶
Good documentation is hard to write and harder to maintain. Multimodal prompting can help by converting visual artifacts into structured text.
Screenshot to API Documentation¶
def screenshot_to_docs(dashboard_screenshot_url):
"""Generate documentation from a dashboard screenshot."""
prompt = """This is a screenshot of an admin dashboard.
Generate comprehensive documentation:
1. Overview section describing the dashboard's purpose
2. Each UI element/section with:
- Name and function
- User actions available
- Data displayed
3. Workflow descriptions
4. Common tasks guide
Format in Markdown with appropriate headers."""
return analyze_image(dashboard_screenshot_url, prompt, temperature=0.3)
# Generate and save
docs = screenshot_to_docs("https://example.com/admin-panel.png")
with open('admin-dashboard-guide.md', 'w') as f:
    f.write(docs)

Output example:
# Admin Dashboard Guide
## Overview
The admin dashboard provides a centralized interface for system monitoring
and user management, featuring real-time metrics and administrative controls.
## Main Sections
### User Management Panel (Top Left)
- **Purpose**: View and manage registered users
- **Actions Available**:
- Search users by name or email
- Edit user permissions
- Suspend or activate accounts
- **Data Displayed**: Username, email, role, last login
### System Metrics (Top Right)
- **Purpose**: Monitor system health and performance
- **Metrics Shown**:
- Active users (current: 1,247)
- Server CPU usage (68%)
- Database connections (34/100)
...

Whiteboard to Pseudocode¶
One of my favorite uses: converting informal sketches into structured logic.
def whiteboard_to_code(whiteboard_photo_url, language="python"):
"""Convert whiteboard algorithm sketch to clean code."""
prompt = f"""This is a photo of a whiteboard showing an algorithm.
Convert it to clean, working {language} code:
- Add proper function structure
- Include type hints/comments
- Handle edge cases
- Add docstring explaining the algorithm
Make it production-ready."""
    return analyze_image(whiteboard_photo_url, prompt, temperature=0.2)

This bridges the gap between brainstorming and implementation.
Use Case 5: Test Generation from Flows¶
Behavior-driven development (BDD) works from user journeys. Multimodal prompting can turn flow diagrams into executable tests.
User Flow to Test Cases¶
def flow_to_tests(flow_diagram_url, framework="pytest"):
"""Generate test cases from a user flow diagram."""
prompt = f"""Analyze this user flow diagram.
Generate {framework} test cases covering:
- Happy path through the entire flow
- Error states at each decision point
- Edge cases shown in the diagram
- Validation at each step
Include:
- Descriptive test names
- Setup and teardown
- Assertions for each validation point
- Comments explaining the flow being tested"""
return analyze_image(flow_diagram_url, prompt, temperature=0.2)
# Example
flow_url = "https://example.com/checkout-flow.png"
tests = flow_to_tests(flow_url, framework="pytest")
with open('test_checkout_flow.py', 'w') as f:
    f.write(tests)

Generated output:
import pytest
from selenium import webdriver
from selenium.webdriver.common.by import By
class TestCheckoutFlow:
"""Tests for the e-commerce checkout process."""
@pytest.fixture
def driver(self):
driver = webdriver.Chrome()
yield driver
driver.quit()
def test_successful_checkout_happy_path(self, driver):
"""Test complete checkout from cart to confirmation."""
# Step 1: User views cart with items
driver.get("https://example.com/cart")
assert len(driver.find_elements(By.CLASS_NAME, "cart-item")) > 0
# Step 2: Proceed to checkout
driver.find_element(By.ID, "checkout-button").click()
# Step 3: Enter shipping information
driver.find_element(By.ID, "shipping-address").send_keys("123 Main St")
driver.find_element(By.ID, "shipping-city").send_keys("Portland")
# Step 4: Select shipping method
driver.find_element(By.ID, "standard-shipping").click()
# ... more steps following the flow diagram ...
def test_checkout_fails_with_invalid_card(self, driver):
"""Test error handling for invalid payment information."""
# Follows the error branch from the flow diagram
        ...

The model extracted every decision point, branch, and validation from the visual flow.
Use Case 6: Code Review with Context¶
Text-only code review misses crucial context. Was this code supposed to implement a specific UI? Does it handle all the states shown in the design?
Contextual Code Review¶
def review_with_context(code_file_path, design_url, feature_screenshot_url):
"""Review code against design intent and actual behavior."""
with open(code_file_path, 'r') as f:
code = f.read()
# First, understand what was supposed to be built
intent_prompt = "Describe what this UI should do, focusing on user interactions and visual states."
intended_behavior = analyze_image(design_url, intent_prompt)
# Then, see what was actually built
actual_prompt = "Describe what this UI currently does and how it behaves."
actual_behavior = analyze_image(feature_screenshot_url, actual_prompt)
# Finally, review the code
review_prompt = f"""Review this code:
{code}
Intended behavior: {intended_behavior}
Actual behavior: {actual_behavior}
Evaluate:
1. Does implementation match intent?
2. Are all visual states handled?
3. Is error handling complete?
4. Any accessibility issues?
5. Code quality and maintainability
Provide specific, actionable feedback."""
    return review_prompt  # Would send to text-only model for analysis

This three-step process (design → implementation → code) provides richer feedback than any single input.
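The function above stops at assembling the prompt. To close the loop, you might send that prompt to a text-only model with the same client. A sketch, assuming you have a text model such as llama3.1 pulled locally:

def run_contextual_review(code_file_path, design_url, feature_screenshot_url,
                          text_model="llama3.1"):
    """Build the context-rich review prompt, then have a text model evaluate it."""
    review_prompt = review_with_context(code_file_path, design_url, feature_screenshot_url)
    result = client.chat(
        model=text_model,  # Assumed text-only model; no images needed at this step
        messages=[{'role': 'user', 'content': review_prompt}],
        options={'temperature': 0.2}
    )
    return result['message']['content']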
Practical Patterns and Best Practices¶
Through these examples, several patterns emerge for effective multimodal prompting in software engineering:
1. Be Specific About Output Format¶
# Vague
prompt = "Convert this diagram to code"
# Specific
prompt = """Convert this diagram to Terraform code.
Requirements:
- Use AWS provider version 4.x
- Include all resources shown
- Add comments for each resource
- Follow naming convention: {env}-{service}-{resource}
- Output only valid HCL, no explanation text"""

2. Provide Code Context When Available¶
def analyze_with_context(screenshot_url, related_code):
    prompt = f"""Analyze this screenshot.
Related code context:
```python
{related_code}
```
Identify discrepancies between visual behavior and code logic."""
    return analyze_image(screenshot_url, prompt)
3. Use Temperature Strategically¶
# Factual analysis: low temperature
errors = analyze_image(error_url, "List all errors shown", temperature=0.1)
# Creative solutions: higher temperature
ideas = analyze_image(wireframe_url, "Suggest innovative interactions", temperature=0.7)

4. Iterate with Follow-up Questions¶
# Initial analysis
overview = analyze_image(diagram_url, "Describe this architecture")
# Drill down on concerns
security = analyze_image(diagram_url,
f"Based on this architecture: {overview}\n\nWhat are the top 3 security risks?")5. Combine Multiple Images Thoughtfully¶
Some tasks benefit from seeing multiple images:
def compare_screens(before_url, after_url):
"""Compare UI before and after a change."""
# Analyze each separately first for detail
before = analyze_image(before_url, "Describe layout and functionality")
after = analyze_image(after_url, "Describe layout and functionality")
# Then synthesize
comparison = f"""
Before: {before}
After: {after}
What changed? Are there any regressions?
"""
    return comparison

Limitations and Gotchas¶
Multimodal prompting is powerful, but it’s not magic. Here are the main pitfalls:
1. Hallucination is More Common¶
Vision models can “see” things that aren’t there:
# Risky: trusting counts without verification
count = analyze_image(ui_url, "How many buttons are visible?")
# Result might be "approximately 7" when there are actually 5
# Better: ask for descriptions
elements = analyze_image(ui_url, "List each button with its text and position")
# Then count programmatically or verify manually

2. Spatial Reasoning Can Be Unreliable¶
# Problematic
layout = analyze_image(page_url, "Is the login button on the left or right?")
# May be inconsistent
# Better
layout = analyze_image(page_url, "Describe the layout from top to bottom")
# Use relative positions: "above", "below", "near"

3. Small Text and Fine Details¶
# Challenging
code = analyze_image(screenshot_url, "Read all the code in this screenshot")
# May miss or misread small text
# Better
code = analyze_image(screenshot_url, "What's the main function signature visible?")
# Focus on specific, larger elements

4. Context Limits Still Apply¶
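The example below leans on a split_diagram_into_sections helper that we never defined. Here is one minimal sketch of it, assuming Pillow is installed and that a simple grid of tiles is good enough; each tile path can then be analyzed on its own (for local files, adapt analyze_image or use a path-based variant):

from PIL import Image

def split_diagram_into_sections(image_path, rows=2, cols=2):
    """Tile a large diagram into smaller images and return their file paths."""
    img = Image.open(image_path)
    width, height = img.size
    tile_w, tile_h = width // cols, height // rows
    paths = []
    for r in range(rows):
        for c in range(cols):
            box = (c * tile_w, r * tile_h, (c + 1) * tile_w, (r + 1) * tile_h)
            tile_path = f"{image_path}.section_{r}_{c}.png"
            img.crop(box).save(tile_path)
            paths.append(tile_path)
    return paths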
# Too much
huge_diagram = "enterprise-architecture-full-system.png" # 8000x6000 px
# May hit size limits or lose detail
# Better
sections = split_diagram_into_sections(huge_diagram)
for section in sections:
    analyze_image(section, "Describe this part of the architecture")

Ethical Considerations¶
Multimodal AI in software engineering raises important questions:
Privacy: Screenshots might contain sensitive data—API keys, user information, proprietary designs. Always:
Scrub sensitive info before sharing images with AI
Use local/private models for confidential work
Check your company’s AI usage policy
Dependency Risk: Over-reliance on AI for architecture decisions can lead to:
Cargo cult programming (code that “looks right” but isn’t)
Loss of fundamental understanding
Vulnerability to model mistakes
Accessibility: Vision-based tools might exclude developers with visual impairments. Always:
Provide text alternatives for critical workflows
Ensure your tools work with screen readers
Test accessibility of generated code
Looking Forward: The Multimodal Future¶
The gap between “how we think about software” and “how we build software” is closing. Soon, you’ll be able to:
Sketch on a tablet → instant prototype
Describe behavior verbally + show mockup → working feature
Point at legacy code in an IDE → get instant refactoring suggestions with visual impact preview
Diagram a feature → complete test suite
But the real power isn’t automation—it’s alignment. Multimodal prompting helps ensure that what you’re thinking, what you’re designing, and what you’re building are all the same thing.
Bringing It All Together: A Complete Example¶
Let’s end with a realistic scenario that combines everything we’ve learned.
Scenario: Modernizing a Legacy Feature¶
You’re tasked with rebuilding an old admin panel. You have:
A screenshot of the current (legacy) UI
A Figma mockup of the new design
The original backend code
A user flow diagram for the feature
Here’s how multimodal prompting helps:
def modernize_feature(
legacy_screenshot_url,
new_design_url,
backend_code_path,
user_flow_url
):
"""Complete feature modernization workflow."""
# Step 1: Understand what exists
print("Analyzing legacy system...")
legacy_analysis = analyze_image(
legacy_screenshot_url,
"Describe this UI's functionality, layout, and user interactions in detail."
)
# Step 2: Understand requirements from new design
print("Analyzing new design...")
design_analysis = analyze_image(
new_design_url,
"""Describe this UI design:
- Components and layout
- User interactions
- States (default, loading, error)
- Differences from a typical admin panel"""
)
# Step 3: Generate frontend code
print("Generating new frontend...")
frontend_code = analyze_image(
new_design_url,
f"""Generate a React component matching this design.
Context from legacy system:
{legacy_analysis}
Requirements:
- Use TypeScript
- Tailwind CSS for styling
- React Hook Form for forms
- Handle loading and error states
- Match the design exactly
Include proper TypeScript types.""",
temperature=0.2
)
# Step 4: Generate tests from user flow
print("Generating test suite...")
tests = analyze_image(
user_flow_url,
"""Generate Playwright tests for this user flow.
Cover:
- Happy path
- All error branches
- Edge cases
- Validation errors
Use TypeScript and modern async/await patterns.""",
temperature=0.2
)
# Step 5: Verify backend compatibility
with open(backend_code_path, 'r') as f:
backend = f.read()
# (This would go to a text-only model)
compatibility_check = f"""
Given this legacy backend:{backend}
And this new frontend:{frontend_code}
1. Are the API contracts compatible?
2. What backend changes are needed?
3. Are there any breaking changes?
4. Suggest migration strategy
"""
return {
'legacy_analysis': legacy_analysis,
'design_analysis': design_analysis,
'frontend_code': frontend_code,
'tests': tests,
'compatibility_notes': compatibility_check
}
# Run the analysis
results = modernize_feature(
legacy_screenshot_url="https://example.com/old-admin.png",
new_design_url="https://example.com/new-design.png",
backend_code_path="backend/admin_controller.py",
user_flow_url="https://example.com/user-flow.png"
)
# Save outputs
with open('new_admin_panel.tsx', 'w') as f:
f.write(results['frontend_code'])
with open('admin_panel.spec.ts', 'w') as f:
f.write(results['tests'])
with open('migration_notes.md', 'w') as f:
f.write(f"""# Admin Panel Modernization
## Legacy Analysis
{results['legacy_analysis']}
## New Design Analysis
{results['design_analysis']}
## Backend Compatibility
{results['compatibility_notes']}
""")
print("✓ Modernization package generated!")This workflow demonstrates the real power: orchestrating multiple multimodal interactions to solve a complex, realistic engineering problem.
Conclusion: Seeing Is Believing¶
Multimodal prompting isn’t just a cool feature—it’s a fundamental shift in how we interact with AI as engineers. By letting models “see” our designs, diagrams, and screenshots, we:
Reduce translation overhead between visual intent and code
Debug faster with visual evidence
Generate infrastructure from architecture diagrams
Create tests from user flows
Review code with full context
Most importantly, we work the way humans naturally work: visually, contextually, iteratively.