
Teaching Machines to Understand Programs

What if we could teach computers to understand code the way they understand human language? What if a program could recognize that for i in range(n) and while i < n are semantically related, even though they use different syntax? This is possible with code embeddings, where we transform the discrete, symbolic nature of programming languages into continuous mathematical spaces that machines can reason about.

The Vector Space of Code

The key insight is that code has structure that generic text models miss. Operators can matter more than variable names. Indentation carries meaning. A single character change (< to >) can completely reverse a program's logic. Code-specific embeddings learn patterns like these from millions of examples, giving us representations that reflect how programmers think about code (though, as we'll see, what a model actually learns doesn't always match these intuitions).

What Are Code Embeddings?

Recall that an embedding is a learned mapping from discrete objects (like words, sentences, or code) into continuous vector spaces. Instead of treating code as raw text, we represent it as points in a mathematical space where:

  • Distance captures semantic similarity

  • Direction encodes meaningful relationships

  • Arithmetic reveals hidden patterns
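
Before we touch a real model, a toy example makes these properties concrete. The vectors below are made up purely for illustration; real embeddings have hundreds of dimensions and are learned, not hand-written:

import numpy as np

def toy_cosine(u, v):
    """Cosine similarity: 1.0 means same direction, near 0 means unrelated."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Hypothetical 3-dimensional "embeddings" (illustrative values only)
iterative_sum   = np.array([0.9, 0.1, 0.0])
iterative_count = np.array([0.8, 0.2, 0.1])
send_email      = np.array([0.0, 0.1, 0.9])

print(toy_cosine(iterative_sum, iterative_count))  # high: related code
print(toy_cosine(iterative_sum, send_email))       # low: unrelated code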

Why Code Embeddings?

Code has semantic meaning just like natural language:

  • Two functions might solve the same problem differently

  • Comments and variable names carry intent

  • Code structure reveals algorithmic approach

For example, these Python functions do the same thing:

def sum_list_v1(numbers):
    total = 0
    for n in numbers:
        total += n
    return total

def sum_list_v2(lst):
    return sum(lst)

A good embedding should recognize their similarity!

Code Embedding Models

Modern code embeddings come from models trained on millions of repositories.

UniXcoder (2022) is a unified, transformer-based model that covers multiple programming languages and supports both code understanding and generation tasks. It outperforms earlier code models such as CodeBERT and remains one of the stronger open-source options for code understanding.

Let’s start with a simple example using a pre-trained model.

# Using UniXcoder for code embeddings
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np

tokenizer = AutoTokenizer.from_pretrained("microsoft/unixcoder-base")
model = AutoModel.from_pretrained("microsoft/unixcoder-base")

def embed_code(code_snippet):
    """Generate an embedding for a code snippet using mean pooling."""
    # Tokenize
    inputs = tokenizer(code_snippet, return_tensors="pt",
                       truncation=True, max_length=512,
                       padding=True)

    # Run the model without tracking gradients
    with torch.no_grad():
        outputs = model(**inputs)

    # Mean pooling over all tokens (a common practice for embeddings)
    attention_mask = inputs['attention_mask']
    token_embeddings = outputs.last_hidden_state

    # Weighted average using the attention mask so padding tokens are ignored
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    embedding = (sum_embeddings / sum_mask).squeeze().numpy()

    return embedding

def cosine_similarity(emb1, emb2):
    """Compute cosine similarity between two embeddings."""
    # Ensure numpy arrays
    if torch.is_tensor(emb1):
        emb1 = emb1.cpu().numpy()
    if torch.is_tensor(emb2):
        emb2 = emb2.cpu().numpy()
    
    # Compute cosine similarity
    dot_product = np.dot(emb1, emb2)
    norm1 = np.linalg.norm(emb1)
    norm2 = np.linalg.norm(emb2)
    
    return dot_product / (norm1 * norm2)

# Compare two functionally equivalent sum functions
emb1 = embed_code("""
def change_list(numbers):
    total = 0
    for n in numbers:
        total += n
    return total
""")

emb2 = embed_code("""
def sum_list(lst):
    return sum(lst)
""")

emb3 = embed_code("""
def multiply_list(numbers):
    result = 1
    for n in numbers:
        result *= n
    return result
""")

similarity_same = cosine_similarity(emb1, emb2)
similarity_different = cosine_similarity(emb1, emb3)

print(f"Similarity (iterative sum vs built-in sum): {similarity_same:.4f}")
print(f"Similarity (sum vs multiply):               {similarity_different:.4f}")
print(f"\nThe iterative sum and the built-in sum score {similarity_same:.4f},")
print(f"while the structurally similar multiply loop scores {similarity_different:.4f}.")
Similarity (iterative sum vs built-in sum): 0.6089
Similarity (sum vs multiply):               0.8638

The iterative sum and the built-in sum score 0.6089,
while the structurally similar multiply loop scores 0.8638.

Two things stand out here. Changing a single operator (plus the name and initialization value of result) drops the multiply loop's similarity well below a perfect 1.0, so the model does register the change. At the same time, that structurally similar loop still scores higher than the semantically equivalent built-in sum, a first hint that surface structure carries real weight, a theme we will return to later.


# Test variations of the same function
original = """def add_numbers(a, b):
    '''Add two numbers together.'''
    return a + b"""

variations = {
    "renamed_vars": """def add_numbers(x, y):
    '''Add two numbers together.'''
    return x + y""",
    
    "different_name": """def sum_values(a, b):
    '''Add two numbers together.'''
    return a + b""",
    
    "with_typing": """def add_numbers(a: int, b: int) -> int:
    '''Add two numbers together.'''
    return a + b""",
    
    "different_logic": """def multiply_numbers(a, b):
    '''Multiply two numbers together.'''
    return a * b""",
    
    "unrelated": """def send_email(recipient, message):
    '''Send an email to recipient.'''
    print(f"Sending to {recipient}: {message}")"""
}

# Compare all variations to original
base_emb = embed_code(original)

print("Similarity to Original:\n")
results = []
for name, code in variations.items():
    emb = embed_code(code)
    sim = cosine_similarity(base_emb, emb)
    results.append((name, sim))

# Sort by similarity (highest first)
results.sort(key=lambda x: x[1], reverse=True)

for name, sim in results:
    print(f"{sim:.4f}  {name}")
Similarity to Original:

0.9597  with_typing
0.9493  different_name
0.9301  renamed_vars
0.8477  different_logic
0.3574  unrelated
# Original function
original = """def validate_age(age):
    '''Check if age meets minimum requirement.'''
    if age < 18:
        return False
    return True"""

# Test cases: progressively change operators
test_cases = {
    "original": {
        "code": """def validate_age(age):
    '''Check if age meets minimum requirement.'''
    if age < 18:
        return False
    return True""",
        "changes": "None (baseline)",
        "description": "Original: age < 18"
    },
    
    "one_operator": {
        "code": """def validate_age(age):
    '''Check if age meets minimum requirement.'''
    if age > 18:
        return False
    return True""",
        "changes": "1 operator: < → >",
        "description": "Changed: age > 18 (completely different logic)"
    },
    
    "two_operators": {
        "code": """def validate_age(age):
    '''Check if age meets minimum requirement.'''
    if age > 18:
        return True
    return False""",
        "changes": "2 operators: < → >, flip returns",
        "description": "Changed: age > 18, swapped True/False"
    },
    
    "three_operators": {
        "code": """def validate_age(age):
    '''Check if age meets minimum requirement.'''
    if age >= 21:
        return True
    return False""",
        "changes": "3 operators: < → >=, 18 → 21, flip returns",
        "description": "Changed: age >= 21, swapped True/False"
    },
    
    "cosmetic_only": {
        "code": """def validate_age(user_age):
    '''Check if age meets minimum requirement.'''
    if user_age < 18:
        return False
    return True""",
        "changes": "0 operators (cosmetic: age → user_age)",
        "description": "Only variable name changed"
    },
    
    "comment_only": {
        "code": """def validate_age(age):
    '''Check if age meets minimum requirement.'''
    # Checking minimum age limit
    if age < 18:
        return False
    return True""",
        "changes": "0 operators (added comment)",
        "description": "Only added comment"
    }
}

print("\n" + "="*70)
print("UniXcoder Operator Sensitivity Test")
print("="*70)
print("\nOriginal Function:")
print(original)
print("\n" + "="*70)

# Generate embeddings
embeddings = {}
for name, test in test_cases.items():
    embeddings[name] = embed_code(test['code'])

# Calculate similarities to original
base = embeddings["original"]
results = []

for name, test in test_cases.items():
    if name != "original":
        sim = cosine_similarity(base, embeddings[name])
        results.append({
            'name': name,
            'similarity': sim,
            'changes': test['changes'],
            'description': test['description']
        })

# Sort by similarity (highest first)
results.sort(key=lambda x: x['similarity'], reverse=True)

print("\nSimilarity to Original (sorted high to low):")
print("="*70)
print(f"{'Test Case':<20} {'Similarity':<12} {'Changes':<30}")
print("-"*70)

for r in results:
    print(f"{r['name']:<20} {r['similarity']:<12.4f} {r['changes']:<30}")

print("\n" + "="*70)
print("Detailed Analysis")
print("="*70)

for r in results:
    print(f"\n{r['name']}:")
    print(f"  Similarity: {r['similarity']:.4f}")
    print(f"  Changes: {r['changes']}")
    print(f"  Description: {r['description']}")

======================================================================
UniXcoder Operator Sensitivity Test
======================================================================

Original Function:
def validate_age(age):
    '''Check if age meets minimum requirement.'''
    if age < 18:
        return False
    return True

======================================================================

Similarity to Original (sorted high to low):
======================================================================
Test Case            Similarity   Changes                       
----------------------------------------------------------------------
one_operator         0.9970       1 operator: < → >             
two_operators        0.9942       2 operators: < → >, flip returns
comment_only         0.9715       0 operators (added comment)   
three_operators      0.9623       3 operators: < → >=, 18 → 21, flip returns
cosmetic_only        0.9424       0 operators (cosmetic: age → user_age)

======================================================================
Detailed Analysis
======================================================================

one_operator:
  Similarity: 0.9970
  Changes: 1 operator: < → >
  Description: Changed: age > 18 (completely different logic)

two_operators:
  Similarity: 0.9942
  Changes: 2 operators: < → >, flip returns
  Description: Changed: age > 18, swapped True/False

comment_only:
  Similarity: 0.9715
  Changes: 0 operators (added comment)
  Description: Only added comment

three_operators:
  Similarity: 0.9623
  Changes: 3 operators: < → >=, 18 → 21, flip returns
  Description: Changed: age >= 21, swapped True/False

cosmetic_only:
  Similarity: 0.9424
  Changes: 0 operators (cosmetic: age → user_age)
  Description: Only variable name changed

Understanding Variable Names

Here’s something subtle: UniXcoder doesn’t just ignore variable names as cosmetic details. It understands they carry semantic meaning. Consider these variations:

print("="*70)
print("Why Variable Names Matter to UniXcoder")
print("="*70)

# Test: Same logic, different variable name lengths
test_cases = {
    "original": "def check(x):\n    return x < 10",
    "short_rename": "def check(y):\n    return y < 10",
    "long_rename": "def check(value):\n    return value < 10",
    "very_long_rename": "def check(input_value):\n    return input_value < 10",
    "semantic_rename": "def check(threshold):\n    return threshold < 10",
}

print("\nOriginal: def check(x): return x < 10")
print("\nComparing variable name changes:")
print("-"*70)

base = embed_code(test_cases["original"])

for name, code in test_cases.items():
    if name != "original":
        emb = embed_code(code)
        sim = cosine_similarity(base, emb)
        var_name = code.split('(')[1].split(')')[0]
        print(f"{name:20} (var={var_name:15}) similarity: {sim:.4f}")

======================================================================
Why Variable Names Matter to UniXcoder
======================================================================

Original: def check(x): return x < 10

Comparing variable name changes:
----------------------------------------------------------------------
short_rename         (var=y              ) similarity: 0.8852
long_rename          (var=value          ) similarity: 0.8973
very_long_rename     (var=input_value    ) similarity: 0.8535
semantic_rename      (var=threshold      ) similarity: 0.7618


Insight: Variable Names Carry Semantic Information!

UniXcoder doesn’t just see variables as placeholders. It recognizes:

  • ‘threshold’ suggests boundary checking, so it shifts the embedding the most

  • ‘value’ and ‘input_value’ are mildly descriptive

  • ‘x’ and ‘y’ are generic mathematical placeholders

  • in the earlier test, ‘age’ and ‘user_age’ similarly signaled age-related logic

When you change variable names, you’re potentially changing the semantic context of the code, which UniXcoder picks up on!

Good developers choose meaningful variable names, and UniXcoder learned to recognize this.

The short rename (x to y) barely changes the embedding; both are generic mathematical variables. But renaming to threshold noticeably reduces similarity. Why? Because threshold adds semantic context: the function isn’t just checking whether a number is less than 10, it’s checking whether a value stays below some threshold. That’s meaningful information that changes how we understand the code.

This is actually desirable behavior. If you’re searching for “age validation functions”, you want functions that use the variable name age to rank higher than functions using x. The variable name is a signal about the function’s purpose.
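
We can see this signal in a small retrieval check by embedding a natural-language query and comparing it to the generic and descriptive variants from above. This sketch reuses the embed_code and cosine_similarity helpers defined earlier; the exact scores will vary, but the expectation is that the descriptive name ranks higher:

query_emb = embed_code("check whether a value is below a threshold")

generic_code = "def check(x):\n    return x < 10"
descriptive_code = "def check(threshold):\n    return threshold < 10"

print(f"query vs generic name:     {cosine_similarity(query_emb, embed_code(generic_code)):.4f}")
print(f"query vs descriptive name: {cosine_similarity(query_emb, embed_code(descriptive_code)):.4f}")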

Comparing Code-Specific vs Generic Embeddings

Let’s run a direct comparison. We’ll use UniXcoder and a generic text embedding model (nomic-embed-text) on the same code:

import ollama

ollama_client = ollama.Client(host='http://ollama.cs.wallawalla.edu:11434')

def get_ollama_embedding(text):
    response = ollama_client.embeddings(
        model='nomic-embed-text',
        prompt=text
    )
    return response['embedding']
test_cases = {
    "original": {
        "code": """def validate_age(age):
    '''Check if age meets minimum requirement.'''
    if age < 18:
        return False
    return True""",
        "changes": "None (baseline)",
        "description": "Original: age < 18"
    },
    
    "one_operator": {
        "code": """def validate_age(age):
    '''Check if age meets minimum requirement.'''
    if age > 18:
        return False
    return True""",
        "changes": "1 operator: < → >",
        "description": "Changed: age > 18 (completely different logic)"
    },
}
# Compare UniXcoder vs Generic for one operator change
original_unix = embed_code(original)
one_op_unix = embed_code(test_cases["one_operator"]["code"])

original_gen = np.array(get_ollama_embedding(original))
one_op_gen = np.array(get_ollama_embedding(test_cases["one_operator"]["code"]))

unix_sim = cosine_similarity(original_unix, one_op_unix)
gen_sim = cosine_similarity(original_gen, one_op_gen)

print("\n" + "="*70)
print("UniXcoder vs Generic Embedding Comparison")
print("="*70)
print(f"UniXcoder similarity (< → >): {unix_sim:.4f}")
print(f"Generic similarity (< → >):   {gen_sim:.4f}")
print(f"Difference:                   {abs(unix_sim - gen_sim):.4f}")

======================================================================
UniXcoder vs Generic Embedding Comparison
======================================================================
UniXcoder similarity (< → >): 0.9970
Generic similarity (< → >):   0.9995
Difference:                   0.0025

Which model better recognizes the logical change?

UniXcoder's similarity drops slightly more than the generic nomic-embed-text model's (0.9970 vs 0.9995), so it is marginally more sensitive to the single operator change, but the margin is tiny: on this one example, neither model treats a flipped comparison as a dramatic semantic change. One pair is also a thin basis for a conclusion, which motivates comparing across more variants.
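
To check how each model behaves beyond a single pair, we can loop both of them over a few more variants of the same function. This is only a sketch: it reuses embed_code, get_ollama_embedding, and cosine_similarity from above, and the variant dictionary is illustrative rather than exhaustive.

variants = {
    "flipped operator (< to >)": original.replace("< 18", "> 18"),
    "raised threshold (18 to 21)": original.replace("18", "21"),
    "added comment": original.replace("if age <", "# minimum age limit\n    if age <"),
}

base_unix = embed_code(original)
base_gen = np.array(get_ollama_embedding(original))

print(f"{'Variant':<30} {'UniXcoder':>10} {'Generic':>10}")
for name, code in variants.items():
    unix = cosine_similarity(base_unix, embed_code(code))
    gen = cosine_similarity(base_gen, np.array(get_ollama_embedding(code)))
    print(f"{name:<30} {unix:>10.4f} {gen:>10.4f}")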

Building a Code Search Engine

Now let’s build something practical: a semantic code search engine that finds relevant functions even when they don’t share keywords.

import numpy as np

class CodeSearchEngine:
    def __init__(self):
        self.code_snippets = []
        self.embeddings = []
    
    def index(self, code_snippet):
        """Add code to the search index."""
        embedding = embed_code(code_snippet)
        self.code_snippets.append(code_snippet)
        self.embeddings.append(embedding)
    
    def search(self, query, top_k=3):
        """Find most similar code snippets."""
        query_emb = embed_code(query)
        
        similarities = [
            cosine_similarity(query_emb, emb) 
            for emb in self.embeddings
        ]
        
        # Get top-k indices
        top_indices = np.argsort(similarities)[-top_k:][::-1]
        
        results = []
        for idx in top_indices:
            results.append({
                'code': self.code_snippets[idx],
                'similarity': similarities[idx]
            })
        
        return results

# Build a small code repository
search_engine = CodeSearchEngine()

search_engine.index("def add(a, b): return a + b")
search_engine.index("def multiply(x, y): return x * y")
search_engine.index("def read_file(path): return open(path).read()")
search_engine.index("def sum_list(items): return sum(items)")

# Search with a natural-language query that shares no keywords with the code
results = search_engine.search("combine two numbers", top_k=2)

for i, result in enumerate(results, 1):
    print(f"\n{i}. Similarity: {result['similarity']:.4f}")
    print(f"   {result['code']}")

1. Similarity: 0.4529
   def sum_list(items): return sum(items)

2. Similarity: 0.3800
   def multiply(x, y): return x * y

Notice how the search returns sum_list and multiply as the most relevant results even though the query contains neither “sum” nor “add”: the match is based on meaning rather than shared keywords.
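
For a handful of snippets this linear scan is fine, but for a real repository you would typically stack the embeddings into one matrix and score every snippet with a single vectorized operation. Here is a minimal sketch of that idea, built on the same embed_code helper (a production system would add persistence and an approximate-nearest-neighbor index):

class MatrixCodeSearch:
    """Same idea as CodeSearchEngine, but scores all snippets at once."""

    def __init__(self):
        self.snippets = []
        self.matrix = None  # one L2-normalized embedding per row

    def index(self, snippet):
        emb = embed_code(snippet)
        emb = emb / np.linalg.norm(emb)
        self.snippets.append(snippet)
        row = emb.reshape(1, -1)
        self.matrix = row if self.matrix is None else np.vstack([self.matrix, row])

    def search(self, query, top_k=3):
        q = embed_code(query)
        q = q / np.linalg.norm(q)
        scores = self.matrix @ q                 # cosine similarity against every row
        order = np.argsort(scores)[::-1][:top_k]
        return [(self.snippets[i], float(scores[i])) for i in order]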

Code Vector Arithmetic

Here’s where things get truly magical. Remember the famous word2vec example: king - man + woman = queen? Code embeddings exhibit similar algebraic properties!

If embeddings capture semantic relationships, then vector arithmetic should reveal patterns:

  • sorting_function - recursion + iteration ≈ iterative_sort

  • python_function - python + javascript ≈ javascript_function

  • vulnerable_code - vulnerability ≈ safe_code

Let’s implement and test this. If the analogy holds, the model should identify the iterative factorial as the best match, demonstrating that embeddings capture conceptual transformations like “converting recursion to iteration.”

def code_analogy(a, b, c, candidates, top_k=1):
    """
    Solve: a is to b as c is to ?
    (i.e., find d such that a - b ≈ c - d)
    """
    emb_a = embed_code(a)
    emb_b = embed_code(b)
    emb_c = embed_code(c)
    
    # Compute target vector: c + (a - b)
    target = emb_c + (emb_a - emb_b)
    
    # Find closest candidate
    similarities = []
    for candidate in candidates:
        emb_d = embed_code(candidate)
        sim = cosine_similarity(target, emb_d)
        similarities.append((candidate, sim))
    
    # Sort by similarity
    similarities.sort(key=lambda x: x[1], reverse=True)
    
    return similarities[:top_k]

# Example: "recursive" is to "iterative" as "factorial_recursive" is to ?
recursive = "def factorial(n): return 1 if n <= 1 else n * factorial(n-1)"
iterative = "def count(n):\n  i = 0\n  while i < n: i += 1"

fact_recursive = "def fact(n):\n  if n == 0: return 1\n  return n * fact(n-1)"


candidates = [
    """def fact_iter(n):
    result = 1
    for i in range(1, n+1): result *= i
    return result""",
    
    """def fibonacci(n):
    if n <= 1: return n
    return fibonacci(n-1) + fibonacci(n-2)""",
]

print("Analogy: recursive : iterative :: factorial_recursive : ?")
results = code_analogy(recursive, iterative, fact_recursive, candidates)

for code, score in results:
    print(f"\nScore: {score:.4f}")
    print(code)
Analogy: recursive : iterative :: factorial_recursive : ?

Score: 0.4351
def fibonacci(n):
    if n <= 1: return n
    return fibonacci(n-1) + fibonacci(n-2)

This shows something very interesting! The model is actually picking fibonacci because it’s structurally most similar to the recursive factorial pattern, even though semantically we wanted the iterative version. This teaches an important lesson: embeddings capture patterns they were trained on, and sometimes structural similarity dominates semantic transformations.

We can use multiple examples to learn the transformation better:

def improved_analogy(source_pairs, query, candidates, top_k=1):
    """
    Learn transformation from multiple (source, target) pairs.
    More robust than single-example analogy.
    """
    # Compute average transformation vector
    transform_vectors = []
    for src, tgt in source_pairs:
        emb_src = embed_code(src)
        emb_tgt = embed_code(tgt)
        transform_vectors.append(emb_tgt - emb_src)
    
    # Average transformation (recursive → iterative)
    avg_transform = np.mean([v.numpy() if torch.is_tensor(v) else v 
                            for v in transform_vectors], axis=0)
    
    # Apply to query
    emb_query = embed_code(query)
    if torch.is_tensor(emb_query):
        emb_query = emb_query.numpy()
    
    target = emb_query + avg_transform
    
    # Find best match
    similarities = []
    for candidate in candidates:
        emb_cand = embed_code(candidate)
        sim = cosine_similarity(target, emb_cand)
        similarities.append((candidate, sim))
    
    similarities.sort(key=lambda x: x[1], reverse=True)
    return similarities[:top_k]

# Provide multiple examples of recursive→iterative transformation
recursive_iterative_pairs = [
    ("""def sum_rec(n):
        if n == 0: return 0
        return n + sum_rec(n-1)""",
     """def sum_iter(n):
        total = 0
        for i in range(n+1): total += i
        return total"""),
    
    ("""def count_rec(n):
        if n == 0: return
        count_rec(n-1)
        print(n)""",
     """def count_iter(n):
        for i in range(1, n+1):
            print(i)"""),
]

query = """def factorial_rec(n):
    if n <= 1: return 1
    return n * factorial_rec(n-1)"""

print("\nImproved analogy with multiple examples:")
results = improved_analogy(recursive_iterative_pairs, query, candidates)
for code, score in results:
    print(f"Score: {score:.4f}")
    print(code)

Improved analogy with multiple examples:
Score: 0.7618
def fact_iter(n):
    result = 1
    for i in range(1, n+1): result *= i
    return result

And now we get the correct result!

The Multiple-Exemplar Approach: Learning Robust Transformations

The multiple-exemplar approach addresses a fundamental limitation of single-example analogies. Let’s break this down with both intuition and implementation.

The Problem with Single Examples

When we do a - b + c, we’re assuming that the vector a - b captures a pure, general transformation. But in reality:

# What we hope:
recursive_func - iterative_func = "pure recursion concept"

# What we actually get:
recursive_func - iterative_func = "recursion" + "factorial-ness" + "variable names" + noise

A single example is contaminated by the specific details of those two functions!

By averaging transformations across multiple examples, we cancel out the noise and extract the common pattern.

Why This Works: The Math

The key insight is linear superposition. If embeddings capture semantic features linearly:

# Single example (noisy):
transform₁ = (recursive_concept + sum_rec_details) -
             (iterative_concept + sum_iter_details)
           = recursive_concept - iterative_concept + noise₁
# where noise₁ = sum_rec_details - sum_iter_details (example-specific leftovers)

# Multiple examples:
avg_transform = mean([
    recursive_concept - iterative_concept + noise₁,
    recursive_concept - iterative_concept + noise₂,
    recursive_concept - iterative_concept + noise₃
])

# Noise cancels out!
avg_transform ≈ recursive_concept - iterative_concept
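
We can verify this cancellation numerically with synthetic vectors. The “true” transformation and the per-example noise below are random stand-ins, purely to show that averaging pulls the estimate back toward the underlying direction:

rng = np.random.default_rng(0)
dim = 768  # same dimensionality as UniXcoder embeddings

true_transform = rng.normal(size=dim)                       # the "pure" concept shift
noisy_examples = [true_transform + rng.normal(scale=2.0, size=dim)
                  for _ in range(10)]                       # each example adds its own noise

def direction_match(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(f"single example vs true transform: {direction_match(noisy_examples[0], true_transform):.3f}")
print(f"average of 10 vs true transform:  {direction_match(np.mean(noisy_examples, axis=0), true_transform):.3f}")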

The multiple-exemplar approach teaches us that the standard lessons from machine learning apply here: one example is anecdote, and many examples are data.

This principle extends beyond code embeddings to:

  • Style transfer in images (multiple examples of “Picasso style”)

  • Voice conversion (multiple examples of speaker A → speaker B)

  • Language translation (multiple phrase pairs)

The beauty of this is that the embedding space has linear structure that lets us manipulate concepts algebraically!

The Theory Behind Code Embeddings

Why do code embeddings work so well? The secret lies in how these models are trained.

UniXcoder and similar models use several training objectives:

  1. Masked Language Modeling (MLM): Randomly mask tokens and predict them

    # Input:  "def sort([MASK]): return sorted(arr)"
    # Predict: "arr"
  2. Contrastive Learning: Similar code should have similar embeddings (a toy sketch follows this list)

    • Positive pairs: (code, docstring), (original, paraphrased)

    • Negative pairs: Random unrelated code

  3. Next Token Prediction: Predict the next token in a sequence
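
To make the contrastive objective concrete, here is a toy InfoNCE-style loss computed on a batch of paired embeddings. The random tensors stand in for encoder outputs; this illustrates the general recipe, not UniXcoder's actual training code:

import torch.nn.functional as F

def contrastive_loss(code_embs, doc_embs, temperature=0.05):
    """Each code embedding should match its own docstring embedding,
    with every other docstring in the batch acting as a negative."""
    code_embs = F.normalize(code_embs, dim=-1)
    doc_embs = F.normalize(doc_embs, dim=-1)
    logits = code_embs @ doc_embs.T / temperature   # pairwise cosine similarities
    targets = torch.arange(code_embs.size(0))       # i-th code pairs with i-th doc
    return F.cross_entropy(logits, targets)

# Toy batch: 4 random vectors standing in for encoder outputs
code_batch = torch.randn(4, 768)
doc_batch = torch.randn(4, 768)
print(contrastive_loss(code_batch, doc_batch))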

The ideal code embedding model should capture semantic equivalence, be robust to variable naming and style, and generalize to unseen code patterns. The steady gains in the performance and efficiency of modern code models represent a real shift in how machines can reason about computation itself.

Limitations

UniXcoder isn’t perfect. It has a maximum context length (512 tokens), so very long functions get truncated. It was trained primarily on popular languages (Python, Java, JavaScript), so it might not understand exotic languages as well. And it captures syntactic and structural similarities better than deep semantic equivalence—two functions that compute the same result through completely different algorithms might not score as similar as you’d expect.
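
One pragmatic workaround for the 512-token limit is to embed long code in overlapping chunks and average the chunk embeddings. The helper below is a rough sketch built on the tokenizer and embed_code defined earlier, not a method the UniXcoder authors prescribe, and it inevitably blurs fine-grained structure:

def embed_long_code(code, max_tokens=512, stride=256):
    """Embed code longer than the context window by chunking and averaging."""
    token_ids = tokenizer.encode(code, add_special_tokens=False)
    if len(token_ids) <= max_tokens:
        return embed_code(code)

    chunk_embeddings = []
    for start in range(0, len(token_ids), stride):
        chunk_text = tokenizer.decode(token_ids[start:start + max_tokens])
        chunk_embeddings.append(embed_code(chunk_text))
        if start + max_tokens >= len(token_ids):
            break

    # Simple mean; weighting chunks by token count is another option
    return np.mean(chunk_embeddings, axis=0)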

Most importantly, embeddings compress complex code into fixed-size vectors. Information is inevitably lost. Two functions might have the same embedding despite subtle but important differences. The model provides similarity scores, not guarantees of equivalence.