import ollama

# Configure your server URL here
SERVER_HOST = 'http://ollama.cs.wallawalla.edu:11434'

client = ollama.Client(host=SERVER_HOST)

def call_ollama(prompt, model="cs450", **options):
    """
    Send a prompt to the Ollama API.

    Args:
        prompt (str): The prompt to send
        model (str): Model name to use
        **options: Additional model parameters (temperature, top_k, etc.)

    Returns:
        str: The model's response
    """
    try:
        response = client.generate(
            model=model,
            prompt=prompt,
            options=options
        )
        return response['response']
    except Exception as e:
        return f"Error: {e}"

def call_ollama_full(prompt, model="cs450", **options):
    try:
        response = client.generate(
            model=model,
            prompt=prompt,
            options=options
        )
        return response
    except Exception as e:
        return f"Error: {e}"
Introduction¶
Before a language model can process code, it must first convert that code into a representation it can understand. This conversion process — tokenization — is more fundamental than it might initially appear. The way code is tokenized directly impacts what the model “sees,” what patterns it can learn, and ultimately how well it can generate, understand, and manipulate code.
Unlike natural language, where tokenization boundaries often align with word boundaries, code presents unique challenges. Should getUserById be treated as one token or split into get, User, By, Id? How should operators like ->, ==, or >>> be handled? What about indentation in Python, where whitespace carries semantic meaning?
This chapter explores how language models represent code at the token level, why these representations matter for code generation tasks, and how understanding tokenization helps you write better prompts and debug model behavior.
The Basics: What is Tokenization?¶
Tokenization is the process of breaking text (including code) into discrete units called tokens. These tokens become the atomic elements the model operates on — it reads tokens, thinks in tokens, and generates tokens.
def demonstrate_tokenization_concept():
    """Show how models see text as tokens, not characters."""
    # Ask the model to count tokens in different strings
    test_strings = [
        "hello",
        "hello world",
        "getUserById",
        "get_user_by_id",
        "x = y + z"
    ]

    print("Understanding Tokenization\n" + "="*60 + "\n")

    for string in test_strings:
        prompt = f"""How many tokens would a language model typically use to represent this text: "{string}"
Just give me a number and brief explanation."""
        response = call_ollama(prompt, temperature=0.2, num_predict=60)
        print(f"Text: '{string}'")
        print(f"Response: {response}\n")

if __name__ == "__main__":
    demonstrate_tokenization_concept()

Understanding Tokenization
============================================================
Text: 'hello'
Response: 3 tokens
Explanation: The word "hello" is usually represented by 3 tokens in most language models, including the start-of-word token, the word itself, and the end-of-word token.
Text: 'hello world'
Response: 5 tokens
Explanation: The text "hello world" is typically represented using 5 tokens in most language models, including the space between "hello" and "world".
Text: 'getUserById'
Response: 3 tokens
Explanation: The phrase "getUserById" is a simple function name or method call, which can be represented by three tokens in most language models: one for the verb ("get"), one for the noun ("User"), and one for the identifier ("byId").
Text: 'get_user_by_id'
Response: 3
The term "get_user_by_id" is a simple function name consisting of three words, which would typically be represented by three tokens in most language models.
Text: 'x = y + z'
Response: 3
The text "x = y + z" is represented by three tokens in most language models: one for each variable (x, y, z) and one for the operator (+).
Code Tokenization¶
Let’s explore how code elements are typically tokenized.
def explore_vocabulary():
    """Explore how different code elements might be tokenized."""
    code_samples = [
        "def calculate_sum(a, b):",
        "function calculateSum(a, b) {",
        "public static void main(String[] args) {",
        "x = [i**2 for i in range(10)]"
    ]

    print("Code Tokenization Patterns\n" + "="*60 + "\n")

    for code in code_samples:
        prompt = f"""For a code-specialized language model, describe how this code would likely be tokenized:
Code: {code}
List the approximate tokens (split by '|' at the likely token boundaries). Be brief."""
        response = call_ollama(prompt, temperature=0.1, num_predict=120)
        print(f"Code: {code}")
        print(f"Tokenization: {response}\n")

if __name__ == "__main__":
    explore_vocabulary()

Code Tokenization Patterns
============================================================
Code: def calculate_sum(a, b):
Tokenization: def|calculate_sum|(|a|,|b||)|:
Code: function calculateSum(a, b) {
Tokenization: function|calculateSum|(|a|,|b||)|{
Code: public static void main(String[] args) {
Tokenization: public|static|void|main|(String[]|args)|{|}|
Code: x = [i**2 for i in range(10)]
Tokenization: x | = | [ | i | ** | 2 | for | i | in | range | ( | 10 | ) | ]
Why Tokenization Matters for Code¶
Unlike natural language, code has:
- Precise syntax: Every character can matter (`=` vs `==`)
- Meaningful whitespace: Indentation in Python, formatting in all languages
- Special operators: `->`, `::`, `>>>`, `**`, etc.
- Case sensitivity: `userName` vs `UserName` vs `USERNAME`
- Domain-specific identifiers: API names, library functions, variable names
Poor tokenization can lead to:
- Inefficient representation (more tokens = less context fits in the window)
- Loss of structural information
- Difficulty learning patterns
- Poor generation quality
def demonstrate_tokenization_impact():
    """Show how tokenization affects model understanding."""
    # Same functionality, different naming conventions
    code_variants = [
        "def get_user_by_id(user_id):\n return database.find(user_id)",
        "def getUserById(userId):\n return database.find(userId)",
        "def GETUSERBYID(USERID):\n return DATABASE.FIND(USERID)"
    ]

    print("Tokenization Impact on Understanding\n" + "="*60 + "\n")

    for code in code_variants:
        prompt = f"""Is this proper code style? One sentence.
{code}
Assessment:"""
        response = call_ollama(prompt, temperature=0.3, num_predict=50)
        print(f"Code:\n{code}\n")
        print(f"Assessment: {response}\n")
        print("-" * 60 + "\n")

if __name__ == "__main__":
    demonstrate_tokenization_impact()

Tokenization Impact on Understanding
============================================================
Code:
def get_user_by_id(user_id):
return database.find(user_id)
Assessment: Yes, the provided code snippet follows a simple and straightforward style that is generally considered acceptable in Python for defining a function to retrieve a user by their ID from a database. The function name `get_user_by_id` clearly describes its purpose, and the
------------------------------------------------------------
Code:
def getUserById(userId):
return database.find(userId)
Assessment: Yes, the provided code snippet is in proper Python code style. It follows the PEP 8 guidelines for function naming and uses clear, concise syntax. The function `getUserById` takes a parameter `userId` and returns the result of calling the
------------------------------------------------------------
Code:
def GETUSERBYID(USERID):
return DATABASE.FIND(USERID)
Assessment: No, this is not proper code style. The function name should be in lowercase with words separated by underscores, and it's a good practice to include type hints for parameters and return values. Here's an improved version:
```python
def get_user
------------------------------------------------------------
Byte Pair Encoding (BPE): The Standard Approach¶
Most modern LLMs use Byte Pair Encoding (BPE) or variants like WordPiece or SentencePiece. BPE learns a vocabulary by iteratively merging the most frequent character pairs.
How BPE Works (Simplified)¶
- Start with character-level vocabulary
- Find most frequent adjacent pair
- Merge this pair into a new token
- Repeat until vocabulary reaches desired size
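To make these steps concrete, here is a minimal, illustrative sketch of the BPE merge loop on a toy corpus. It is not a production tokenizer (real implementations work on bytes, use pre-tokenization, and train on huge corpora); the corpus and merge count below are arbitrary.

from collections import Counter

def toy_bpe_train(corpus, num_merges=4):
    """Learn BPE merge rules on a tiny corpus (illustrative only)."""
    # Start with each word as a sequence of single characters
    words = [tuple(word) for word in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent pair of symbols across the corpus
        pair_counts = Counter()
        for word in words:
            for a, b in zip(word, word[1:]):
                pair_counts[(a, b)] += 1
        if not pair_counts:
            break
        # Merge the most frequent pair into a single new symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        new_words = []
        for word in words:
            merged, i = [], 0
            while i < len(word):
                if i < len(word) - 1 and (word[i], word[i + 1]) == best:
                    merged.append(word[i] + word[i + 1])
                    i += 2
                else:
                    merged.append(word[i])
                    i += 1
            new_words.append(tuple(merged))
        words = new_words
    return merges, words

merges, words = toy_bpe_train("def add def sub def mul")
print(merges)  # frequent pairs merge first: 'd'+'e', then 'de'+'f'
print(words)   # 'def' becomes a single symbol; rare 'sub' and 'mul' stay split into characters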
In the next code cell, we:

- Use the `tiktoken` library to actually count tokens
- Show common/rare tokenization differences (e.g., "function" = 1 token, "funcxzqtion" = 3-4 tokens)
tiktoken is a fast Byte Pair Encoding (BPE) tokenizer developed by OpenAI for use with their language models. It allows you to convert text into tokens (numerical representations) and vice versa.
import tiktoken

def count_tokens(text):
    """Count tokens using tiktoken (approximates most BPE tokenizers)."""
    enc = tiktoken.get_encoding("cl100k_base")  # GPT-4 encoding
    return len(enc.encode(text))

def explore_bpe_behavior():
    """Show how common vs rare patterns use different token counts."""
    examples = [
        ("Common word", "function"),
        ("Rare word", "funcxzqtion"),
        ("Common code", "def factorial(n):"),
        ("Rare code", "def qzxfactorial(n):"),
    ]

    print("BPE Token Counts: Common vs Rare\n" + "="*50 + "\n")

    for label, text in examples:
        tokens = count_tokens(text)
        print(f"{label:15} | {text:25} | {tokens} tokens")

if __name__ == "__main__":
    explore_bpe_behavior()

BPE Token Counts: Common vs Rare
==================================================
Common word | function | 1 tokens
Rare word | funcxzqtion | 4 tokens
Common code | def factorial(n): | 4 tokens
Rare code | def qzxfactorial(n): | 8 tokens
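We can also decode each token ID back to its text piece to see where the subword boundaries fall. A small sketch using the same cl100k_base encoding (the exact pieces are tokenizer-specific, so treat the output as illustrative):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for text in ["function", "funcxzqtion", "getUserById", "get_user_by_id"]:
    token_ids = enc.encode(text)
    pieces = [enc.decode([tid]) for tid in token_ids]  # decode one token at a time
    print(f"{text:17} -> {pieces}")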
Alternatively, we can read prompt_eval_count from the model’s full response to get the token counts the model actually processed. These counts are larger because they include the model’s prompt-template overhead, but the relative pattern is the same: common patterns like “function” tokenize more efficiently than rare variants like “funcxzqtion”.
def count_tokens(text):
    """Get the actual prompt token count from Ollama."""
    response = call_ollama_full(
        text,
        temperature=0,
        num_predict=1
    )
    # prompt_eval_count reports how many tokens the model evaluated for the
    # prompt (including any prompt-template overhead)
    return response['prompt_eval_count']

def explore_bpe_behavior():
    """Show how common vs rare patterns use different token counts."""
    examples = [
        ("Common word", "function"),
        ("Rare word", "funcxzqtion"),
        ("Common code", "def factorial(n):"),
        ("Rare code", "def qzxfactorial(n):"),
    ]

    print("BPE Token Counts: Common vs Rare\n" + "="*50 + "\n")

    for label, text in examples:
        tokens = count_tokens(text)
        print(f"{label:15} | {text:25} | {tokens} tokens")

if __name__ == "__main__":
    explore_bpe_behavior()

BPE Token Counts: Common vs Rare
==================================================
Common word | function | 30 tokens
Rare word | funcxzqtion | 33 tokens
Common code | def factorial(n): | 33 tokens
Rare code | def qzxfactorial(n): | 37 tokens
Special Tokens and Code Structure¶
Code models use special tokens to mark structural boundaries:
- `<|endoftext|>` - End of document
- `<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>` - Fill-in-the-middle tasks
- `<|python|>`, `<|javascript|>` - Language markers
- `\n`, `\t` - Whitespace (often separate tokens)
Marking these boundaries explicitly is crucial for code completion quality, because the model conditions its output on the current structural context.
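As an illustration, fill-in-the-middle prompts are assembled by wrapping the code around these markers. The exact token strings differ between models, so the `<|fim_*|>` names below are an assumption used only for this sketch:

def build_fim_prompt(prefix, suffix):
    """Assemble a fill-in-the-middle prompt (marker names assumed; they vary by model)."""
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prefix = "def average(nums):\n    total = "
suffix = "\n    return total / len(nums)"
print(build_fim_prompt(prefix, suffix))
# A FIM-trained model would generate the missing middle (e.g., "sum(nums)")
# and then emit an end-of-text token to stop.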
def show_boundary_effects():
    """Demonstrate how structural context changes completion."""
    # Same incomplete code in different structural positions
    incomplete = "result = calculate"

    scenarios = [
        ("Top-level (module scope)",
         f"{incomplete}"),
        ("Inside function body",
         f"def process(x):\n    {incomplete}"),
        ("After if statement",
         f"if data:\n    {incomplete}"),
        ("In class method",
         f"class Processor:\n    def __init__(self, x): self.data = x\n    def run(self):\n        {incomplete}")
    ]

    for label, code in scenarios:
        response = call_ollama(
            f"""Assume `calculate` is a simple arithmetic operation for demonstration.
Complete this Python code with only one line:\n\n{code}""",
            temperature=0.1,
            num_predict=31
        )
        print(f"{label}:")
        print(f" → {response}\n")

show_boundary_effects()

Top-level (module scope):
→ ```python
result = calculate()
```
Inside function body:
→ ```python
def process(x):
    result = calculate(x)
```
After if statement:
→ ```python
result = calculate(data) if data else None
```
In class method:
→ ```python
result = calculate(self.data)
```
Example output patterns shown:
- Top-level: (function call)
- Inside function: (with parameter)
- After if: (action-oriented)
- In class method: (uses `self`)
The model learns that different structural positions (module vs. function vs. class) have different token distribution patterns, leading to contextually appropriate completions.
In the example above, we reuse one incomplete statement across several structural contexts, demonstrating the key insight: position and context really matter!
The model’s completion changes based on structural context because different token patterns are statistically associated with different code structures.
Indentation and Whitespace Tokenization¶
Python’s significant whitespace poses unique challenges. Models must learn that indentation carries semantic meaning.
def test_indentation_understanding():
    """Test if model understands indentation semantics."""
    code_samples = [
        ("Correct indentation", """def greet(name):
    print(f"Hello, {name}")
    return name"""),
        ("Incorrect indentation", """def greet(name):
print(f"Hello, {name}")
    return name"""),
        ("Missing indentation", """def greet(name):
print(f"Hello, {name}")
return name""")
    ]

    print("Indentation Understanding\n" + "="*60 + "\n")

    for label, code in code_samples:
        prompt = f"""Is this Python code correct? Answer yes or no and explain very briefly.
{code}
Answer:"""
        response = call_ollama(prompt, temperature=0.1, num_predict=70)
        print(f"{label}")
        print(f"Code:\n{code}\n")
        print(f"Model says: {response}\n")
        print("-" * 60 + "\n")

if __name__ == "__main__":
    test_indentation_understanding()

Indentation Understanding
============================================================
Correct indentation
Code:
def greet(name):
    print(f"Hello, {name}")
    return name
Model says: Yes.
------------------------------------------------------------
Incorrect indentation
Code:
def greet(name):
print(f"Hello, {name}")
    return name
Model says: No. The `print` statement is not indented correctly. In Python, indentation is crucial for defining the blocks of code. Here's the corrected version:
```python
def greet(name):
    print(f"Hello, {name}")
    return name
```
Now it should work as expected.
------------------------------------------------------------
Missing indentation
Code:
def greet(name):
print(f"Hello, {name}")
return name
Model says: No. The `print` statement is not indented correctly. In Python, indentation is crucial as it defines the blocks of code.
------------------------------------------------------------
Code models learn to associate indentation tokens with control flow and scope, enabling them to generate properly indented code.
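You can see this directly by decoding the tokens of an indented snippet. The sketch below uses cl100k_base as a stand-in tokenizer; in that vocabulary, newlines and runs of leading spaces are typically captured by dedicated whitespace tokens, though the exact split varies by tokenizer:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

snippet = "def greet(name):\n    print(name)\n    return name"
pieces = [enc.decode([tid]) for tid in enc.encode(snippet)]
# Make newlines visible so the whitespace/indentation tokens stand out
print([p.replace("\n", "\\n") for p in pieces])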
Tokenization Efficiency: Token Count Matters¶
The number of tokens affects:
- Context window usage: Fewer tokens = more context fits
- Generation cost: More tokens = higher API costs
- Processing speed: More tokens = slower inference
def compare_token_efficiency():
    """Compare token efficiency of different coding styles."""
    # Use GPT-4 tokenizer (cl100k_base) as representative example
    encoding = tiktoken.get_encoding("cl100k_base")

    implementations = [
        ("Verbose", """def calculate_sum_of_squares(input_numbers):
    total_sum = 0
    for individual_number in input_numbers:
        squared_value = individual_number * individual_number
        total_sum = total_sum + squared_value
    return total_sum"""),
        ("Concise", """def sum_squares(nums):
    return sum(n * n for n in nums)"""),
        ("Mathematical", """def sum_squares(nums):
    return sum(n**2 for n in nums)""")
    ]

    print("Token Efficiency Comparison\n" + "="*60 + "\n")

    for label, code in implementations:
        tokens = encoding.encode(code)
        token_count = len(tokens)
        print(f"{label}: {token_count} tokens")
        print(f"Code:\n{code}\n")
        print(f"Tokens: {tokens[:10]}..." if len(tokens) > 10 else f"Tokens: {tokens}")
        print(f"Efficiency: {len(code) / token_count:.1f} chars/token\n")
        print("-" * 60 + "\n")

if __name__ == "__main__":
    compare_token_efficiency()

Token Efficiency Comparison
============================================================
Verbose: 48 tokens
Code:
def calculate_sum_of_squares(input_numbers):
    total_sum = 0
    for individual_number in input_numbers:
        squared_value = individual_number * individual_number
        total_sum = total_sum + squared_value
    return total_sum
Tokens: [755, 11294, 10370, 3659, 646, 41956, 5498, 34064, 997, 262]...
Efficiency: 4.9 chars/token
------------------------------------------------------------
Concise: 17 tokens
Code:
def sum_squares(nums):
    return sum(n * n for n in nums)
Tokens: [755, 2694, 646, 41956, 21777, 997, 262, 471, 2694, 1471]...
Efficiency: 3.4 chars/token
------------------------------------------------------------
Mathematical: 17 tokens
Code:
def sum_squares(nums):
    return sum(n**2 for n in nums)
Tokens: [755, 2694, 646, 41956, 21777, 997, 262, 471, 2694, 1471]...
Efficiency: 3.4 chars/token
------------------------------------------------------------
```{note}
**Trade-off**: Verbose code may be more readable but consumes more tokens. Concise code is token-efficient but may be less clear.
```
```{note}
Key insight: The verbose version uses **nearly 3x more tokens** (48 vs. 17) for the same functionality, consuming more context window and costing more to process.
```

Cross-Language Tokenization¶
Code models are typically trained on multiple languages. The tokenizer must handle diverse syntax efficiently.
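Before asking the model for its analysis, we can measure how many tokens each equivalent snippet costs. The sketch below reuses cl100k_base as a representative tokenizer; counts will differ for other vocabularies:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

equivalent_code = [
    ("Python", "def add(a, b):\n return a + b"),
    ("JavaScript", "function add(a, b) {\n return a + b;\n}"),
    ("Java", "public int add(int a, int b) {\n return a + b;\n}"),
    ("Rust", "fn add(a: i32, b: i32) -> i32 {\n a + b\n}")
]

for lang, code in equivalent_code:
    print(f"{lang:12} | {len(enc.encode(code)):2} tokens")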
def explore_cross_language_tokenization():
    """Explore how different languages tokenize."""
    equivalent_code = [
        ("Python", "def add(a, b):\n return a + b"),
        ("JavaScript", "function add(a, b) {\n return a + b;\n}"),
        ("Java", "public int add(int a, int b) {\n return a + b;\n}"),
        ("Rust", "fn add(a: i32, b: i32) -> i32 {\n a + b\n}")
    ]

    print("Cross-Language Tokenization\n" + "="*60 + "\n")

    for lang, code in equivalent_code:
        prompt = f"""Which parts of this {lang} code would likely be tokenized as single tokens
vs split into multiple tokens? Answer as briefly (in as few lines) as possible
{code}
Brief analysis:"""
        response = call_ollama(prompt, temperature=0.2, num_predict=200)
        print(f"{lang}")
        print(f"Code:\n{code}\n")
        print(f"Analysis: {response}\n")
        print("-" * 60 + "\n")

if __name__ == "__main__":
    explore_cross_language_tokenization()

Cross-Language Tokenization
============================================================
Python
Code:
def add(a, b):
return a + b
Analysis: In the given Python code:
- `def`, `add`, `(`, `a`, `,`, `b`, `)`, `:` are single tokens.
- `return`, `a`, `+`, `b` are also single tokens.
------------------------------------------------------------
JavaScript
Code:
function add(a, b) {
return a + b;
}
Analysis: - `function`, `add`, `(`, `a`, `,`, `b`, `)`, `{`, `return`, `a`, `+`, `b`, `;`, `}` would be tokenized as single tokens.
- The keywords and symbols are typically treated as individual tokens in JavaScript parsing.
------------------------------------------------------------
Java
Code:
public int add(int a, int b) {
return a + b;
}
Analysis: ```java
public int add(int a, int b) { // Single token: public, int, add, (, int, a, , int, b, ), {, return, a, +, b, ;, }
return a + b; // Single token: return, a, +, b, ;
}
```
------------------------------------------------------------
Rust
Code:
fn add(a: i32, b: i32) -> i32 {
a + b
}
Analysis: ```rust
fn, add, (, a, :, i32, ,, b, :, i32, ), ->, i32, {, a, +, b, }
```
------------------------------------------------------------
Impact on Code Generation Quality¶
Understanding tokenization helps explain common generation issues:
### Issue 1: Variable Name Fragments
def demonstrate_naming_issues():
    """Show how tokenization affects variable naming."""
    prompts = [
        "Generate a Python function with a variable name for storing user authentication tokens",
        "Generate a Python function with a variable name for user auth tokens (use common abbreviation)"
    ]

    print("Variable Naming and Tokenization\n" + "="*60 + "\n")

    for prompt in prompts:
        full_prompt = f"{prompt}. Just show the line with the variable"
        response = call_ollama(full_prompt, temperature=0.5, num_predict=60)
        print(f"Prompt: {prompt}")
        print(f"Generated:\n{response}\n")
        print("-" * 60 + "\n")

if __name__ == "__main__":
    demonstrate_naming_issues()

Variable Naming and Tokenization
============================================================
Prompt: Generate a Python function with a variable name for storing user authentication tokens
Generated:
```python
auth_token = "your_auth_token_here"
```
------------------------------------------------------------
Prompt: Generate a Python function with a variable name for user auth tokens (use common abbreviation)
Generated:
```python
auth_token = "your_auth_token_here"
```
------------------------------------------------------------
Uncommon or very long variable names may fragment into many tokens, leading models to avoid them or truncate them.
Most likely, your coding model will abbreviate “authentication” to “auth” in both cases, regardless of instruction.
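The fragmentation itself is easy to verify. A quick sketch with tiktoken (cl100k_base) compares a common abbreviation against longer or unconventional spellings; the names below are made up for illustration:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

names = ["auth_token", "authentication_token", "usrAuthTknCache"]
for name in names:
    ids = enc.encode(name)
    pieces = [enc.decode([i]) for i in ids]
    print(f"{name:25} | {len(ids)} tokens | {pieces}")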
### Issue 2: Operator Spacing
def demonstrate_spacing_consistency():
    """Show how models handle operator spacing."""
    prompt = """Generate 3 variations of this Python expression with different spacing:
x = a + b * c
Variations:"""

    print("Operator Spacing Consistency\n" + "="*60 + "\n")
    response = call_ollama(prompt, temperature=0.7, num_predict=100)
    print(response)

if __name__ == "__main__":
    demonstrate_spacing_consistency()

Operator Spacing Consistency
============================================================
Sure, here are three variations of the Python expression `x = a + b * c` with different spacing:
1. ```python
x = a + (b * c)
```
2. ```python
x= a+ b*c
```
3. ```python
x = a + b * c
```
Models learn spacing patterns from training data. Inconsistent tokenization of operators can lead to inconsistent spacing in generated code.
One good way to deal with this sort of issue is to enforce consistency with prompts that include examples of the desired style (few-shot prompting), as we’ll see later on.
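For instance, a few-shot prompt that demonstrates the desired spacing tends to pull completions toward that convention. A minimal sketch built on this chapter's `call_ollama` helper (the example lines in the prompt are hypothetical):

def generate_with_style_examples(task):
    """Ask for one line of code while showing the spacing convention we want."""
    prompt = f"""Follow the exact spacing style of these examples (one space around operators):

total = price * quantity
area = width * height

Now, {task}. Respond with exactly one line of Python."""
    return call_ollama(prompt, temperature=0.2, num_predict=30)

print(generate_with_style_examples("compute force as mass times acceleration"))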
### Issue 3: Comment Handling
def explore_comment_tokenization():
    """Explore how comments are tokenized and generated."""
    prompt = """Write a Python function to calculate factorial with detailed comments:"""

    print("Comment Generation\n" + "="*60 + "\n")
    response = call_ollama(prompt, temperature=0.5, num_predict=200)
    print(response)
    print("\n" + "="*60 + "\n")

    # Now ask for code without comments
    prompt2 = """Write a Python function to calculate factorial with NO comments:"""
    response2 = call_ollama(prompt2, temperature=0.5, num_predict=150)
    print("\nWithout comments:")
    print(response2)

if __name__ == "__main__":
    explore_comment_tokenization()
Comment Generation
============================================================
Certainly! Below is a Python function that calculates the factorial of a given number along with detailed comments explaining each part of the code:
```python
def factorial(n):
"""
Calculates the factorial of a non-negative integer n.
Args:
n (int): A non-negative integer whose factorial is to be calculated.
Returns:
int: The factorial of the input number n.
Raises:
ValueError: If the input n is negative, as factorial is not defined for negative numbers.
"""
# Check if the input is a non-negative integer
if not isinstance(n, int) or n < 0:
raise ValueError("Input must be a non-negative integer.")
# Base case: factorial of 0 or 1 is 1
if n == 0 or n == 1:
return 1
# Initialize the result to 1 (factorial starts from 1)
result = 1
============================================================
Without comments:
```python
def factorial(n):
    if n == 0:
        return 1
    else:
        return n * factorial(n-1)
```
Comments are tokenized just like code. Models must learn when to generate comments vs code based on token patterns.
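Because comments are ordinary tokens, they consume context window just like code. A quick comparison with tiktoken (cl100k_base); the snippets and counts are purely illustrative:

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

bare = (
    "def factorial(n):\n"
    "    if n == 0:\n"
    "        return 1\n"
    "    return n * factorial(n - 1)"
)
commented = (
    "def factorial(n):\n"
    "    # Base case: 0! is 1\n"
    "    if n == 0:\n"
    "        return 1\n"
    "    # Recursive case: n * (n-1)!\n"
    "    return n * factorial(n - 1)"
)

print("bare:     ", len(enc.encode(bare)), "tokens")
print("commented:", len(enc.encode(commented)), "tokens")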
Tokenization is the bridge between human-readable code and model-processable representations. By understanding this bridge, you can:
- Write more effective prompts
- Debug unexpected model behavior
- Optimize context window usage
- Predict when models will struggle (rare patterns, uncommon languages)
- Design better coding conventions for AI-assisted development
Summary¶
This chapter establishes tokenization as a fundamental, performance-critical process that dictates how a Large Language Model “sees,” “thinks about,” and “generates” source code. It is not a trivial preprocessing step but rather the essential bridge connecting the symbolic, human-readable world of programming to the numerical, model-processable world of token IDs.
The chapter explains that modern LLMs for code, like their natural language counterparts, overwhelmingly rely on subword tokenization algorithms. The most prominent of these, Byte Pair Encoding (BPE), is explored. The BPE process is conceptually straightforward: it begins with a base vocabulary of individual characters and iteratively merges the most frequently occurring adjacent pair of tokens in its training data into a new, single token. This process is repeated until a predefined vocabulary size (e.g., 50,000 or 100,000+ tokens) is reached. The result is a highly efficient vocabulary where common sequences—such as programming keywords (def, import, function) or standard library names—are represented as single, efficient tokens. Conversely, rare or novel sequences—like uncommon variable names or project-specific API identifiers—are fragmented into multiple, less efficient subword tokens.
A central theme of the chapter is the tension between these frequency-based tokenization models and the unique, precise properties of source code. Unlike natural language, code is defined by its rigid syntax (e.g., = vs. ==), semantically meaningful whitespace (e.g., Python’s indentation), strict case sensitivity (userName vs. USERNAME), and reliance on special operators (->, ::). Poor tokenization of these elements can lead to a fundamental loss of structural and semantic information before the model ever processes the input.
The practical consequences of this tension are significant and manifest in two primary areas:
- Efficiency and Cost: Inefficient tokenization (fragmentation) inflates the token count, consuming the model’s finite context window more rapidly. This directly translates to higher API costs, slower inference speeds, and a reduced “token budget” for a developer’s prompts and any necessary retrieved context.
- Generation Quality: Tokenization directly impacts the quality of the generated code. The chapter highlights how models may generate code with inconsistent operator spacing or favor shorter, more common variable names. This is often not a failure of the model’s “reasoning” but a direct consequence of its tokenizer having fragmented these patterns, making them statistically easier to learn and reproduce.
Finally, the chapter explores how the model’s “view” of code is explicitly structured by special tokens. These include structural markers (`<|endoftext|>`), tokens for “fill-in-the-middle” tasks (`<|fim_prefix|>`, `<|fim_suffix|>`), and language-specific markers (`<|python|>`, `<|javascript|>`) that prime the model for a specific syntax.
By understanding this tokenization layer, developers and engineers gain the ability to write more token-efficient prompts, debug non-obvious model failures, and better predict when a model will struggle or succeed, ultimately leading to a more effective application of LLMs in the software engineering lifecycle.
Glossary of Key Terms¶
- Byte Pair Encoding (BPE): A foundational subword tokenization algorithm. BPE begins with a base vocabulary of individual characters (or bytes) and iteratively learns a set of merge rules. In each step, it finds the most frequently occurring adjacent pair of tokens in the training corpus and merges them into a new, single token. This process is repeated until a desired vocabulary size is reached.
- Context Window: The finite and fixed number of tokens that an LLM can process at one time. This limit includes the input prompt, any retrieved context, and the generated output. Efficient tokenization is critical for maximizing the amount of information that can fit within this window.
- SentencePiece: A tokenization algorithm and software library that treats all input text, including whitespace, as a raw unicode sequence. Its key innovation is encoding whitespace as a special character (e.g., `▁`), which allows it to tokenize and de-tokenize text reversibly without relying on language-specific pre-tokenization rules. It can be trained to use either BPE or Unigram models.
- Special Tokens: A set of tokens reserved in the vocabulary to represent metadata, structural boundaries, or control signals rather than literal text. Examples from this chapter include:
  - `<|endoftext|>`: A token that marks the end of a document or logical text segment.
  - `<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`: Tokens used in “fill-in-the-middle” (FIM) tasks, allowing the model to be trained to insert code between a given prefix and suffix.
  - `<|python|>`, `<|javascript|>`: Language-specific markers used to prime the model to generate code in a particular language.
  - `\n`, `\t`: Tokens that explicitly represent whitespace characters (newlines and tabs), which is critical for semantically meaningful indentation in languages like Python.
- Subword Tokenization: The dominant tokenization paradigm for modern LLMs. It serves as a compromise between word-level tokenization (which results in a massive vocabulary and fails to handle unknown words) and character-level tokenization (which has a small vocabulary but results in very long, inefficient token sequences). Subword algorithms break words into commonly occurring morphemes or “subwords”.
- Tokenization: The process of converting a sequence of raw text (e.g., source code or natural language) into a sequence of discrete units called tokens.
- Tokens: The atomic elements that an LLM operates on. After tokenization, each token is mapped to a unique integer ID from a fixed vocabulary. The model reads, processes, and generates sequences of these token IDs.
- Unigram: A subword tokenization algorithm that operates in reverse of BPE. It starts with a very large vocabulary (e.g., all words and common substrings) and prunes tokens: it iteratively removes a fraction (e.g., 10-20%) of the tokens that least affect the overall likelihood of the training data according to a unigram language model, repeating until the target vocabulary size is reached.
- WordPiece: A subword tokenization algorithm used by models like BERT. It is similar to BPE, but instead of merging the most frequent pair, it merges the pair that maximizes the likelihood of the training data once merged. It essentially evaluates the “loss” of a merge to ensure it is statistically valuable.