What is Tokenization in AI? How LLMs Read Your Text
Tokenization is the first step in how AI understands your text. Learn why LLMs chop words into pieces and how this affects everything from pricing to model behavior.

What is Tokenization in AI?
You type a question into ChatGPT. Before the model can even start thinking about your words, it faces a fundamental problem: computers don't understand text. They only understand numbers.
Tokenization is the bridge between human language and machine computation. It's the process of breaking your text into small pieces called tokens and assigning each piece a number.
Without tokenization, AI language models simply cannot function.
The Problem: Why Not Just Use Whole Words?
At first glance, the obvious solution seems simple: give every word its own number.
- "cat" = 1
- "dog" = 2
- "the" = 3
- And so on...
But this approach falls apart quickly:
- Vocabulary explosion: English has over 170,000 words in current use. Add names, technical terms, typos, slang, and other languages, and you need millions of entries.
- Unknown words: What happens when someone types "ChatGPTify" or "unfriendliest"? A word-based system has never seen these and cannot process them.
- Compound languages: German words like "Donaudampfschifffahrtsgesellschaft" (Danube steamship company) are single words that would each need their own entry.
- Inefficiency: Storing millions of unique word vectors wastes memory and slows down the model.
The solution? Subword tokenization.
How Modern Tokenization Works
Instead of whole words, modern LLMs break text into subword units. These are pieces that are bigger than individual letters but often smaller than complete words.
The Core Idea
The tokenizer learns which character combinations appear frequently in text and groups them together. Common words stay intact. Rare words get split into recognizable pieces.
For example:
- "unhappiness" might become:
["un", "happiness"]or["un", "hap", "pi", "ness"] - "ChatGPT" might become:
["Chat", "G", "PT"] - "the" stays as:
["the"]
This approach gives you the best of both worlds: a manageable vocabulary size (typically 32,000 to 100,000 tokens) that can represent any possible text.
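You can see this splitting for yourself with OpenAI's tiktoken library (assuming it is installed via `pip install tiktoken`). The exact pieces depend on the tokenizer version, so treat the output as illustrative:

```python
import tiktoken

# GPT-4's tokenizer; other models ship their own vocabularies.
enc = tiktoken.encoding_for_model("gpt-4")

for word in ["the", "unhappiness", "ChatGPTify"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {pieces}")  # common words stay whole, rare ones split
```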
Byte-Pair Encoding (BPE): The Most Popular Method
Most modern LLMs use a technique called Byte-Pair Encoding (BPE). Here's how it works:
Training Phase (Done Once by Model Creators)
1. Start with characters: Begin with every individual character as its own token.
2. Count pairs: Look at a massive text corpus and count how often each pair of adjacent tokens appears.
3. Merge the most common pair: Take the most frequent pair and combine them into a new token.
4. Repeat: Do this thousands of times until you reach your target vocabulary size.
A Simple Example
Imagine training on the text: "low lower lowest"
Starting vocabulary: [l, o, w, e, r, s, t, (space)]
Iteration 1: "lo" appears most often. Merge into new token "lo". Iteration 2: "low" appears frequently. Merge "lo" + "w" into "low". Iteration 3: "er" is common. Merge into "er". Iteration 4: "est" is common. Merge "e" + "st" into "est".
After many iterations, you end up with tokens like:
- "low" (very common, stays whole)
- "er" (common suffix)
- "est" (common suffix)
Now "lowest" tokenizes as ["low", "est"] instead of 6 individual characters.
Which Models Use BPE?
| Model Family | Tokenizer Type |
|---|---|
| GPT-2, GPT-3, GPT-4 | Byte-level BPE |
| Llama, Llama 2 | BPE (via SentencePiece) |
| Llama 3 | Byte-level BPE (tiktoken-based) |
| Claude | Custom (BPE-based) |
| Mistral | BPE |
Other Tokenization Methods
WordPiece (Used by BERT)
Similar to BPE, but instead of merging the most frequent pairs, WordPiece merges pairs that maximize the likelihood of the training data. BERT and its variants use this method.
WordPiece marks subwords with "##" to show they continue a previous token:
- "unhappiness" becomes
["un", "##happiness"]
SentencePiece (Used by T5, ALBERT)
Treats the input as a raw stream of characters, including spaces. This makes it language-agnostic and useful for multilingual models. A space becomes a special character "▁" at the start of words.
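A quick way to see the "▁" marker, assuming transformers and sentencepiece are installed:

```python
from transformers import AutoTokenizer

# T5 uses a SentencePiece tokenizer; requires `pip install transformers sentencepiece`.
tok = AutoTokenizer.from_pretrained("t5-small")
print(tok.tokenize("Hello, how are you?"))  # pieces that start a word carry a "▁" prefix
```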
Unigram (Used alongside SentencePiece)
Starts with a large vocabulary and removes tokens that contribute least to the overall probability of the text. It's the opposite approach from BPE.
What Happens When You Send Text to an LLM
Let's trace through a real example.
Your input: "The quick brown fox"
Step 1: Tokenization. The tokenizer breaks this into tokens:
["The", " quick", " brown", " fox"]
(Note: spaces are often attached to the following word.)
Step 2: Token IDs. Each token maps to a number in the vocabulary:
[464, 2068, 7586, 21831]
Step 3: Into the Model. These numbers go through the embedding layer and into the transformer. The model never sees your original text, only these IDs.
Step 4: Output. The model generates output token IDs, which the tokenizer converts back to text.
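To demystify step 3 a little: the embedding layer is essentially a big lookup table, where each token ID selects one row of a matrix. Here's a toy sketch; the sizes and random values are made up, since a real model learns this matrix during training:

```python
import numpy as np

# Toy sizes and random values, not a real model's parameters.
vocab_size, embed_dim = 100_000, 8
embedding_matrix = np.random.randn(vocab_size, embed_dim)

token_ids = [464, 2068, 7586, 21831]   # the example IDs from step 2
vectors = embedding_matrix[token_ids]  # each ID simply selects a row
print(vectors.shape)                   # (4, 8): four tokens, eight dimensions each
```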
Why Tokenization Matters for You
1. It Affects Pricing
AI APIs charge per token. Understanding tokenization helps you estimate costs:
| Text | Approximate Tokens |
|---|---|
| "Hello" | 1 token |
| "Hello, world!" | 3-4 tokens |
| 1 page of text | ~500 tokens |
| Average email | 100-300 tokens |
Rule of thumb: 1 token ≈ 4 characters or 0.75 words in English.
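If you want a number instead of a rule of thumb, count tokens with tiktoken and multiply by your provider's current rate. The price below is a placeholder, not a real quote; check your provider's pricing page:

```python
import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.01  # placeholder USD value, not an actual rate

enc = tiktoken.encoding_for_model("gpt-4")
prompt = "Summarize the following meeting notes: ..."
n_tokens = len(enc.encode(prompt))
print(f"{n_tokens} tokens ≈ ${n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS:.4f}")
```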
2. It Affects Context Limits
When a model has a "128K context window," that means 128,000 tokens, not words. Your actual word limit is roughly 75% of that.
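One practical consequence: if you budget context by tokens rather than words, you can trim input precisely. A minimal sketch, assuming tiktoken is available:

```python
import tiktoken

def truncate_to_token_budget(text: str, budget: int, model: str = "gpt-4") -> str:
    """Keep only the first `budget` tokens of `text` (a blunt but predictable cutoff)."""
    enc = tiktoken.encoding_for_model(model)
    return enc.decode(enc.encode(text)[:budget])

print(truncate_to_token_budget("a very long document " * 1000, budget=50))
```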
3. It Explains Weird Model Behavior
Ever wonder why AI models:
- Struggle with counting letters? "How many r's in strawberry?" The model sees ["str", "aw", "berry"] and the letter boundaries are hidden (see the sketch after this list).
- Have trouble with simple arithmetic? Numbers tokenize inconsistently. "1234" might be one token, while "12345" becomes two.
- Handle code differently than prose? Code has different token patterns. Variable names and syntax get split in unexpected ways.
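You can check these claims yourself with tiktoken; the exact splits depend on the tokenizer version, so treat the output as illustrative:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# The model receives these chunks, not individual letters or digits.
print([enc.decode([t]) for t in enc.encode("strawberry")])
print([enc.decode([t]) for t in enc.encode("1234 12345")])
```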
4. It Varies by Language
Tokenizers trained primarily on English are less efficient for other languages:
| Language | Tokens for "Hello, how are you?" equivalent |
|---|---|
| English | ~6 tokens |
| Chinese | ~10-15 tokens |
| Japanese | ~12-18 tokens |
| Arabic | ~8-12 tokens |
This means non-English users often pay more and use more context for the same content.
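You can measure this yourself with tiktoken. The sentences below are rough equivalents of "Hello, how are you?", and exact counts depend on the tokenizer, so the table above should be read as ballpark figures:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
samples = {
    "English": "Hello, how are you?",
    "Chinese": "你好，你好吗？",
    "Japanese": "こんにちは、お元気ですか？",
    "Arabic": "مرحبا، كيف حالك؟",
}
for language, text in samples.items():
    print(f"{language}: {len(enc.encode(text))} tokens")
```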
Tokenization in Practice: Try It Yourself
Most AI providers offer tokenizer tools:
- OpenAI: The tiktoken library lets you see exactly how GPT models tokenize text.
- Hugging Face: Every model page includes a tokenizer you can test.
- Anthropic: Claude's tokenization is similar to GPT but with some differences.
Here's a quick Python example using tiktoken:
```python
import tiktoken

# Load the GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

text = "Tokenization is fascinating!"
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")

# Decode back to see the pieces
for token in tokens:
    print(f"  {token} -> '{enc.decode([token])}'")
```

Output:

```
Text: Tokenization is fascinating!
Tokens: [3404, 2065, 374, 27387, 0]
Token count: 5
  3404 -> 'Token'
  2065 -> 'ization'
  374 -> ' is'
  27387 -> ' fascinating'
  0 -> '!'
```
Key Takeaways
| Concept | What to Remember |
|---|---|
| What is tokenization? | Breaking text into smaller pieces (tokens) that the model can process |
| Why subwords? | Balances vocabulary size with the ability to handle any text |
| BPE | Most common method, used by GPT and Llama families |
| Token ≈ 0.75 words | Rule of thumb for estimating token counts |
| Why it matters | Affects pricing, context limits, and model behavior |
Tokenization might seem like a mundane preprocessing step, but it's the foundation that makes language models possible. Every word you type passes through this invisible translation layer before AI can understand it.
What's Next?
Now that you understand how text becomes tokens, the next step is learning how those tokens become meaningful representations. Check out our article on embeddings to see how token IDs transform into rich numerical vectors that capture meaning.


