What is Tokenization in AI? How LLMs Read Your Text
Tokenization is the first step in how AI understands your text. Learn why LLMs chop words into pieces and how this affects everything from pricing to model behavior.

What is Tokenization in AI?
You type a question into ChatGPT. Before the model can even start thinking about your words, it faces a fundamental problem: computers don't understand text. They only understand numbers.
Tokenization is the bridge between human language and machine computation. It's the process of breaking your text into small pieces called tokens and assigning each piece a number.
Without tokenization, AI language models simply cannot function.
The Problem: Why Not Just Use Whole Words?
At first glance, the obvious solution seems simple: give every word its own number.
- "cat" = 1
- "dog" = 2
- "the" = 3
- And so on...
But this approach falls apart quickly:
- Vocabulary explosion: English has over 170,000 words in current use. Add names, technical terms, typos, slang, and other languages, and you need millions of entries.
- Unknown words: What happens when someone types "ChatGPTify" or "unfriendliest"? A word-based system has never seen these and cannot process them.
- Compound languages: German words like "Donaudampfschifffahrtsgesellschaft" (Danube steamship company) are single words that would each need their own entry.
- Inefficiency: Storing millions of unique word vectors wastes memory and slows down the model.
The solution? Subword tokenization.
How Modern Tokenization Works
Instead of whole words, modern LLMs break text into subword units. These are pieces that are bigger than individual letters but often smaller than complete words.
The Core Idea
The tokenizer learns which character combinations appear frequently in text and groups them together. Common words stay intact. Rare words get split into recognizable pieces.
For example:
- "unhappiness" might become:
["un", "happiness"]or["un", "hap", "pi", "ness"] - "ChatGPT" might become:
["Chat", "G", "PT"] - "the" stays as:
["the"]
This approach gives you the best of both worlds: a manageable vocabulary size (typically 32,000 to 100,000 tokens) that can represent any possible text.
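You can see this splitting for yourself with OpenAI's tiktoken library (assuming it is installed via `pip install tiktoken`). The exact pieces depend on the tokenizer version, so treat the output as illustrative:

```python
import tiktoken

# GPT-4's tokenizer; other models ship their own vocabularies.
enc = tiktoken.encoding_for_model("gpt-4")

for word in ["the", "unhappiness", "ChatGPTify"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r} -> {pieces}")  # common words stay whole, rare ones split
```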
Byte-Pair Encoding (BPE): The Most Popular Method
Most modern LLMs use a technique called Byte-Pair Encoding (BPE). Here's how it works:
Training Phase (Done Once by Model Creators)
1. Start with characters: Begin with every individual character as its own token.
2. Count pairs: Look at a massive text corpus and count how often each pair of adjacent tokens appears.
3. Merge the most common pair: Take the most frequent pair and combine them into a new token.
4. Repeat: Do this thousands of times until you reach your target vocabulary size.
A Simple Example
Imagine training on the text: "low lower lowest"
Starting vocabulary: [l, o, w, e, r, s, t, (space)]
Iteration 1: "lo" appears most often. Merge into new token "lo". Iteration 2: "low" appears frequently. Merge "lo" + "w" into "low". Iteration 3: "er" is common. Merge into "er". Iteration 4: "est" is common. Merge "e" + "st" into "est".
After many iterations, you end up with tokens like:
- "low" (very common, stays whole)
- "er" (common suffix)
- "est" (common suffix)
Now "lowest" tokenizes as ["low", "est"] instead of 6 individual characters.
Which Models Use BPE?
| Model Family | Tokenizer Type |
|---|---|
| GPT-2, GPT-3, GPT-4 | Byte-level BPE |
| Llama, Llama 2 | BPE (via SentencePiece) |
| Llama 3 | Byte-level BPE (tiktoken-based) |
| Claude | Custom (BPE-based) |
| Mistral | BPE |
Other Tokenization Methods
WordPiece (Used by BERT)
Similar to BPE, but instead of merging the most frequent pairs, WordPiece merges pairs that maximize the likelihood of the training data. BERT and its variants use this method.
WordPiece marks subwords with "##" to show they continue a previous token:
- "unhappiness" becomes
["un", "##happiness"]
SentencePiece (Used by T5, ALBERT)
Treats the input as a raw stream of characters, including spaces. This makes it language-agnostic and useful for multilingual models. A space becomes a special character "▁" at the start of words.
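A quick way to see the "▁" marker, assuming transformers and sentencepiece are installed:

```python
from transformers import AutoTokenizer

# T5 uses a SentencePiece tokenizer; requires `pip install transformers sentencepiece`.
tok = AutoTokenizer.from_pretrained("t5-small")
print(tok.tokenize("Hello, how are you?"))  # pieces that start a word carry a "▁" prefix
```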
Unigram (Used alongside SentencePiece)
Starts with a large vocabulary and removes tokens that contribute least to the overall probability of the text. It's the opposite approach from BPE.
What Happens When You Send Text to an LLM
Let's trace through a real example.
Your input: "The quick brown fox"
Step 1: Tokenization. The tokenizer breaks this into tokens:
["The", " quick", " brown", " fox"]
(Note: spaces are often attached to the following word.)
Step 2: Token IDs. Each token maps to a number in the vocabulary:
[464, 2068, 7586, 21831]
Step 3: Into the Model. These numbers go through the embedding layer and into the transformer. The model never sees your original text, only these IDs.
Step 4: Output. The model generates output token IDs, which the tokenizer converts back to text.
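To demystify step 3 a little: the embedding layer is essentially a big lookup table, where each token ID selects one row of a matrix. Here's a toy sketch; the sizes and random values are made up, since a real model learns this matrix during training:

```python
import numpy as np

# Toy sizes and random values, not a real model's parameters.
vocab_size, embed_dim = 100_000, 8
embedding_matrix = np.random.randn(vocab_size, embed_dim)

token_ids = [464, 2068, 7586, 21831]   # the example IDs from step 2
vectors = embedding_matrix[token_ids]  # each ID simply selects a row
print(vectors.shape)                   # (4, 8): four tokens, eight dimensions each
```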
Why Tokenization Matters for You
1. It Affects Pricing
AI APIs charge per token. Understanding tokenization helps you estimate costs:
| Text | Approximate Tokens |
|---|---|
| "Hello" | 1 token |
| "Hello, world!" | 3-4 tokens |
| 1 page of text | ~500 tokens |
| Average email | 100-300 tokens |
Rule of thumb: 1 token ≈ 4 characters or 0.75 words in English.
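If you want a number instead of a rule of thumb, count tokens with tiktoken and multiply by your provider's current rate. The price below is a placeholder, not a real quote; check your provider's pricing page:

```python
import tiktoken

PRICE_PER_1K_INPUT_TOKENS = 0.01  # placeholder USD value, not an actual rate

enc = tiktoken.encoding_for_model("gpt-4")
prompt = "Summarize the following meeting notes: ..."
n_tokens = len(enc.encode(prompt))
print(f"{n_tokens} tokens ≈ ${n_tokens / 1000 * PRICE_PER_1K_INPUT_TOKENS:.4f}")
```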
2. It Affects Context Limits
When a model has a "128K context window," that means 128,000 tokens, not words. Your actual word limit is roughly 75% of that.
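One practical consequence: if you budget context by tokens rather than words, you can trim input precisely. A minimal sketch, assuming tiktoken is available:

```python
import tiktoken

def truncate_to_token_budget(text: str, budget: int, model: str = "gpt-4") -> str:
    """Keep only the first `budget` tokens of `text` (a blunt but predictable cutoff)."""
    enc = tiktoken.encoding_for_model(model)
    return enc.decode(enc.encode(text)[:budget])

print(truncate_to_token_budget("a very long document " * 1000, budget=50))
```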
3. It Explains Weird Model Behavior
Ever wonder why AI models:
- Struggle with counting letters? "How many r's in strawberry?" The model sees ["str", "aw", "berry"] and the letter boundaries are hidden (see the sketch after this list).
- Have trouble with simple arithmetic? Numbers tokenize inconsistently. "1234" might be one token, while "12345" becomes two.
- Handle code differently than prose? Code has different token patterns. Variable names and syntax get split in unexpected ways.
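You can check these claims yourself with tiktoken; the exact splits depend on the tokenizer version, so treat the output as illustrative:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")

# The model receives these chunks, not individual letters or digits.
print([enc.decode([t]) for t in enc.encode("strawberry")])
print([enc.decode([t]) for t in enc.encode("1234 12345")])
```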
4. It Varies by Language
Tokenizers trained primarily on English are less efficient for other languages:
| Language | Tokens for "Hello, how are you?" equivalent |
|---|---|
| English | ~6 tokens |
| Chinese | ~10-15 tokens |
| Japanese | ~12-18 tokens |
| Arabic | ~8-12 tokens |
This means non-English users often pay more and use more context for the same content.
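You can measure this yourself with tiktoken. The sentences below are rough equivalents of "Hello, how are you?", and exact counts depend on the tokenizer, so the table above should be read as ballpark figures:

```python
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
samples = {
    "English": "Hello, how are you?",
    "Chinese": "你好，你好吗？",
    "Japanese": "こんにちは、お元気ですか？",
    "Arabic": "مرحبا، كيف حالك؟",
}
for language, text in samples.items():
    print(f"{language}: {len(enc.encode(text))} tokens")
```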
Tokenization in Practice: Try It Yourself
Most AI providers offer tokenizer tools:
- OpenAI: The tiktoken library lets you see exactly how GPT models tokenize text.
- Hugging Face: Every model page includes a tokenizer you can test.
- Anthropic: Claude's tokenization is similar to GPT but with some differences.
Here's a quick Python example using tiktoken:
```python
import tiktoken

# Load the GPT-4 tokenizer
enc = tiktoken.encoding_for_model("gpt-4")

text = "Tokenization is fascinating!"
tokens = enc.encode(text)

print(f"Text: {text}")
print(f"Tokens: {tokens}")
print(f"Token count: {len(tokens)}")

# Decode back to see the pieces
for token in tokens:
    print(f"  {token} -> '{enc.decode([token])}'")
```

Output:

```
Text: Tokenization is fascinating!
Tokens: [3404, 2065, 374, 27387, 0]
Token count: 5
  3404 -> 'Token'
  2065 -> 'ization'
  374 -> ' is'
  27387 -> ' fascinating'
  0 -> '!'
```
Key Takeaways
| Concept | What to Remember |
|---|---|
| What is tokenization? | Breaking text into smaller pieces (tokens) that the model can process |
| Why subwords? | Balances vocabulary size with the ability to handle any text |
| BPE | Most common method, used by GPT and Llama families |
| Token ≈ 0.75 words | Rule of thumb for estimating token counts |
| Why it matters | Affects pricing, context limits, and model behavior |
Tokenization might seem like a mundane preprocessing step, but it's the foundation that makes language models possible. Every word you type passes through this invisible translation layer before AI can understand it.
What's Next?
Now that you understand how text becomes tokens, the next step is learning how those tokens become meaningful representations. Check out our article on embeddings to see how token IDs transform into rich numerical vectors that capture meaning.


