
From Classifier to Creator: The Generative Leap

How a simple idea — “predict the next thing” — powers everything from ChatGPT to image generators.

In our previous article on neural networks, the model’s job was straightforward: look at a complete input (say, a photo of a cat) and output one final answer — “cat” or “dog.”

A generative model does something far more ambitious.

Its only job is breathtakingly simple:

“Given a sequence of things, predict the very next thing.”

That’s it.

For a Large Language Model (LLM), this means: “Given these words so far, what’s the most likely next word?”

The magic of writing an entire paragraph, story, or even a poem isn’t one giant creative act. It’s thousands of tiny predictions made one after another in a loop:

  1. You type: The first person on the Moon was
  2. The model predicts the single most likely next word → Neil
  3. It appends that word and asks again: now what? → Armstrong
  4. Repeat hundreds or thousands of times

This iterative “next-token prediction” is how all modern generative AI works — text, images, music, code, you name it.
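Here is a minimal sketch of that loop in Python. The `predict_next_word` function is a hypothetical stand-in for the real model; in an actual LLM it would be a neural network returning a probability for every token in its vocabulary.

```python
# Minimal sketch of the generation loop. The "model" here is a toy
# hard-coded lookup so the example actually runs; a real LLM replaces
# predict_next_word with a neural network over token probabilities.

def predict_next_word(words):
    # Toy lookup: picks a continuation based only on the last word seen.
    continuations = {
        "was": "Neil",
        "Neil": "Armstrong",
        "Armstrong": ".",
    }
    return continuations.get(words[-1], ".")

prompt = "The first person on the Moon was".split()
for _ in range(100):                        # real models loop for thousands of steps
    next_word = predict_next_word(prompt)   # 1. predict the most likely next word
    prompt.append(next_word)                # 2. append it and ask again
    if next_word == ".":                    # 3. stop at an end marker
        break

print(" ".join(prompt))  # -> The first person on the Moon was Neil Armstrong .
```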

So the real challenge isn’t generation itself. It’s making really good predictions.

To predict the next word accurately, the model must deeply understand everything that came before. That’s where older architectures failed spectacularly… and where the Transformer changed everything.

The Old Way: Why RNNs Couldn’t Remember

Before 2017, the dominant way to process sequences was the Recurrent Neural Network (RNN). The idea was intuitive: read one word at a time and keep a running “memory” (called the hidden state) of what you’ve seen.
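To make that "running memory" concrete, here is a minimal sketch of an RNN step in NumPy. The weight matrices are random placeholders standing in for values a trained network would learn.

```python
import numpy as np

rng = np.random.default_rng(0)
W_h = rng.normal(size=(16, 16)) * 0.1   # memory -> memory weights (illustrative)
W_x = rng.normal(size=(16, 8)) * 0.1    # input  -> memory weights (illustrative)

def rnn_step(h, x):
    # The new hidden state mixes the previous memory (h) with the current word (x).
    return np.tanh(W_h @ h + W_x @ x)

h = np.zeros(16)                         # empty memory before the first word
sentence = rng.normal(size=(5, 8))       # five word vectors, read one at a time
for x in sentence:
    h = rnn_step(h, x)                   # every word updates the same running memory
```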

It worked great for short sentences. But two fatal problems appeared with longer text:

Problem 1: The Vanishing Gradient (The Memory That Faded Away)

When training an RNN, error signals have to travel backward through time to update the network. At every step backward, that signal gets multiplied by factors (weights and activation derivatives) that are usually smaller than 1.

Result? The signal shrinks exponentially.

After just 20–30 steps, it’s effectively zero.
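The shrinkage is easy to see with a quick back-of-the-envelope calculation; the 0.9 below is just an illustrative stand-in for a typical per-step multiplier.

```python
# How fast an error signal fades when it is multiplied by ~0.9 at every step back.
factor = 0.9
for steps in (10, 30, 100, 300):
    print(steps, factor ** steps)
# 10   0.348...
# 30   0.042...
# 100  0.0000266...
# 300  ~1.8e-14  -> effectively zero long before it reaches the start of the text
```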

Example: "I grew up in France… [300 words later] …therefore I speak French."

If the model wrongly predicts “English” instead of “French,” the error signal has to travel 300 steps back to the word “France.” By the time it gets there, the signal is gone. The model never learns the long-range connection.

This is the classic long-range dependency problem.

Problem 2: The Sequential Bottleneck (Death by Slowness)

RNNs process one word at a time. You can’t compute word #50 until word #49 is done.

Modern GPUs hate that. They’re built for massive parallelism, but RNNs force them to sit idle most of the time. Training on long sequences became painfully slow.
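The contrast is easiest to see in code: an RNN has to walk the sequence one step at a time, while attention-style models cover every position with a single batched matrix product. The shapes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
seq = rng.normal(size=(512, 64))        # 512 word vectors of size 64
W = rng.normal(size=(64, 64)) * 0.05

# RNN-style: position 50 cannot start until position 49 has finished.
h = np.zeros(64)
for x in seq:
    h = np.tanh(W @ h + x)              # strictly one step at a time

# Transformer-style: one matrix product touches every position at once,
# which is exactly the kind of work GPUs are built to parallelise.
all_positions = np.tanh(seq @ W)
```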

Even fancier RNNs like LSTMs and GRUs helped with vanishing gradients a bit, but they still couldn’t escape the sequential bottleneck.

The world needed a new architecture.

The Transformer Revolution (2017)

In the now-famous paper “Attention Is All You Need”, Vaswani et al. threw out recurrence entirely.

Core idea: Instead of reading sequentially, let the model look at every word in the input at the same time and figure out which ones matter most for each prediction.

This is made possible by the Attention mechanism — the true hero of modern AI.

How Attention Actually Works (Simple Analogy)

Imagine you’re trying to understand the pronoun “it” in:

The robot picked up the ball because it was heavy.

Each word sends out three things:

  • Query: “What am I looking for?”
  • Key: “Here’s what I’m about” (noun, verb, object, etc.)
  • Value: “Here’s my actual meaning”

The model compares the Query of “it” against the Keys of every other word. “ball” gets a very high attention score. “robot” gets a low one.

Then it blends the Values accordingly — so “it” becomes richly connected to “ball,” not “robot.”

This happens simultaneously for every single word in the input. No sequential processing. Perfect for GPUs.
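Here is a minimal scaled dot-product attention sketch in NumPy, following the recipe above. The projection matrices are random placeholders standing in for learned weights, and the tokens are simple word-level stand-ins.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
n_words, d = 10, 32                      # "The robot picked up the ball because it was heavy" (illustrative)
embeddings = rng.normal(size=(n_words, d))

W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

Q = embeddings @ W_q                     # "what am I looking for?"
K = embeddings @ W_k                     # "here's what I'm about"
V = embeddings @ W_v                     # "here's my actual meaning"

scores = Q @ K.T / np.sqrt(d)            # every word compared against every other word
weights = softmax(scores, axis=-1)       # attention scores sum to 1 per word
output = weights @ V                     # each word becomes a blend of the Values it attends to

print(weights.shape, output.shape)       # (10, 10) (10, 32)
```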

But Doesn’t Attention Look at the Entire Vocabulary?

No — huge misconception!

During inference (when you chat with the model), Attention only looks at the tokens currently inside the model’s context window — never the entire training dataset.

The Librarian Analogy

Training phase = A genius librarian who spent years reading every book in the world’s biggest library. They didn’t memorize every sentence; they internalized patterns, facts, grammar, and relationships into billions of neural weights.

Inference phase = You walk up and ask a question. The librarian doesn’t run back into the stacks. All the knowledge is already in their head. They only need the tiny notepad containing your current conversation — that’s the context window.

Context Window in Practice

  • Early GPT-3: ~2,000–4,000 tokens
  • GPT-4 Turbo: 128k tokens (~100 pages of text)
  • Gemini 2.5 Pro: 1 million tokens, with 2 million expected in Q3 2025

Everything outside the current context window is completely invisible to the model during generation. That’s why long chats eventually “forget” what was said 50,000 words ago.
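A sketch of what "invisible outside the window" means in practice: before each prediction, anything beyond the most recent N tokens is simply dropped. The window size below is a toy value.

```python
CONTEXT_WINDOW = 8   # toy size; real models allow thousands to millions of tokens

conversation = [f"token{i}" for i in range(20)]    # a chat that has grown to 20 tokens

visible = conversation[-CONTEXT_WINDOW:]           # only the most recent tokens reach the model
print(visible)   # ['token12', ..., 'token19'] -- everything earlier is simply gone
```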

Step-by-Step Example

You type: The cat sat on the

  1. Tokens loaded: [The, cat, sat, on, the] (5 tokens)
  2. Model predicts next token → mat (using Attention over just these 5 tokens)
  3. Appends it → [The, cat, sat, on, the, mat]
  4. Repeats for the next token, and the next…

That loop, powered by full attention over the sliding context window, is literally how every word you see from ChatGPT is born.
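To see the same loop run against a real (small) model, here is a sketch using the Hugging Face transformers library, assuming it, PyTorch, and the gpt2 checkpoint are available; `generate()` runs the predict-append-repeat loop internally.

```python
# Sketch of the loop above with a real model, assuming the Hugging Face
# `transformers` library and PyTorch are installed.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("The cat sat on the", return_tensors="pt")

# generate() predicts a token, appends it, and repeats, attending over
# the tokens accumulated so far at every step.
output_ids = model.generate(**inputs, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```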

The Same Trick Works for Images Too

Models like Stable Diffusion, DALL·E 3, Midjourney, and Flux rely on attention (and, increasingly, full Transformer backbones) under the hood. They just apply the same iterative "predict the next thing" idea to denoising steps (noise → image) instead of words.

Your text prompt gets turned into embeddings, a Transformer pays attention across them, and that guides the denoising process until a coherent image appears.

Conclusion

Generative AI feels magical, but it boils down to one profoundly powerful trick:

Predict the next thing, over and over, really really well.

The Transformer + Attention made that prediction astonishingly good by letting every piece of the input talk to every other piece instantly and in parallel.

That’s the leap from classifier to creator.


Next Up

In the coming articles we’ll explore:

  • How the inference engine actually runs on hardware
  • Why longer context is expensive
  • Tokens vs words
  • And more LLM fundamentals every power user should know.
