LLM Quantization Explained: An Engineer's Guide to FP32, Int8, GGUF & AWQ
Why shrinking your model is like compressing a JPEG—and how to do it without lobotomizing your AI.

If you've ever tried to run a model like Llama-3 or Mistral locally on your laptop, you've probably hit a wall. You load the model, your fan screams, and then you get the dreaded error: "CUDA Out of Memory."
The problem isn't that your computer is bad. The problem is that LLMs are mathematically heavy.
To solve this, engineers use a technique called Quantization. It's the reason you can run a chatbot on a MacBook Air that used to require a $20,000 server. But how does it work? Does it make the model "stupider"? And what is the difference between Float16, Int8, and GGUF?
This is the non-math guide to the most important optimization in AI.
The Core Problem: "Weights" are Heavy
To understand quantization, you first need to understand what a "Model" actually is.
An LLM is essentially a file filled with billions of numbers called parameters (or weights). These numbers represent the "knowledge" the model has learned. When you ask the model a question, your computer has to load all these numbers into its temporary memory (VRAM) and do math with them.
By default, these numbers are stored in a format called Float32.
The "Resolution" Analogy
Imagine a high-definition photo.
- Float32 (Full Precision): This is a raw, uncompressed 8K image. Every pixel is perfect. It takes up a massive amount of space.
- Int8 (Quantized): This is a standard JPEG. It looks almost identical to the naked eye, but the file size is 4x smaller.
Quantization is simply the process of lowering the "resolution" of the numbers inside the model to save space.
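To make "lowering the resolution" concrete, here is a minimal sketch of the simplest scheme ("absmax" int8 quantization), written in plain NumPy rather than taken from any particular library: pick one scale, round every weight to the nearest step, and store the small integers plus the scale.

```python
# A minimal sketch of "absmax" int8 quantization. Variable names are illustrative.
import numpy as np

weights = np.array([0.12, -0.53, 0.91, -0.07, 0.33], dtype=np.float32)

# 1. Pick a scale so the largest weight maps to 127.
scale = np.abs(weights).max() / 127.0

# 2. "Snap" every weight to the nearest integer step (quantize).
q_weights = np.round(weights / scale).astype(np.int8)

# 3. To use the weights, multiply back by the scale (dequantize).
restored = q_weights.astype(np.float32) * scale

print(q_weights)            # -> [ 17 -74 127 -10  46]
print(restored - weights)   # small rounding errors, not exactly zero
```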
The Dictionary of Data Types
Let's look at the formats you will see on HuggingFace or inside tools like Ollama.
FP32 (Float 32)
- The "Perfect" Copy.
- What is it? Standard scientific notation. It can represent tiny numbers (0.0000001) and massive numbers (1,000,000) with extreme accuracy.
- The Cost: It takes 4 Bytes of memory per number.
- The Math: A 7 Billion parameter model in FP32 requires 28 GB of VRAM. (Note: Most consumer cards only have 8GB to 24GB).
- Verdict: Overkill for running models. Only used for scientific research.
FP16 (Float 16)
- The Old Standard.
- What is it? We cut the memory usage in half. It uses 2 Bytes per number.
- The Cost: 7B Model = 14 GB of VRAM.
- The Problem: FP16 has a "range" problem. If a number gets too big (like during complex training), FP16 runs out of room and the math blows up (you get Inf or NaN instead of a real value).
BF16 (Brain Float 16)
- The Industry Standard (Google/Nvidia).
- What is it? This is a clever engineering hack. It takes up the same space as FP16 (2 Bytes), but it changes how the bits are arranged.
- The Logic: It sacrifices a little bit of precision (fewer decimal places) to handle huge numbers without crashing.
- Verdict: If you are training or fine-tuning a model today, you use BF16. It is stable and efficient.
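A two-line way to see the range difference for yourself, assuming PyTorch is installed: the same large number overflows to infinity in FP16 but survives in BF16 with a little rounding.

```python
# Quick demo of the FP16 vs BF16 "range" difference (requires PyTorch).
import torch

big = torch.tensor([70000.0])

print(big.to(torch.float16))   # tensor([inf], dtype=torch.float16): FP16 maxes out around 65,504
print(big.to(torch.bfloat16))  # tensor([70144.], dtype=torch.bfloat16): less precise, but no overflow
```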
Int8 (Integer 8)
- The Inference King.
- What is it? This format stops using decimals entirely. It forces every number to be a whole integer between -127 and +127.
- The Cost: It takes 1 Byte per number.
- The Math: 7B Model = 7 GB of VRAM.
- Verdict: This fits comfortably on a gaming laptop or a decent cloud GPU.
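If you want to sanity-check these VRAM figures yourself, the arithmetic is simply parameters times bytes per parameter. This counts the weights only; the KV cache and runtime overhead add a few more GB on top.

```python
# Back-of-envelope VRAM math for the weights alone.
params = 7_000_000_000  # a "7B" model

for name, bytes_per_param in [("FP32", 4), ("FP16/BF16", 2), ("Int8", 1), ("Int4", 0.5)]:
    gb = params * bytes_per_param / 1e9
    print(f"{name:10s} ~{gb:.1f} GB")

# FP32       ~28.0 GB
# FP16/BF16  ~14.0 GB
# Int8       ~7.0 GB
# Int4       ~3.5 GB
```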
Why does Quantization make things faster?
It's not just about fitting the model in memory. It's about the Memory Bandwidth.
Think of your GPU like a kitchen:
- The Chef (The Compute Core): The part that does the math.
- The Fridge (The VRAM): Where the ingredients (numbers) are stored.
- The Hallway (Bandwidth): The path between the Fridge and the Chef.
In AI, the Chef is incredibly fast. The bottleneck is almost always the Hallway. The Chef is constantly waiting for ingredients to arrive.
Quantization creates "Smaller Boxes."
- If you use FP32, you can only carry 1 ingredient (weight) per trip down the hallway.
- If you use Int8, you can carry 4 ingredients (weights) per trip in the same box.
Result: The model generates text 2x to 4x faster because the GPU spends less time waiting for data to arrive.
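You can do the back-of-the-envelope version of this yourself. During generation, every weight has to travel down the "hallway" once per token, so memory bandwidth divided by model size gives a rough ceiling on speed. The bandwidth figure below is an assumption for illustration, not a benchmark.

```python
# Rough upper bound: every weight is read once per generated token,
# so tokens/sec is capped at roughly (memory bandwidth / model size in bytes).
bandwidth_gb_s = 1000                                   # assumed GPU with ~1 TB/s of bandwidth
model_size_gb = {"FP16": 14, "Int8": 7, "Int4": 3.5}    # a 7B model at different precisions

for fmt, size in model_size_gb.items():
    print(f"{fmt}: ~{bandwidth_gb_s / size:.0f} tokens/sec (theoretical ceiling)")

# FP16: ~71 tokens/sec
# Int8: ~143 tokens/sec
# Int4: ~286 tokens/sec
```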
Visualizing the "Loss" (Does the model get stupid?)
You might be thinking: "If we round 3.14159 down to 3, doesn't the model lose intelligence?"
Yes and No.
The Grid Analogy
Imagine a ruler.
- FP32 has microscopic tick marks. You can measure anything perfectly.
- Int8 only has 256 tick marks total. You have to snap your measurement to the nearest mark.
The Magic of Large Models: LLMs have a lot of "redundancy." They have billions of parameters. If one neuron loses a bit of precision and says "Cat" with 98% confidence instead of 99% confidence, the next neuron usually corrects it or doesn't care.
- 70B Parameter Models: You can compress these heavily (even down to 4-bit!) with almost zero perceptible difference in intelligence.
- Small Models (7B or 8B): These are more fragile. If you compress them too much (like 3-bit or 2-bit), they start "hallucinating" or speaking gibberish because they don't have enough spare neurons to compensate for the rounding errors.
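To see why 8-bit is usually safe while 2-bit and 3-bit start to hurt, here is a small NumPy sketch (illustrative names, naive per-tensor scheme) that snaps the same weights onto coarser and coarser grids and measures the rounding damage. Real 4-bit schemes use per-group scales and smarter grids, so they do better than this naive version, but the trend is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(0, 0.02, size=100_000).astype(np.float32)  # typical small weights

for bits in (8, 4, 3, 2):
    levels = 2 ** (bits - 1) - 1               # 127 tick marks for 8-bit, only 1 for 2-bit
    scale = np.abs(weights).max() / levels
    restored = np.round(weights / scale) * scale
    err = np.abs(restored - weights).mean()
    zeroed = (restored == 0).mean()
    print(f"{bits}-bit: mean rounding error {err:.5f}, {zeroed:.0%} of weights snapped to exactly 0")
```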
When does Quantization actually happen?
Quantization is usually a "baked-in" decision, not a runtime switch.
Think of it like buying a movie:
- The Cinema Master: The raw footage (FP32/BF16).
- The Blu-Ray: High quality, but compressed (Int8).
- The Mobile Stream: Heavily compressed (Int4).
When you go to the meta-llama official repository on HuggingFace, you are downloading the Cinema Master. By default, standard models are released in BF16 (Bfloat16) or FP16. This is why the files are huge.
The "Shadow" Repositories
To get the quantized versions, you usually have to go to community-maintained repositories.
- If you search for Llama-3-8B, you get the official, big version.
- If you search for Llama-3-8B-GGUF or Llama-3-8B-AWQ, you will find versions uploaded by optimization experts.
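As a practical illustration, downloading one of those community files is a single call with the `huggingface_hub` package. The repository and filename below are hypothetical placeholders, so check the actual community repo for the real names.

```python
# Hedged sketch: pull a community GGUF file with huggingface_hub.
# The repo_id and filename are hypothetical placeholders.
from huggingface_hub import hf_hub_download

path = hf_hub_download(
    repo_id="SomeUser/Llama-3-8B-GGUF",    # hypothetical community repository
    filename="llama-3-8b.Q4_K_M.gguf",     # hypothetical 4-bit file name
)
print(path)  # local path to the downloaded single-file model
```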
The Limit—Why FP32? Why not FP64 or FP128?
If FP32 (Float32) is "High Definition," wouldn't FP64 (Double Precision) be "8K Ultra HD"? Why don't we use it for the model weights?
The "Microns to the Grocery Store" Logic
Imagine you ask: "How far is the grocery store?"
- FP16 answer: "1.5 miles." (Good enough to drive there).
- FP32 answer: "1.500023 miles." (Extremely precise).
- FP64 answer: "1.5000230000000051 miles."
Logic: In AI, the data itself is "noisy." Human language is messy/ambiguous. Training a neural network on vague concepts like "sarcasm" or "poetry" using FP64 precision is like measuring the distance to the store in microns. The "noise" in the data is louder than the precision of the number. It is wasted effort.
The Hardware Reality (Transistors)
This is the real bottleneck. A GPU is a physical chip made of transistors.
- An FP32 calculation requires a specific arrangement of logic gates on the silicon.
- An FP64 calculation requires massively more physical transistors and energy to compute.
NVIDIA's Choice: NVIDIA decided years ago that for AI and Gaming, speed > extreme precision.
- Consumer GPUs (RTX 3090/4090) are intentionally crippled at FP64. They are incredibly slow at it because they simply don't have many "FP64 cores" on the chip.
- Scientific GPUs (A100/H100) can do FP64 (for weather simulation or nuclear physics), but even then, AI researchers turn it off because it uses 2x the memory for 0% gain in model intelligence.
Is there a cap? The "cap" is determined by the hardware manufacturers (NVIDIA/AMD). They optimize their chips for FP16 and BF16 because that's where the AI industry gets the best Return on Investment (ROI).
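If you have an NVIDIA card and PyTorch handy, you can measure the FP64 penalty yourself. The sketch below just times the same matrix multiplication at three precisions; results vary by card, and this is an illustration rather than a proper benchmark.

```python
# Time the same matmul at FP16, FP32, and FP64 (requires PyTorch and a CUDA GPU).
import time
import torch

def bench(dtype, n=4096, reps=10):
    a = torch.randn(n, n, device="cuda", dtype=dtype)
    b = torch.randn(n, n, device="cuda", dtype=dtype)
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(reps):
        a @ b
    torch.cuda.synchronize()
    return (time.time() - t0) / reps

for dtype in (torch.float16, torch.float32, torch.float64):
    print(dtype, f"{bench(dtype) * 1000:.1f} ms per matmul")
```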
Why is Quantization Hard?
This is the one technical concept you should know.
Imagine you are taking a group photo of your friends. Everyone is roughly 5'10". Suddenly, an NBA player who is 7'5" joins the photo.
To fit the NBA player in the frame, the camera has to "zoom out." Now, your friends look tiny. You lose the detail on their faces because the scale was forced to change to accommodate the giant.
This happens in AI models.
- Most neurons output small numbers (0.1, 0.05).
- But occasionally, one "outlier" neuron screams a value of 100.0.
- To fit that 100.0 into our small Int8 grid, we have to "zoom out" our math.
- The Result: All the subtle 0.05 numbers get rounded down to 0. The detail is lost.
The Fix: Modern quantization methods (like AWQ or GGUF) detect these "NBA Player" neurons and treat them specially, keeping them at higher precision while compressing the rest of the crowd.
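A small NumPy sketch (illustrative, not any library's actual code) of the outlier effect: the same absmax scheme from earlier is applied to a handful of small weights, then to the same weights with a single 100.0 "NBA player" added. The outlier forces the scale up, and the small values get crushed to zero.

```python
import numpy as np

def quantize_dequantize(x, levels=127):
    # Naive absmax quantization: scale by the largest value, snap to the grid.
    scale = np.abs(x).max() / levels
    return np.round(x / scale) * scale

small = np.array([0.05, -0.03, 0.10, 0.07, -0.08], dtype=np.float32)
with_outlier = np.append(small, 100.0).astype(np.float32)

print(quantize_dequantize(small))              # small values survive nicely
print(quantize_dequantize(with_outlier)[:5])   # the same values are all crushed to 0
```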
The Format Wars—AWQ vs GGUF
When you look for quantized models, you will see these two acronyms everywhere. They represent two different philosophies of how to run AI.
GGUF (The "Everywhere" Format)
- Stands For: GPT-Generated Unified Format.
- The Vibe: "I want to run this on my MacBook, my Windows CPU, or my Android phone."
- How it works: GGUF is designed to run well on CPUs and Apple Metal (M1/M2/M3 chips), and it can also offload layers to a GPU if you have one.
- The Single-File Feature: Traditionally, models were split into shards (part1, part2) plus separate config and tokenizer files. GGUF combines everything (weights + config + tokenizer) into a single file.
- Use Case: If you are using LM Studio, Ollama, or running a model locally on a laptop without a massive NVIDIA card, you want GGUF.
AWQ (The "GPU Speed" Format)
- Stands For: Activation-aware Weight Quantization.
- The Vibe: "I have a dedicated NVIDIA GPU and I want maximum speed."
- The Logic: Remember the "Outlier" problem (the NBA player in the group photo)?
- Older formats (like GPTQ) treated every number equally.
- AWQ looks at the model while it is thinking (looking at activations). It identifies the 1% of "salient" weights—the ones that matter the most for accuracy—and keeps them in higher precision, while crushing the other 99%.
- Use Case: If you are building a production API using vLLM or have an RTX 4090 and want the fastest possible tokens-per-second, you use AWQ.
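To make the split concrete, here are two hedged loading sketches. The model names are hypothetical placeholders, and the `llama-cpp-python` and `vllm` packages are assumed to be installed; treat these as illustrations of the workflow, not copy-paste recipes.

```python
# --- GGUF on a laptop, via llama-cpp-python (file path is a placeholder) ---
from llama_cpp import Llama

local_llm = Llama(model_path="llama-3-8b.Q4_K_M.gguf")   # the single GGUF file
out = local_llm("Q: What is quantization? A:", max_tokens=64)
print(out["choices"][0]["text"])

# --- AWQ on an NVIDIA GPU, via vLLM (repo name is a placeholder) ---
from vllm import LLM

engine = LLM(model="SomeUser/Llama-3-8B-AWQ", quantization="awq")
print(engine.generate("What is quantization?")[0].outputs[0].text)
```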
Which Quantized Model Should You Use?
If you are a developer or a hobbyist, here is your cheat sheet:
| Your Goal | Recommended Format | Why? |
|---|---|---|
| I just want to chat with a model on my Mac/PC. | Int4 (GGUF) | This is the standard for local tools (Ollama/LM Studio). It shrinks the model enough to run fast but keeps it smart enough to work. |
| I am running a model on a Production Server. | Int8 or FP8 | Safe, fast, and cost-effective. Offers a 2x speed boost over standard BF16. |
| I am Fine-Tuning a model on my data. | BF16 or QLoRA | Never train in pure Int8. You need the stability of BF16, or use QLoRA (which mixes 4-bit loading with 16-bit training). |
| I am doing scientific research. | FP32 | Precision matters more than speed. |
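For the fine-tuning row, a common setup is to load the base model in 4-bit and do the math in BF16. Below is a hedged sketch using Hugging Face `transformers` with `BitsAndBytesConfig`; the model ID is a placeholder, and `bitsandbytes` plus a CUDA GPU are assumed.

```python
# Hedged sketch of the "QLoRA" recipe: store weights in 4-bit, compute in BF16.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store the weights as 4-bit
    bnb_4bit_quant_type="nf4",              # NormalFloat4 grid
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in stable BF16
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B",           # placeholder model ID; substitute your own
    quantization_config=bnb_config,
    device_map="auto",
)
```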
Final Thought
Quantization isn't just about saving money. It's about accessibility. It turns massive super-computer models into software that can run on your laptop. It is the bridge between "AI Research" and "AI Reality."


