Deconstructing the Giants: A Technical Deep Dive into LLM Architecture, Performance, and Cost
What does the '7B' on an LLM really mean? This article provides a rigorous breakdown of the Transformer architecture, showing exactly where those billions of parameters come from and how they directly impact VRAM, latency, cost, and concurrency in real-world deployments.

The headline specification of a Large Language Model (LLM)—be it 7B, 70B, or 175B—is its parameter count. This single number is the most significant indicator of its potential capability, its operational cost, and the hardware required to run it. This article provides a technically rigorous breakdown of the Transformer architecture, showing precisely how these billions of parameters are calculated and how they translate into real-world performance and resource consumption.
The Anatomy of a 7B Model: A Parameter Audit
A "parameter" in an LLM is a trainable variable, typically a weight or a bias in a matrix, that the model learns during training. The "7B" signifies a total of seven billion such parameters. To understand where they come from, we will perform an audit using a standard 7B model architecture, like Llama 2 7B, as our blueprint.
Core Hyperparameters:
- Vocabulary Size (vocab_size): 32,000 tokens
- Model Dimension (d_model): 4096 (This is the width of the vectors that flow through the model)
- Number of Hidden Layers (num_hidden_layers): 32 (This is the depth of the model)
- Feed-Forward Intermediate Size (ffn_intermediate_size): 11008
- Number of Attention Heads (num_attention_heads): 32
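For convenience, here are the same hyperparameters collected in a plain Python dictionary. The names follow this article's notation; note that Hugging Face configuration files use hidden_size and intermediate_size for the same values.

```python
# Llama-2-7B-style hyperparameters used throughout this audit.
llama2_7b_config = {
    "vocab_size": 32_000,              # tokenizer vocabulary size
    "d_model": 4096,                   # width of the residual stream
    "num_hidden_layers": 32,           # depth: number of Transformer blocks
    "ffn_intermediate_size": 11008,    # FFN expansion width
    "num_attention_heads": 32,         # attention heads per layer
}
```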
1. The Token Embedding Layer
This layer converts discrete token IDs into dense vectors. It is a single weight matrix.
- Calculation: vocab_size * d_model
- Example: 32,000 * 4096 = 131,072,000 parameters (verified in the quick check below)
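A minimal PyTorch sketch that reproduces this count, assuming a standard embedding table with no extra positional-embedding parameters:

```python
import torch.nn as nn

# One row of width d_model per vocabulary entry.
emb = nn.Embedding(32_000, 4096)
print(sum(p.numel() for p in emb.parameters()))  # 131,072,000
```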
2. The Transformer Blocks (The Inference Core Engine)
The bulk of the parameters reside in the stack of 32 identical Transformer Blocks. The total is num_hidden_layers multiplied by the parameters per block. Let's dissect one block.
A. Multi-Head Self-Attention (MHSA)
The attention mechanism consists of four primary weight matrices:
- Three matrices for generating the Query, Key, and Value vectors (W_q, W_k, W_v).
- One output projection matrix (W_o) to combine the results from all attention heads.
Each of these matrices has a shape of d_model x d_model.
- Calculation: 4 * (d_model * d_model)
- Example: 4 * (4096 * 4096) = 4 * 16,777,216 = 67,108,864 parameters (see the check below)
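As a sanity check, a minimal PyTorch sketch of the four attention projections, assuming bias-free linear layers as in Llama-style attention:

```python
import torch.nn as nn

d_model = 4096

# The four d_model x d_model projection matrices of one attention block.
attn = nn.ModuleDict({
    name: nn.Linear(d_model, d_model, bias=False)
    for name in ("W_q", "W_k", "W_v", "W_o")
})

print(sum(p.numel() for p in attn.parameters()))  # 67,108,864 = 4 * 4096 * 4096
```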
B. The Feed-Forward Network (FFN)
This is the single largest component within each block. It consists of two linear layers that provide the model with most of its representational capacity.
- Up-Projection Layer: Expands the dimension from d_model to ffn_intermediate_size.
- Down-Projection Layer: Shrinks the dimension from ffn_intermediate_size back to d_model.
- Calculation: (d_model * ffn_intermediate_size) + (ffn_intermediate_size * d_model)
- Example: (4096 * 11008) + (11008 * 4096) = 45,088,768 + 45,088,768 = 90,177,536 parameters (verified in the sketch below)
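The same count, reproduced with bias-free PyTorch layers for the two-matrix FFN described above:

```python
import torch.nn as nn

d_model, ffn_intermediate_size = 4096, 11008

up = nn.Linear(d_model, ffn_intermediate_size, bias=False)    # d_model -> ffn
down = nn.Linear(ffn_intermediate_size, d_model, bias=False)  # ffn -> d_model

print(sum(p.numel() for m in (up, down) for p in m.parameters()))  # 90,177,536
```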
C. Layer Normalization Parameters
Each Transformer block typically has two LayerNorm modules. Each module has two trainable parameters (a gain/scale γ and a bias/shift β) for each dimension of the input vector.
- Calculation per block: 2 * (2 * d_model)
- Example: 2 * (2 * 4096) = 16,384 parameters (a negligible amount, but included for completeness; see the check below)
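A quick check of the per-module count, assuming standard LayerNorm with both a scale and a bias vector:

```python
import torch.nn as nn

# Default LayerNorm carries a weight (gamma) and a bias (beta) per dimension.
ln = nn.LayerNorm(4096)
print(sum(p.numel() for p in ln.parameters()))  # 8,192 per module; 16,384 for two per block
```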
Total Parameters Per Block:
67,108,864 (Attention) + 90,177,536 (FFN) + 16,384 (LayerNorm) ≈ 157.3 Million
Total Across All Blocks:
32 layers * 157,302,784 params/layer = 5,033,689,088 parameters
3. The Final Prediction Head (LM Head)
At the very end, a final linear layer maps the output vector back to the vocabulary space. This is often called the "un-embedding" layer and its weights can sometimes be tied to the initial token embedding matrix.
- Calculation: d_model * vocab_size
- Example: 4096 * 32,000 = 131,072,000 parameters
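A minimal sketch of the weight-tying option mentioned above: when the LM head shares the embedding matrix, those 131 million parameters are counted only once.

```python
import torch.nn as nn

emb = nn.Embedding(32_000, 4096)
lm_head = nn.Linear(4096, 32_000, bias=False)  # 131,072,000 parameters on its own

# Weight tying: both layers now reference the same (32,000 x 4096) matrix.
lm_head.weight = emb.weight
```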
Grand Total Calculation
- Token Embeddings: 131,072,000
- Transformer Blocks: 5,033,689,088
- LM Head: 131,072,000
- Other (final LayerNorm, etc.): ~4,096
- Approximate Total: ~5.3 Billion Parameters
Note: The actual Llama 2 7B architecture differs slightly from this simplified audit; most notably, its gated (SwiGLU) feed-forward network adds a third projection matrix per layer, which brings the final count to just under 7 billion. The calculation nonetheless demonstrates that the vast majority of parameters are in the FFN and Attention matrices, repeated across dozens of layers.
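The sketch below reproduces both totals: the simplified audit above, and a Llama-2-style variant assuming the publicly documented design choices (gated SwiGLU FFN with a third projection matrix, RMSNorm with a single scale vector per module).

```python
# Parameter audit: simplified two-matrix-FFN model vs. a Llama-2-style block.
vocab_size, d_model, n_layers, ffn = 32_000, 4096, 32, 11008

embed = vocab_size * d_model            # token embedding table
lm_head = d_model * vocab_size          # un-embedding / prediction head
attn = 4 * d_model * d_model            # W_q, W_k, W_v, W_o

# Simplified block: 2-matrix FFN + 2 LayerNorms (scale and bias per dimension).
simple_block = attn + 2 * d_model * ffn + 2 * (2 * d_model)
simple_total = embed + n_layers * simple_block + 2 * d_model + lm_head
print(f"simplified audit: {simple_total:,}")    # ~5.3 billion

# Llama-2-style block: gated FFN (gate, up, down) + 2 RMSNorms (scale only).
llama_block = attn + 3 * d_model * ffn + 2 * d_model
llama_total = embed + n_layers * llama_block + d_model + lm_head
print(f"Llama-2-style:    {llama_total:,}")     # ~6.74 billion
```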
Architectural Choices and Their Impact on Capability
These hyperparameters are deliberately chosen based on Scaling Laws—empirical research that links a model's size, dataset size, and computational budget to its final performance.
- d_model (Width): A larger d_model allows each token's vector to hold richer, more nuanced information. It directly increases the model's representational density.
- num_hidden_layers (Depth): Depth enables the model to build a hierarchical understanding of language. Early layers may handle syntax, mid-layers may process semantics, and final layers can infer abstract concepts like intent. Deeper models excel at complex reasoning.
- ffn_intermediate_size: The FFN's expansion factor is critical for the model's ability to store and access factual knowledge learned during training. Increasing it is a highly parameter-efficient way to boost a model's "intelligence."
From Architecture to Execution: Hardware and Resource Planning
Understanding the architecture allows us to precisely calculate the resources required for deployment.
1. Static Memory: Loading the Model Weights
This is the fixed memory (VRAM for a GPU, RAM for a CPU) or disk space needed just to load the model. It depends on the numerical precision used for the parameters.
| Precision | Bytes per Parameter | Calculation for 7B Model | Required VRAM/Disk |
|---|---|---|---|
| FP32 (Full) | 4 bytes | 7e9 * 4 | ~28 GB |
| FP16 / BFloat16 | 2 bytes | 7e9 * 2 | ~14 GB |
| INT8 (Quantized) | 1 byte | 7e9 * 1 | ~7 GB |
| INT4 (Quantized) | ~0.5 bytes | 7e9 * 0.5 | ~3.5 GB |
This calculation dictates the minimum VRAM a GPU must have. A 7B model at FP16 precision will not fit on a GPU with 12GB of VRAM.
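A quick sketch of the arithmetic behind the table, counting weights only; runtime overhead such as activations and the CUDA context comes on top of these figures.

```python
# Static memory footprint of the model weights at different precisions.
params = 7e9  # nominal parameter count of a "7B" model

for precision, bytes_per_param in [("FP32", 4.0), ("FP16/BF16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    print(f"{precision:>9}: {params * bytes_per_param / 1e9:5.1f} GB")
```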
2. Dynamic Memory: The KV Cache and Concurrency
During inference, the model caches the Key and Value vectors (the KV cache) for every token in the sequence. This cache prevents re-computing those vectors for previous tokens at each new decoding step and is the primary consumer of dynamic memory.
- KV Cache Size Formula: 2 * num_layers * d_model * context_length * bytes_per_element (see the sketch below)
- This memory cost is incurred per user request.
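A minimal sketch of that formula, assuming standard multi-head attention as in Llama 2 7B, where one key and one value vector of width d_model are cached per layer for every token in the context:

```python
def kv_cache_bytes(num_layers: int, d_model: int, context_length: int, bytes_per_element: float) -> float:
    # Factor of 2: one key vector and one value vector per layer, per token.
    return 2 * num_layers * d_model * context_length * bytes_per_element

# 7B-class shapes, 4096-token context, FP16 cache elements:
print(kv_cache_bytes(32, 4096, 4096, 2) / 1e9)  # ~2.15 GB per request
```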
Concurrency Scenario: Let's plan for serving a 4-bit quantized 7B model on a single NVIDIA RTX 4090 with 24GB of VRAM.
- Static Memory Cost: The 4-bit model weights will consume ~4 GB of VRAM.
- Available Dynamic VRAM: 24 GB - 4 GB = 20 GB
- Dynamic Cost per Request: Assume a context length of 4096 tokens. Even with a quantized model, the KV cache is often stored at a higher precision (e.g., FP16) for accuracy.
- KV Cache per request (FP16): 2 * 32 * 4096 * 4096 * 2 bytes ≈ 2.15 GB
- At ~2.15 GB per request, the 20 GB of dynamic VRAM would support only about 9 concurrent users. Modern serving systems therefore quantize the KV cache as well. Let's assume an 8-bit cache:
- KV Cache per request (INT8): 2 * 32 * 4096 * 4096 * 1 byte ≈ 1.07 GB
- Concurrency Calculation: Available Dynamic VRAM / KV Cache per Request = 20 GB / 1.07 GB ≈ 18
Conclusion: This single GPU could handle a concurrent batch of approximately 18 users, each with a 4096-token context, before exhausting its VRAM. This demonstrates that concurrency is limited directly by the KV cache and available VRAM.
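The full back-of-envelope for this scenario, collected in one place. Real servers reserve additional VRAM for activations and memory fragmentation, so treat the result as an upper bound.

```python
# Concurrency estimate: 4-bit 7B weights on a 24 GB RTX 4090, INT8 KV cache,
# 4096-token context per request.
total_vram_gb = 24.0
weights_gb = 4.0                                      # 4-bit weights plus overhead
kv_per_request_gb = 2 * 32 * 4096 * 4096 * 1 / 1e9    # ~1.07 GB per request (INT8)

dynamic_gb = total_vram_gb - weights_gb               # VRAM left for KV caches
print(int(dynamic_gb // kv_per_request_gb))           # ≈ 18 concurrent requests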
3. Computational Performance: CPU vs. GPU
The core computation in a Transformer is the General Matrix Multiply (GEMM) operation.
- GPU (Throughput Engine): A GPU is a massively parallel processor with thousands of cores (e.g., Tensor Cores on NVIDIA hardware) designed specifically to accelerate the multiply-and-add operations that constitute a matrix multiplication. It can perform trillions of these operations per second.
- CPU (Latency Engine): A CPU has a small number of powerful cores designed for complex, sequential tasks. While it can perform matrix multiplication, it cannot match the parallelism of a GPU, making it 10x to 100x slower for LLM inference.
Running a 7B model on a CPU is feasible using frameworks like llama.cpp, but it is best suited to development or non-real-time tasks. For any application requiring low latency, a GPU is essential. The model's width (d_model) is the primary driver of computational intensity, while its depth (num_layers) is the main contributor to sequential latency, since each token must pass through every layer in order.
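A rough sketch of the GEMM workload that dominates a single decode step: one token's activation (1 x d_model) multiplied by a d_model x d_model weight matrix, repeated for every projection in every layer. Absolute timings depend entirely on the hardware; the point is the relative gap.

```python
import time
import torch

d_model = 4096
x = torch.randn(1, d_model)          # one token's activation
w = torch.randn(d_model, d_model)    # one projection matrix

def bench(device: str, iters: int = 100) -> float:
    xd, wd = x.to(device), w.to(device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        _ = xd @ wd                   # the core multiply-and-add workload
    if device == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

print(f"CPU: {bench('cpu') * 1e6:.1f} us per GEMM")
if torch.cuda.is_available():
    print(f"GPU: {bench('cuda') * 1e6:.1f} us per GEMM")
```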
Conclusion: From Theory to Practical Engineering
This deep dive demonstrates that the "7B" headline figure is far more than a marketing number; it is the direct outcome of a series of deliberate architectural choices. We have shown that the overwhelming majority of these parameters are concentrated in the Multi-Head Self-Attention and Feed-Forward Network components, which are replicated across dozens of layers to give the model its analytical depth.
The key takeaway is the direct and calculable relationship between these abstract hyperparameters and the concrete realities of deployment.
- num_params dictates the static VRAM and disk footprint.
- d_model drives the computational intensity (FLOPs).
- num_layers creates sequential latency.
- The KV Cache, a function of depth, width, and context length, governs the maximum user concurrency.
By understanding these fundamental calculations, engineers can move beyond treating LLMs as black boxes. They can precisely forecast hardware costs, optimize performance through techniques like quantization, and design robust, scalable systems that balance capability with efficiency. This architectural literacy is no longer optional—it is essential for building the next generation of intelligent applications.


