CS336 Lecture 3 - Architectures and Hyperparameters

Please read along with the original slides

The modern Transformer recipe
#

The original Transformer is still the conceptual starting point, but modern large language models are not usually exact copies of it.

A typical modern dense LLM block looks more like this:

input hidden states enter a normalisation layer;
the normalised states go into causal self-attention;
the attention result is added back through a residual connection;
another normalisation layer is applied;
the result goes into a feed-forward network, usually with a gated activation;
another residual addition produces the block output.

In rough form:

x = x + Attention(Norm(x))
x = x + FFN(Norm(x))

This is called pre-norm, because the normalisation happens before the attention or FFN sublayer.

The important design idea is that the residual stream remains a relatively clean path through the network. The normalisation affects the branch computation, but it does not directly sit on the main residual path after every addition.

That sounds like a small rearrangement, but it matters a lot for training stability.

Pre-norm vs post-norm
#

The original Transformer used post-norm:

x = Norm(x + Attention(x))
x = Norm(x + FFN(x))

Modern LLMs usually use pre-norm:

x = x + Attention(Norm(x))
x = x + FFN(Norm(x))

The practical reason is stability.

With post-norm, the normalisation is placed after the residual addition. This can interfere with the residual signal path and can make gradients behave badly in deeper networks. The lecture discusses two related explanations:

gradient attenuation, where gradients shrink or become poorly conditioned through depth;
gradient spikes, where training becomes unstable and requires careful warmup or smaller learning rates.

Pre-norm became the standard because it tends to make large models easier to train. It helps preserve the residual stream and allows larger learning rates or less fragile warmup schedules.

A useful mental model:

Post-norm says: “normalise the result after mixing the residual and new computation.”
Pre-norm says: “normalise only the input to the new computation, and leave the residual highway mostly untouched.”

That second design is friendlier to very deep networks.
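
The difference between the two orderings can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: `toy_sublayer` is a hypothetical stand-in for attention or the FFN, and the normalisation is an RMS norm without a learned gain.

```python
import math

def rms_norm(x, eps=1e-6):
    """Normalise a vector by its root mean square (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def toy_sublayer(x):
    """Hypothetical stand-in for Attention or FFN: a fixed elementwise map."""
    return [0.5 * v + 0.1 for v in x]

def pre_norm_step(x):
    # x = x + Sublayer(Norm(x)): the residual path itself is never normalised.
    branch = toy_sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, branch)]

def post_norm_step(x):
    # x = Norm(x + Sublayer(x)): the normalisation sits on the residual path.
    mixed = [a + b for a, b in zip(x, toy_sublayer(x))]
    return rms_norm(mixed)

x = [10.0, -20.0, 30.0, -40.0]
pre = pre_norm_step(x)    # stays close to x: the residual passes through unchanged
post = post_norm_step(x)  # rescaled to unit RMS: the residual signal is altered
```

In the pre-norm step the input survives as an identity path plus a small branch update, while in the post-norm step every block rescales the whole stream.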

Some newer models go further and add extra normalisation outside the residual stream. This is sometimes called double norm or non-residual post-norm. The motivation is not to return to the old post-norm design, but to add additional control without damaging the residual pathway.

LayerNorm vs RMSNorm
#

The original Transformer used LayerNorm. Many modern LLMs use RMSNorm instead.

LayerNorm normalises both the mean and the variance of the hidden vector.

$$ y = \frac{x - \mathbb{E}[x]}{\sqrt{\mathrm{Var}[x] + \epsilon}} \cdot \gamma + \beta $$

RMSNorm is simpler: it normalises using the root mean square and does not subtract the mean. For a hidden vector of dimension $d$:

$$ y = \frac{x}{\sqrt{\frac{1}{d}\|x\|_2^2 + \epsilon}} \cdot \gamma $$
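
As a minimal sketch, the two normalisations can be written side by side in plain Python. Both multiply the normalised vector by a learned gain `gamma`, and only LayerNorm adds a bias `beta`. The default `eps` and the example vector are illustrative assumptions.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """Subtract the mean, divide by the standard deviation, then scale and shift."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """Divide by the root mean square, then scale. No mean, no bias."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4
ln_out = layer_norm(x, ones, zeros)  # zero mean, unit variance
rms_out = rms_norm(x, ones)          # unit RMS, mean is not removed
```

RMSNorm needs one reduction over the vector (the mean of squares) where LayerNorm needs two (mean and variance), and it carries one parameter vector instead of two.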

The lecture’s main point is that RMSNorm’s popularity does not come from any radical change in the model’s expressive power. It is popular because it is cheaper and works about as well.

The advantages are:

fewer operations, because it does not compute the mean;
fewer parameters, because it usually does not use a bias term;
less data movement;
better wall-clock performance in practice.

A key systems lesson appears here:

FLOPs are not runtime.

Even if normalisation is a tiny fraction of total FLOPs, it can still matter because runtime is often affected by memory movement, kernel launches, and bandwidth. Matrix multiplications dominate FLOPs, but small operations can still hurt performance if they move data inefficiently.

So RMSNorm is a good example of an architectural choice that looks mathematically minor but is useful from a systems perspective.

Dropping bias terms
#

Modern Transformer implementations often remove bias terms from linear layers and normalisation layers.

For example, instead of:

FFN(x) = activation(xW1 + b1)W2 + b2

many modern models use something closer to:

FFN(x) = activation(xW1)W2

This is not because bias terms are impossible to use. Older models used them. The argument is more pragmatic:

bias terms add parameters;
they add memory movement;
they often do not provide a large enough benefit;
removing them can slightly simplify optimisation and implementation.

This is part of a broader modern LLM design trend: if a component costs memory bandwidth but does not clearly improve quality, people tend to remove it.

Feed-forward activations: from ReLU to GLU
#

The feed-forward network is a major part of each Transformer block. In many LLMs, it contains a large fraction of the model’s parameters and compute.

Older models used activations like:

ReLU
GELU
Swish

Modern models increasingly use gated activations, especially:

GeGLU
SwiGLU

A standard FFN looks like this:

FFN(x) = activation(xW1)W2

A gated FFN adds another linear projection and multiplies the two branches elementwise:

FFN(x) = (activation(xW1) ⊙ xV)W2

For SwiGLU:

SwiGLU(x) = Swish(xW1) ⊙ xV

The intuition is that the model gets a learned gate. Instead of merely transforming features, it can also decide which feature channels should pass through more strongly.

Gated variants add parameters, so models usually reduce the feed-forward hidden dimension when using them. A common rule is:

standard FFN: $d_{\text{ff}} \approx 4 d_{\text{model}}$
GLU-style FFN: $d_{\text{ff}} \approx \frac{8}{3} d_{\text{model}}$

This keeps the parameter count roughly comparable.
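
A small sketch of a bias-free SwiGLU FFN, together with the parameter arithmetic behind the 8/3 rule. The matrix helper and the choice of `d_model = 3072` are assumptions made for illustration (3072 is picked only so that 8/3 of it is an integer).

```python
import math

def swish(v):
    """Swish / SiLU: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))

def matvec(x, W):
    """Row vector x (length n) times matrix W (n rows, m columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def swiglu_ffn(x, W1, V, W2):
    """FFN(x) = (Swish(x W1) ⊙ x V) W2, with no bias terms."""
    gate = [swish(v) for v in matvec(x, W1)]
    value = matvec(x, V)
    hidden = [g * u for g, u in zip(gate, value)]
    return matvec(hidden, W2)

def ffn_params(d_model, d_ff, gated):
    """Weight count: two matrices for a standard FFN, three for a gated one."""
    return (3 if gated else 2) * d_model * d_ff

d_model = 3072  # hypothetical size
standard = ffn_params(d_model, 4 * d_model, gated=False)
glu = ffn_params(d_model, 8 * d_model // 3, gated=True)
print(standard, glu)  # both equal 8 * d_model**2
```

The standard FFN has $2 \cdot d_{\text{model}} \cdot 4 d_{\text{model}} = 8 d_{\text{model}}^2$ weights, and the gated one has $3 \cdot d_{\text{model}} \cdot \frac{8}{3} d_{\text{model}} = 8 d_{\text{model}}^2$, which is where the 8/3 factor comes from.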

The lecture’s practical conclusion is:

ReLU and GELU can still work.
GPT-3 used GELU and obviously worked.
But most recent models have moved towards SwiGLU or GeGLU.
The empirical evidence suggests gated activations give fairly consistent gains.

So, for a modern LLM implementation, SwiGLU + RMSNorm + pre-norm is a very normal choice.

Serial vs parallel Transformer blocks
#

The usual Transformer block is serial:

x = x + Attention(Norm(x))
x = x + MLP(Norm(x))

Attention is computed first, then the MLP.

Some models use a parallel block:

x = x + Attention(Norm(x)) + MLP(Norm(x))

This can be faster if implemented carefully, because:

the same normalised input can be shared;
matrix multiplications may be fused;
attention and MLP branches can be scheduled more efficiently.
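
The data flow of the two layouts can be sketched as follows. The scalar "hidden state" and the `toy_*` functions are hypothetical stand-ins, used only to show which inputs each branch reads.

```python
def serial_block(x, norm1, norm2, attention, mlp):
    """x = x + Attention(Norm(x)), then x = x + MLP(Norm(x))."""
    x = x + attention(norm1(x))
    # The MLP reads the residual stream after the attention update.
    return x + mlp(norm2(x))

def parallel_block(x, norm, attention, mlp):
    """x = x + Attention(Norm(x)) + MLP(Norm(x)), sharing one normalised input."""
    h = norm(x)  # computed once and reused by both branches
    return x + attention(h) + mlp(h)

# Hypothetical scalar stand-ins, only to make the data flow visible.
def identity(v):
    return v

def toy_attention(h):
    return 0.1 * h

def toy_mlp(h):
    return 0.2 * h

serial_out = serial_block(1.0, identity, identity, toy_attention, toy_mlp)
parallel_out = parallel_block(1.0, identity, toy_attention, toy_mlp)
```

The two outputs differ slightly, because in the parallel layout the MLP no longer sees the attention update from the same block. That is the modelling change being traded for the efficiency gains listed above.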

Models such as GPT-J, PaLM, and GPT-NeoX used parallel layers.

However, the lecture notes that most models still use the serial design. Parallel layers are interesting, but they have not become the universal default.

The practical takeaway:

serial blocks are the safe, standard choice;
parallel blocks can be useful for efficiency;
but parallelisation is an implementation and scaling trade-off, not a guaranteed quality improvement.

Position embeddings
#

Position information is necessary because attention by itself does not know token order. Without positional information, a Transformer would treat a sequence essentially as an unordered set.

The lecture covers several position embedding families.

Sinusoidal embeddings
#

The original Transformer used fixed sine and cosine functions.

The model adds a position-dependent vector to each token embedding:

embedding(token, position) = token_embedding + sinusoidal_position_vector

This gives the model a smooth notion of position, but it is still an additive absolute-position method.
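
A minimal sketch of the additive scheme, using the sine/cosine formula from the original Transformer paper (sine on even indices, cosine on odd indices, base 10000). The 4-dimensional token embedding is a made-up example.

```python
import math

def sinusoidal_position_vector(position, d_model, base=10000.0):
    """Fixed position vector: sin on even indices, cos on odd indices."""
    vec = []
    for i in range(0, d_model, 2):
        angle = position / (base ** (i / d_model))
        vec.append(math.sin(angle))
        vec.append(math.cos(angle))
    return vec[:d_model]

def embed(token_embedding, position):
    """embedding(token, position) = token_embedding + sinusoidal_position_vector."""
    pos_vec = sinusoidal_position_vector(position, len(token_embedding))
    return [t + p for t, p in zip(token_embedding, pos_vec)]

token_embedding = [0.5, -0.5, 0.25, 0.0]  # hypothetical 4-dimensional embedding
at_position_0 = embed(token_embedding, 0)
at_position_3 = embed(token_embedding, 3)
```

The same token ends up with a different input vector at each position, which is all the model has to go on when recovering order.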

Learned absolute embeddings
#

Models like GPT-2 and GPT-3 used learned position embeddings.

embedding(token, position) = token_embedding + learned_position_vector

This is simple and effective, but it is tied to absolute positions. It also does not naturally extrapolate to longer sequence lengths.

Relative position embeddings
#

Relative position methods try to make attention depend on the distance between tokens rather than their absolute indices.

Instead of “token at position 17 attends to token at position 4”, the model can reason more like “this token attends to another token 13 positions earlier”.

This is often more natural for language.

RoPE: Rotary Position Embeddings
#

RoPE is now one of the dominant choices in modern LLMs.

The key idea is elegant:

Encode position by rotating query and key vectors, so their inner product depends on relative position.

Rather than adding a position vector to the embedding, RoPE modifies the query and key vectors inside attention.

For each pair of hidden dimensions, RoPE applies a 2D rotation whose angle depends on the token position. When the model computes the dot product between a query and a key, the result naturally contains information about the relative distance between their positions.

A useful way to think about it:

token content gives the base vector;
position rotates that vector;
attention compares rotated query/key vectors;
the comparison depends on relative offset.

This is different from sinusoidal embeddings because RoPE is multiplicative/rotational rather than additive. It avoids some unwanted cross terms produced by simply adding position vectors to token embeddings.

In implementation, RoPE is usually applied to queries and keys, not values.

That detail matters: attention scores come from $\mathrm{Q}\mathrm{K}^\mathrm{T}$, so applying RoPE to $\mathrm{Q}$ and $\mathrm{K}$ directly affects how tokens attend to each other by position.
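
The relative-position property can be checked numerically with a short sketch. It assumes the convention of rotating adjacent pairs of dimensions with frequencies `base ** (-i / d)`. Real implementations differ in how they pair dimensions, and the vectors `q` and `k` here are made up.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of dimensions by a position-dependent angle."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out.extend([x0 * c - x1 * s, x0 * s + x1 * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 0.5, -0.3, 2.0]   # hypothetical query vector
k = [0.2, -1.0, 0.7, 0.4]   # hypothetical key vector

# Same offset (13 positions), different absolute positions.
score_a = dot(rope(q, 17), rope(k, 4))
score_b = dot(rope(q, 113), rope(k, 100))
# Different offset (7 positions).
score_c = dot(rope(q, 17), rope(k, 10))
```

`score_a` and `score_b` agree up to floating-point error because only the offset between the two positions enters the dot product, while `score_c` differs because the offset changed.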

Feed-forward dimension: why $d_{\text{ff}} \approx 4 d_{\text{model}}$?
#

A standard Transformer FFN expands the hidden dimension and then projects it back down.

If the model dimension is $d_{\text{model}}$, the feed-forward dimension is often:

$$ d_{\text{ff}} = 4 d_{\text{model}} $$

This rule appears again and again across models.

For GLU-style FFNs, because there is an extra gate projection, the expansion is often reduced to:

$$ d_{\text{ff}} \approx \frac{8}{3} d_{\text{model}} $$

The lecture frames this as a surprisingly strong consensus. There are exceptions, but most models stay in a fairly conservative range.

One famous exception is T5-11B, which used an enormous feed-forward multiplier:

d_ff = 65,536
d_model = 1,024

That is a 64x multiplier.

But the lecture is careful here: the fact that something works does not mean it is optimal. T5 v1.1 later moved to a more conventional GeGLU setup with a much smaller multiplier.

In summary:

4x is the boring but strong default;
8/3x is common for GLU variants;
extreme FFN widths can work, but are not obviously the best use of parameters;
most successful LLMs are less adventurous than people might expect.

Attention heads and head dimension
#

A standard multi-head attention setup usually satisfies:

num_heads × head_dim = d_model

For example:

d_model = 4096
num_heads = 32
head_dim = 128

This is not mathematically required. A model could choose a total attention dimension larger or smaller than d_model.

But most models stay close to the simple rule.

There are exceptions, especially in some Google models such as T5 and LaMDA, where the ratio between total head dimension and model dimension can be larger than 1.

The lecture’s attitude here is quite sceptical:

this convention is widely used;
it seems to work;
but there is not necessarily deep validation proving it is uniquely optimal.

So it is a consensus default, not a law of nature.

Deep vs wide: model aspect ratio
#

Another hyperparameter is the model’s aspect ratio:

d_model / num_layers

This roughly asks:

Should the model be wide with fewer layers, or deep with narrower layers?

The lecture notes that many successful models fall into a broad range, often around 100–200, though there are outliers.

There is no single magic number.
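
To make the trade-off concrete, here is a rough sketch comparing hypothetical configurations with the same approximate parameter budget but very different aspect ratios. It assumes the common back-of-the-envelope estimate of about $12 d_{\text{model}}^2$ weights per layer, ignoring embeddings, norms, and biases.

```python
def approx_block_params(d_model, num_layers):
    """Rough non-embedding count: about 12 * d_model^2 weights per layer.

    Assumes 4 * d_model^2 for attention (Q, K, V, output projections) and
    8 * d_model^2 for the FFN (d_ff = 4 * d_model, or 8/3 * d_model when gated).
    """
    return 12 * d_model ** 2 * num_layers

# Hypothetical configurations with the same approximate parameter count.
configs = {
    "deep and narrow": (2048, 128),
    "balanced": (4096, 32),
    "wide and shallow": (8192, 8),
}

for name, (d_model, num_layers) in configs.items():
    ratio = d_model / num_layers
    params = approx_block_params(d_model, num_layers)
    print(f"{name}: aspect ratio {ratio:g}, ~{params / 1e9:.2f}B parameters")
```

All three have the same estimated size, so the choice between them has to be made on quality and systems grounds rather than on parameter count.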

The important systems consideration is that extremely deep models are harder to parallelise. Layers are sequential: layer 12 depends on layer 11, which depends on layer 10, and so on. That creates latency and limits parallel execution.

Very wide models, by contrast, can often use larger matrix multiplications, which GPUs are good at.

So the depth/width decision is not only about model quality. It is also about:

training throughput;
inference latency;
pipeline parallelism;
hardware utilisation;
communication cost across devices.

A model that is theoretically elegant but slow to train or serve may be a poor engineering choice.

Vocabulary size
#

Vocabulary size depends heavily on language coverage and production needs.

For mostly monolingual English models, typical vocabulary sizes are around 30k-50k tokens. Examples include GPT-2/3 around 50k and LLaMA around 32k.

For multilingual or production systems, vocabularies are often much larger, around 100k-250k+ tokens. Examples include PaLM, mT5, Qwen, DeepSeek, Gemma, and GPT-4-scale tokenizers.

The reason is simple: multilingual models need to represent many writing systems, languages, scripts, and character combinations. A small vocabulary can make non-English text inefficient, producing too many tokens for the same sentence.

The practical takeaway:

small vocabularies are fine for narrow language coverage;
multilingual models usually need larger vocabularies;
tokenisation is one of the major places where models still differ.

Dropout and regularization
#

Classic neural networks often rely on dropout to prevent overfitting.

For LLM pretraining, the argument against dropout is reasonable:

the dataset is huge;
models often see each token only once or a small number of times;
memorisation is less of a concern than in the small-data regime;
dropout can slow or destabilise optimisation.

Older models often used dropout, including GPT-2, GPT-3, T5, and OPT.

Newer models often use little or no dropout during pretraining. Instead, they may rely on weight decay.

But weight decay in LLMs is not simply about preventing overfitting. The lecture highlights that weight decay interacts with the learning-rate schedule, especially cosine decay.

So regularization in LLM pretraining is often better understood as part of the optimisation dynamics rather than merely a defence against train/test overfitting.

Stability tricks
#

Large-model training can fail in messy ways. The loss curve may spike, gradients may explode, or the model may become unstable late in training.

The lecture focuses on one dangerous component: the softmax.

Softmax involves exponentials and normalisation. If logits become too large, the output can become extremely sharp or numerically unstable.

There are two main softmax locations in an LLM:

the final output softmax over vocabulary;
the attention softmax over tokens.

Modern models use several tricks to keep these stable.

Output Softmax stability: z-loss
#

The z-loss penalises the log normalisation term in the output softmax.

The softmax probability is:

$$ p_i = \frac{e^{z_i}}{Z} $$

where the normalisation term is:

$$ Z = \sum_j e^{z_j} $$

Taking the logarithm gives:

$$ \log p_i = z_i - \log Z $$

The z-loss adds an auxiliary penalty term:

$$ L_z = \alpha (\log Z)^2 $$

where $\alpha$ is usually a very small constant.

The intuition:

if logits become huge, $Z$ becomes huge;
huge logits make the softmax distribution extremely sharp;
penalising $\log Z$ discourages logits from growing uncontrollably.

This improves numerical stability during training, especially for very large language models.
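
A minimal sketch of the penalty for a single position, in plain Python. The value `alpha=1e-4` and the example logits are illustrative assumptions, and `log_z` uses the usual max-subtraction trick so the exponentials do not overflow.

```python
import math

def log_z(logits):
    """log of the softmax normaliser Z, computed stably by subtracting the max."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))

def cross_entropy_with_z_loss(logits, target, alpha=1e-4):
    """Return (cross-entropy, z-loss penalty) for one position."""
    lz = log_z(logits)
    ce = -(logits[target] - lz)    # -log p_target = log Z - z_target
    penalty = alpha * lz ** 2      # L_z = alpha * (log Z)^2
    return ce, penalty

small = [1.0, 0.0, -1.0]
large = [101.0, 100.0, 99.0]   # same softmax probabilities, logits shifted by 100

ce_small, pen_small = cross_entropy_with_z_loss(small, target=0)
ce_large, pen_large = cross_entropy_with_z_loss(large, target=0)
```

The cross-entropy is identical for both inputs, since softmax is invariant to shifting all logits by a constant. Only the z-loss term distinguishes them, which is how it discourages the logits from drifting to large magnitudes.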

PaLM used this trick, and the lecture lists other models that also adopted it, such as Baichuan 2, DCLM, OLMo 2, and OLMo 3.

Attention Softmax stability: QK norm
#

Attention scores are computed from queries and keys:

$$ \text{scores} = \frac{\mathrm{Q}\mathrm{K}^{\mathrm{T}}}{\sqrt{d_k}} $$

Then softmax is applied.

If Q and K have large norms, the dot products can become large, making the attention softmax too sharp or unstable.

QK-norm normalises queries and keys before they enter the attention softmax.

In simplified form:

$$ \begin{align*} \mathrm{Q} &= \text{Norm}(\mathrm{Q}) \\ \mathrm{K} &= \text{Norm}(\mathrm{K}) \\ \text{scores} &= \frac{\mathrm{Q}\mathrm{K}^\mathrm{T}}{\sqrt{d_k}} \end{align*} $$

This directly targets attention stability.
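
The effect on a single query/key pair can be sketched as below. It assumes an RMS norm without a learned gain as the `Norm`. Real models may use LayerNorm or a learned scale, and the vectors are made up.

```python
import math

def rms_normalise(vec, eps=1e-6):
    """Scale a vector to unit root mean square (learned gain omitted)."""
    rms = math.sqrt(sum(v * v for v in vec) / len(vec) + eps)
    return [v / rms for v in vec]

def attention_score(q, k, qk_norm):
    """Scaled dot-product score for one query/key pair."""
    if qk_norm:
        q, k = rms_normalise(q), rms_normalise(k)
    return sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q))

q = [30.0, -40.0, 50.0, 10.0]   # hypothetical query with a large norm
k = [25.0, -35.0, 45.0, 5.0]    # hypothetical key with a large norm

raw = attention_score(q, k, qk_norm=False)
normed = attention_score(q, k, qk_norm=True)
```

Without normalisation the score grows with the product of the two norms. With this particular norm, its magnitude cannot exceed $\sqrt{d_k}$ however large the raw activations become.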

The lecture notes that QK-norm appears in several recent models, including DCLM, OLMo 2, Gemma 2, Qwen3, OLMo 3, and Gemma 4.

Logit soft-capping
#

Another stability trick is logit soft-capping.

Instead of allowing logits to grow without bound, the model passes them through a tanh-based cap:

logits = soft_cap × tanh(logits / soft_cap)

This keeps logits within a controlled range.
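
A short sketch of the cap applied to a list of logits. The value `cap=30.0` is only an illustrative choice.

```python
import math

def soft_cap(logits, cap):
    """Squash logits smoothly into the range [-cap, cap]."""
    return [cap * math.tanh(z / cap) for z in logits]

logits = [-500.0, -1.0, 0.0, 1.0, 500.0]
capped = soft_cap(logits, cap=30.0)   # cap=30.0 is an illustrative value
```

Logits much smaller than the cap pass through almost unchanged, because tanh is close to the identity near zero, while extreme values saturate at roughly plus or minus the cap.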

The upside:

prevents attention or output logits from blowing up;
can improve numerical stability.

The downside:

it may hurt performance if the cap restricts useful confidence too much;
it adds another hyperparameter;
it may not be universally beneficial.

So this is a stability tool, but not necessarily a free lunch.

Attention variants: MHA, MQA, and GQA
#

Standard multi-head attention uses separate query, key, and value heads.

Its memory cost becomes especially important during inference.

During text generation, the model generates one token at a time. It stores previous keys and values in a KV cache so it does not need to recompute them for every new token.

However, the KV cache can become large. Moving it in and out of memory can become a bottleneck.

Multi-Query Attention, MQA
#

MQA keeps multiple query heads, but uses only one shared set of key and value projections.

This reduces KV cache size and memory traffic during inference.

The trade-off is that it can hurt quality, because all query heads share the same key/value representation.

Grouped-Query Attention, GQA
#

GQA is a compromise between MHA and MQA. Instead of one shared K/V head for all query heads, groups of query heads share K/V heads.

For example, a model might have 32 query heads, but only 4 key/value heads. In this case, every group of 8 query heads shares one K/V set.

So the spectrum looks roughly like this:

MHA: every query head has its own key/value projections;
GQA: groups of query heads share the same key/value projections;
MQA: all query heads share a single global set of key/value projections.
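
The effect on KV-cache size is simple arithmetic, sketched below for a hypothetical model (32 layers, 32 query heads, head dimension 128, 16-bit values, 8192 cached tokens). None of these numbers describe a specific released model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Keys and values (factor 2) for every layer, KV head, and cached position."""
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical configuration shared by all three variants.
config = dict(num_layers=32, head_dim=128, seq_len=8192)

mha = kv_cache_bytes(num_kv_heads=32, **config)   # one K/V head per query head
gqa = kv_cache_bytes(num_kv_heads=4, **config)    # 8 query heads share each K/V head
mqa = kv_cache_bytes(num_kv_heads=1, **config)    # all query heads share one K/V head

for name, size in [("MHA", mha), ("GQA", gqa), ("MQA", mqa)]:
    print(f"{name}: {size / 2**30:.3f} GiB")
```

The cache scales linearly with the number of K/V heads, so going from 32 to 4 K/V heads cuts it by a factor of 8 per sequence.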

GQA gives a knob for balancing:

inference efficiency;
KV cache size;
model expressiveness;
quality.

The lecture’s conclusion is that MQA can introduce a small perplexity degradation, whereas GQA often preserves most of the quality of full multi-head attention and still significantly reduces KV-cache cost.

This is why GQA has become very common in production LLMs.

Sparse and sliding-window attention
#

Full attention is quadratic in sequence length:

$$ \text{cost} = \mathrm{O}(n^2) $$

For long contexts, this becomes expensive.

Sparse or sliding-window attention restricts which tokens can attend to which other tokens.

For example, in sliding-window attention, each token attends only to nearby tokens within a fixed window.

The trade-off:

full attention is more expressive but expensive;
local attention is cheaper but may miss long-range dependencies.

A modern compromise is to interleave local and full attention layers.

For example:

Layer 1: sliding-window attention
Layer 2: sliding-window attention
Layer 3: sliding-window attention
Layer 4: full attention
repeat

This allows most layers to be cheaper while occasional full-attention layers move global information across the sequence.
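
A small sketch of the masks involved and of the interleaving pattern listed above. The sequence length, window size, and the 3:1 ratio are illustrative assumptions, and the masks are causal, as in a decoder-only LLM.

```python
def causal_mask(n):
    """mask[i][j] is True when query position i may attend to key position j."""
    return [[j <= i for j in range(n)] for i in range(n)]

def sliding_window_mask(n, window):
    """Causal mask restricted to the most recent `window` positions (including i)."""
    return [[(j <= i) and (i - j < window) for j in range(n)] for i in range(n)]

def layer_pattern(num_layers, full_every=4):
    """Every `full_every`-th layer uses full attention; the rest use a sliding window."""
    return ["full" if (layer + 1) % full_every == 0 else "sliding"
            for layer in range(num_layers)]

n, window = 8, 3   # hypothetical sequence length and window size
full_pairs = sum(map(sum, causal_mask(n)))                   # grows like n^2
local_pairs = sum(map(sum, sliding_window_mask(n, window)))  # grows like n * window
pattern = layer_pattern(8)
```

Counting the allowed query/key pairs shows the saving: the full causal mask grows quadratically with sequence length, while the sliding-window mask grows only linearly for a fixed window.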

The lecture mentions this as an emerging standard trick in models such as Command A, LLaMA 4, Gemma 3/4, and OLMo 3.

A useful mental model:

local attention handles nearby syntax and local coherence;
occasional full attention handles global dependencies;
interleaving gives a practical cost/quality trade-off.

Final Takeaways
#

Modern dense LLM architectures are less chaotic than they look. Many successful models share a relatively stable recipe:

pre-norm rather than post-norm;
RMSNorm rather than LayerNorm;
no bias terms in many linear/normalisation layers;
SwiGLU or GeGLU rather than ReLU;
RoPE for position information;
$d_{\text{ff}} \approx 4 d_{\text{model}}$, or $\approx \frac{8}{3} d_{\text{model}}$ for GLU-style FFNs;
num_heads × head_dim ≈ d_model;
little or no dropout during pretraining;
weight decay mainly as an optimisation/stability tool;
GQA/MQA to reduce KV-cache cost during inference;
QK-norm, z-loss, or logit soft-capping for stability;
sparse/sliding-window attention when long context makes full attention too expensive.

The most important meta-lesson is this:

LLM architecture design is not only about mathematical expressiveness. It is also about optimisation stability, memory movement, inference latency, and hardware efficiency.

That is why small-looking choices such as RMSNorm, removing bias terms, choosing GQA, or applying QK-norm can matter. They may not change the high-level Transformer story, but they make the model easier to train, cheaper to serve, or more stable at scale.
