メインコンテンツへスキップ

CS336 Lecture 4 - Attention Alternatives and Mixture of Experts

·3515 文字·17 分

Please read along with the original slides

This lecture discusses two major directions for making large language models more efficient and scalable:

  1. Alternatives to standard attention, especially for long-context modelling;
  2. Mixture of Experts (MoE), which increases model capacity without increasing the activated compute proportionally.

The high-level motivation is simple: standard Transformer attention becomes expensive when the context length grows, while dense feed-forward layers become expensive when we want larger model capacity. Attention alternatives try to reduce the cost of sequence modelling, while MoE tries to decouple total parameter count from per-token FLOPs.


Part I: Attention Alternatives
#

Why do we need attention alternatives?
#

In a standard Transformer, self-attention has quadratic cost with respect to sequence length. If the sequence length is \(n\), the attention score matrix has size \(n \times n\). This becomes increasingly expensive for long-context models.

A basic engineering toolkit already exists:

  • local attention
  • global attention
  • sliding-window attention
  • FlashAttention and other systems-level optimisations
  • hybrid attention layouts

These methods are useful, but they are still relatively conservative. The lecture then moves to more radical alternatives such as linear attention, recurrent attention forms, Mamba-like state-space models, gated delta networks, and sparse attention.

Linear attention
#

Standard attention can be written as:

$$ \mathrm{Attn}(Q, K, V) = \rho(QK^\top)V $$

where:

$$ Q \in \mathbb{R}^{n \times d_k}, \quad K \in \mathbb{R}^{n \times d_k}, \quad V \in \mathbb{R}^{n \times d_v} $$

The expensive part is the computation of:

$$ QK^\top $$

which costs roughly:

$$ O(n^2 d_k) $$

and produces an \(n \times n\) matrix.

If we temporarily ignore the softmax and assume \(\rho\) is the identity function, we can reorder the computation:

$$ (QK^\top)V = Q(K^\top V) $$

This changes the computation pattern. Instead of first forming an \(n \times n\) attention matrix, we first compute:

$$ K^\top V $$

whose shape is:

$$ d_k \times d_v $$

Then we multiply \(Q\) by this smaller matrix.

The cost changes from something like:

$$ n^2 d_k + n^2 d_v $$

to:

$$ 2 n d_k d_v $$

So the complexity becomes linear in sequence length \(n\), assuming \(d_k\) and \(d_v\) are fixed.

Of course, this is not the same as standard softmax attention. The difficult part is not just the matrix reordering; the real problem is how to approximate or replace the softmax attention behaviour while keeping the computation linear.

Recurrent form of linear attention
#

Linear attention has another useful property: it can be written in a recurrent form.

Starting from:

$$ (QK^\top)V = Q(K^\top V) $$

we can define a recurrent state:

$$ S_t = S_{t-1} + k_t v_t^\top $$

and compute the output as:

$$ y_t = q_t^\top S_t $$

Here:

  • \(S_t\) is a compressed memory state;
  • each new token updates the state using \(k_t v_t^\top\);
  • the current query \(q_t\) reads from that state.

This makes linear attention look like an RNN. The model maintains a state and updates it token by token.

The nice part is the duality:

  • during training, we can use a parallel form;
  • during inference, we can use a recurrent form.

This is attractive because autoregressive inference is already sequential, so a recurrent update can be very efficient.

From linear attention to Mamba-2
#

Mamba-2 can be viewed as a generalisation of linear attention with additional gating.

Linear attention has:

$$ S_t = S_{t-1} + k_t v_t^\top $$

$$ y_t = q_t^\top S_t $$

Mamba-2 introduces a position-dependent decay or gate:

$$ S_t = \gamma_t S_{t-1} + k_t v_t^\top $$

$$ y_t = q_t^\top S_t + v_t^\top D $$

where:

$$ \gamma_t = f(x_t) $$

The gate \(\gamma_t\) allows the model to decide how much previous state should be kept. This makes the model more expressive than plain linear attention.

This is also why the lecture says “gating is good”. A simple additive state update is limited; a gated update gives the model a way to forget, preserve, or modulate information.

Gated Delta Net
#

Gated Delta Net generalises this idea further. It not only gates the previous state, but also selectively erases information from the state.

A simplified form is:

$$ S_t = \gamma_t (I - \beta_t k_t k_t^\top) S_{t-1} + \beta_t k_t v_t^\top $$

$$ y_t = q_t^\top S_t $$

where:

$$ \gamma_t = f(x_t), \quad \beta_t = f(x_t) $$

There are two key mechanisms here.

First, \(\beta_t\) controls whether the current input should be written into the state. If:

$$ \beta_t = 0 $$

then the model performs a “no input” operation.

Second, the term:

$$ I - \beta_t k_t k_t^\top $$

can erase information in the direction of the current key. This is more flexible than merely adding new information into the state.

This connects to ideas such as fast weight programming and test-time training, where model states or weights are dynamically updated based on the current input.

Hybrid architectures
#

A major theme in the lecture is that pure attention alternatives are not always used alone. Many recent models use hybrid architectures.

Examples mentioned include:

  • Minimax M1 / minimax-text-01;
  • Nemotron 3;
  • Qwen 3.5 / Qwen Next;
  • Mamba-attention hybrids;
  • Gated Delta Net / attention hybrids.

The common pattern is to use attention only in some layers, and replace the rest with more efficient recurrent or linear-time modules.

For example, Minimax M1 uses a 7-to-1 hybrid structure: seven linear attention layers and one full attention layer. Nemotron 3 uses a Mamba-attention hybrid. Qwen Next uses a Gated Delta Net / attention hybrid.

The reason hybrid designs are attractive is that full attention is still powerful. It provides direct token-token interaction, which is difficult to fully replace. But full attention is expensive, especially for long context. Hybrid models try to keep enough full attention for quality while using cheaper alternatives for scalability.

Sparse attention as another alternative
#

Another direction is sparse attention.

Instead of replacing attention with a recurrent or linear module, sparse attention keeps the attention mechanism but attends only to a subset of tokens.

The lecture discusses DeepSeek Sparse Attention (DSA). The idea is to use a lightweight indexer that selects which tokens should be attended to. This can reduce the attention cost while preserving access to important context.

A key advantage is that sparse attention can sometimes be adapted after dense short-context pretraining. That means we may not need to train an entirely new architecture from scratch.

Compared with linear attention, sparse attention is less radical. It still performs token-token attention, but only over selected tokens.


Part II: Mixture of Experts
#

What is a Mixture of Experts model?
#

A Mixture of Experts model replaces a dense feed-forward network with multiple expert feed-forward networks and a routing mechanism.

In a normal dense Transformer block, every token passes through the same MLP. In an MoE block, each token is routed to only a small number of experts.

A simplified MoE layer looks like this:

$$ h_t = \sum_{i=1}^{N} g_{i,t} \mathrm{FFN}_i(u_t) + u_t $$

where:

  • \(N\) is the number of experts;
  • \(\mathrm{FFN}_i\) is the \(i\)-th expert;
  • \(g_{i,t}\) is the gate value for token \(t\) and expert \(i\);
  • \(u_t\) is the token representation.

Usually, most \(g_{i,t}\) values are zero. Only the selected experts are activated.

The key idea is that: MoE increases the total number of parameters while keeping the activated FLOPs per token relatively small.

This is why MoE is often described as sparse activation.

Why are MoEs popular? #

MoE models are becoming popular for several reasons.

First, at the same FLOP budget, having more total parameters can improve performance. MoE allows the model to have a large number of parameters, but only activate a small subset for each token.

Second, MoEs can train faster to reach a given quality level. Some results show that MoEs achieve similar or better performance than dense models with less training compute.

Third, MoEs are naturally compatible with expert parallelism. Since different experts can be placed on different devices, MoE creates another dimension of parallelism beyond data parallelism, tensor parallelism, and pipeline parallelism.

Fourth, many high-performing open models use MoE or MoE-like designs. The lecture mentions models such as Mixtral, DBRX, Grok, Qwen MoE, DeepSeek MoE, DeepSeek V3, and Llama 4.

The practical motivation is, dense models scale compute and parameters together, while MoE partially separates them.

Why were MoEs not always popular? #

MoEs are powerful, but they are also harder to train and deploy.

The lecture highlights several issues:

  1. infrastructure complexity
  2. multi-node communication overhead
  3. routing instability
  4. heuristic training objectives
  5. load balancing problems
  6. fine-tuning overfitting
  7. additional stochasticity during serving

MoE is not just a modelling trick. It is also a systems problem.

A dense MLP is easy to execute: every token goes through the same computation. An MoE layer requires routing, dispatching tokens to experts, executing expert computation, and then combining outputs. If experts are distributed across devices, this introduces communication overhead.

So MoE only becomes truly attractive when the infrastructure can handle sparse dispatch efficiently.

What usually varies in MoE design?
#

The lecture gives three main axes of MoE design:

  1. routing function
  2. expert sizes
  3. training objectives

Most modern MoEs replace the MLP layer with an MoE layer. Less commonly, some models apply MoE to attention heads.

The most important part is usually the router.

Routing function overview
#

Routing decides which expert handles each token.

Many routing algorithms can be reduced to “choose top-\(k\)”. The lecture mentions three broad styles:

  1. token chooses expert;
  2. expert chooses token;
  3. global routing via optimisation.

The most common approach is token-choice top-\(k\) routing.

In token-choice routing, each token computes scores for all experts, then selects the top \(k\) experts.

In expert-choice routing, each expert chooses tokens.

In global routing, the model solves a matching or assignment problem.

Most practical MoE systems use token-choice routing because it is simple and scalable.

Top-\(k\) routing in detail
#

A router computes a score between token \(t\) and expert \(i\). In many models, this score comes from a simple learned projection.

A common form is:

$$ s_{i,t} = \mathrm{Softmax}(u_t^\top e_i) $$

where:

  • \(u_t\) is the token representation;
  • \(e_i\) is a learned expert embedding or router weight;
  • \(s_{i,t}\) is the routing score.

Then the gate is:

$$ g_{i,t} = \begin{cases} s_{i,t}, & s_{i,t} \in \mathrm{TopK}({s_{j,t} \mid 1 \leq j \leq N}, K) \\ 0, & \text{otherwise} \end{cases} $$

This means only the selected experts receive the token.

In practice, different models vary in the details:

  • Switch Transformer uses \(k=1\);
  • GShard uses \(k=2\);
  • Mixtral uses \(k=2\);
  • Grok uses \(k=2\);
  • Qwen uses \(k=4\);
  • DBRX uses \(k=4\);
  • DeepSeek uses larger \(k\), such as 6, 7, or 8 depending on the version.

Some models apply softmax before selecting top-\(k\). Others select top-\(k\) first and then renormalise the selected scores.

Routing is similar to a small classifier
#

Conceptually, the router is like a small classifier or logistic regressor placed before the experts.

For each token, it predicts which experts are most suitable. The experts are then applied sparsely.

This is why MoE can feel like putting a small “expert selection” network before the FFN.

However, it is not the same as a normal dense layer. A dense layer would mix all experts for every token, which would destroy the compute advantage. MoE only activates a small subset.

So the router is not just adding capacity; it is deciding which capacity to activate.

Other routing methods
#

The lecture also mentions less common routing methods.

One is reinforcement learning. Early MoE work sometimes used RL to learn routing policies. In principle, this makes sense because routing involves discrete decisions. However, RL methods such as REINFORCE have high gradient variance and add complexity. They work, but not clearly enough to dominate practical systems.

Another method is linear assignment. This formulates routing as an optimisation or matching problem. It can produce more globally balanced routing decisions, but it is more complex and less commonly used in large-scale practical MoEs.

Fine-grained experts and shared experts
#

Recent MoE models often use many smaller experts instead of fewer large experts.

DeepSeek-style models use fine-grained expert segmentation. The idea is to split experts into smaller pieces and activate more of them.

Some models also use shared experts. A shared expert is always active, while routed experts are selected by the router.

The motivation is:

  • routed experts provide specialisation;
  • shared experts provide common capacity;
  • fine-grained experts allow more flexible combinations.

DeepSeek and Qwen use shared experts and fine-grained routed experts. OlMoE ablations show gains from fine-grained experts, but not necessarily from shared experts. So this design choice is useful, but its value may depend on model scale, training setup, and implementation details.

Training MoEs: the core difficulty
#

The main training problem is, sparse routing decisions are not differentiable.

We want sparsity because it gives efficiency. But discrete top-\(k\) selection creates optimisation difficulties.

The lecture lists three solutions:

  1. reinforcement learning;
  2. stochastic perturbations;
  3. heuristic balancing losses.

In practice, modern MoEs mostly use heuristic balancing losses, sometimes combined with noise or other stabilisation tricks.

Stochastic routing approximations
#

Early MoE work added stochasticity to routing.

For example, Shazeer et al. used Gaussian perturbations. The router adds noise before selecting experts. This has two effects:

  1. experts become less brittle;
  2. the model learns a ranking over experts rather than relying on deterministic hard choices too early.

Switch Transformer also used stochastic jitter, which applies a uniform multiplicative perturbation to the router input or logits. The goal is similar: encourage exploration and reduce expert collapse.

However, later work removed some of these tricks when they were not consistently helpful.

Load balancing losses
#

A major issue in MoE training is expert imbalance.

If the router sends too many tokens to a small number of experts, those experts become overloaded while others are underused. This is bad for both learning and systems efficiency.

Switch Transformer uses an auxiliary load balancing loss. A simplified version is:

$$ L_{\mathrm{aux}} = \alpha N \sum_{i=1}^{N} f_i P_i $$

where:

  • \(N\) is the number of experts;
  • \(f_i\) is the fraction of tokens routed to expert \(i\);
  • \(P_i\) is the fraction of router probability assigned to expert \(i\);
  • \(\alpha\) controls the strength of the auxiliary loss.

The intuition is that, if an expert is used too frequently, the balancing loss pushes the router away from it.

This prevents expert collapse and improves hardware utilisation.

Per-expert and per-device balancing
#

DeepSeek v1/v2 uses balancing objectives at different levels.

Per-expert balancing is similar to Switch Transformer. It encourages tokens to be distributed evenly across experts.

Per-device balancing aggregates experts by device. This matters because in distributed MoE training, imbalance is not only about experts. It is also about devices.

If one GPU receives too many tokens because it hosts popular experts, the whole system slows down.

So the balancing objective must consider:

  • expert-level load;
  • device-level load;
  • communication load.

This is where MoE becomes deeply connected to systems design.

DeepSeek V3: auxiliary-loss-free balancing
#

DeepSeek V3 introduces a variation based on per-expert biases.

The router score is adjusted with a learned or dynamically updated bias:

$$ s’{i,t} = s{i,t} + b_i $$

Then top-\(k\) routing is performed using the biased score.

The bias makes underused experts more likely to receive tokens and overused experts less likely to receive tokens. DeepSeek calls this “auxiliary-loss-free balancing”.

However, the lecture notes that this is not fully auxiliary-loss-free, because DeepSeek V3 still uses a complementary sequence-wise auxiliary loss.

So the better interpretation is, DeepSeek V3 reduces reliance on the traditional expert-level auxiliary loss, but does not remove balancing objectives completely.

What happens without load balancing?
#

If load balancing is removed, routing can become highly imbalanced.

Some experts may receive almost all tokens, while others receive very few or none. This harms both quality and efficiency.

The lecture shows that removing load balancing can cause unstable expert usage patterns. The training loss might still go down, but the model becomes less efficient and less robust.

Systems side of MoE training
#

MoE enables expert parallelism.

In a dense model, the MLP is replicated or partitioned in relatively standard ways. In an MoE model, different experts can live on different devices. Tokens are routed across devices to the experts they select.

This creates new forms of parallelism:

  • data parallelism;
  • tensor/model parallelism;
  • expert parallelism;
  • combinations of expert, model, and data parallelism.

However, this also creates communication overhead. Token dispatch and combine operations can become bottlenecks.

Modern MoE systems use optimised sparse matrix multiplication and token dispatch. The lecture mentions MegaBlocks as an example of a library that uses smarter sparse matrix multiplication for MoE execution.

Whether this is beneficial depends on hardware, batch size, routing balance, implementation quality, and interconnect bandwidth.

MoE and communication reduction
#

Some recent architectures modify the MoE design to reduce communication.

Nemotron 3 introduces a design where activations are down-projected before expert dispatch. The idea is to reduce the amount of data that must be communicated between devices.

This reflects an important systems principle: in distributed MoE, routing quality is not enough; communication volume also matters.

If the model sends high-dimensional activations across devices for every routed token, communication can dominate. Reducing activation size before dispatch is one way to improve efficiency.

Stochasticity of MoE models during serving
#

MoE models can have additional stochasticity beyond ordinary dense models.

One reason is token dropping. If an expert has limited capacity, and too many tokens are routed to it in the same batch, some tokens may be dropped or rerouted.

This can happen at the batch level. Therefore, another user’s query in the same batch can affect whether your token gets processed by its selected expert.

This is not typical for dense models, where each request is usually independent except for low-level numerical nondeterminism.

Router stability and z-loss
#

MoE routers can be numerically unstable.

The router often uses softmax over expert logits. Small numerical differences in logits can lead to different expert choices, especially when logits are close. This is dangerous because routing is discrete.

One solution is to compute the router in float32, even if the rest of the model uses lower precision.

Another stabilisation method is router z-loss. A simplified form is:

$$ L_z = \frac{1}{B} \sum_{i=1}^{B} \left( \log \sum_{j=1}^{N} e^{x_{ij}} \right)^2 $$

The z-loss penalises large router logits and helps keep routing numerically stable.

The intuition is that, router logits should not grow too large, because overconfident routing can become unstable.

MoE fine-tuning issues
#

Sparse MoEs can overfit during fine-tuning, especially when the fine-tuning dataset is small.

One reason is that only a subset of experts receives updates for a given batch. Some experts may be undertrained or over-specialised.

The lecture mentions two approaches:

  • Zoph et al.: fine-tune non-MoE MLPs;
  • DeepSeek: use a large amount of supervised fine-tuning data, such as 1.4M SFT examples.

Upcycling dense models into MoE models
#

Upcycling asks:

Can we initialise an MoE model from a pretrained dense language model?

Instead of training an MoE from scratch, we copy or transform parts of a dense model into multiple experts.

For example, MiniCPM-MoE uses a MiniCPM base model and turns it into an MoE with top-\(k=2\), 8 experts, and around 4B active parameters.

Qwen MoE is another example. It is initialised from Qwen 1.8B and uses top-\(k=4\), 60 experts, and 4 shared experts.

The motivation is obvious:

  • dense pretraining is expensive;
  • MoE training is complex;
  • upcycling may reuse existing dense models and reduce cost.

The key question is whether copied experts can specialise after continued training.

DeepSeek MoE evolution
#

The lecture ends by walking through DeepSeek MoE architectures.

DeepSeek MoE V1
#

DeepSeek MoE V1 uses:

  • 16B total parameters;
  • 2.8B active parameters;
  • standard top-\(k\) routing;
  • 2 shared experts;
  • fine-grained experts;
  • standard auxiliary-loss balancing at expert and device levels.

DeepSeek MoE V2
#

DeepSeek MoE V2 scales up to:

  • 236B total parameters;
  • 21B active parameters;
  • 2 shared experts;
  • 160 fine-grained experts;
  • 6 active routed experts;
  • top-\(M\) device routing;
  • communication balancing loss.

The additional device routing and communication balancing reflect the fact that MoE at this scale is heavily constrained by distributed systems efficiency.

DeepSeek MoE V3
#

DeepSeek V3 scales further to:

  • 671B total parameters;
  • 37B active parameters;
  • 1 shared expert;
  • 258 fine-grained experts;
  • 8 active routed experts;
  • sigmoid + softmax top-\(k\) routing;
  • top-\(M\) device routing;
  • auxiliary-loss-free style balancing;
  • complementary sequence-wise auxiliary loss.

The architecture is not just about adding experts. It also adds mechanisms for balancing, communication control, and routing stability.


My understanding
#

The lecture connects two ideas that initially look separate: attention alternatives and MoE.

Attention alternatives deal with the sequence-length problem. And MoE deals with the model-capacity problem. Both directions use sparsity or structured computation:

  • sparse attention selects tokens;
  • linear attention compresses history into a state;
  • Mamba-like models use gated recurrent states;
  • MoE selects experts.

The common theme is that, modern LLM scaling is no longer just “make everything dense and bigger”. It is increasingly about choosing which computation should be activated.

This also explains why systems engineering becomes central. Sparse or conditional computation sounds efficient in theory, but the real benefit depends on implementation. Routing, communication, batching, memory layout, and numerical stability all matter.

For MoE in particular, the model is only half of the story. The other half is whether the system can route and execute tokens efficiently.