SGLang Meets DeepSeek
This post introduces SGLang, an inference system designed to optimize large language model (LLM) deployment. We focus on:
- SGLang’s workflow and its distribution setup (e.g., PP/TP/DP constraints);
- DeepSeek’s inference challenges on SGLang and how the DP Attention optimization addresses them.
SGLang Distribution Setup (w/o DP Attention)
Overview
The SGLang[^1] distribution setup can be summarized as follows:
- No PP support;
- DP > 1 is not supported across multiple nodes[^2]. In fact, DP will be deprecated in the future; SGLang recommends the SGLang Router (an orchestrator written in Rust) for DP instead;
- TP % nnodes == 0.
```python
# Simplified sketch of the corresponding check in SGLang's argument validation
# (the exact condition and error message differ in the real code):
assert (
    server_args.tp_size % server_args.nnodes == 0
), "tp_size must be divisible by nnodes"
```
Configurations
The following figures show the structure for different parallel configurations in SGLang.

[Figures: SGLang structures under different parallel configurations]
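As a rough illustration of one such layout (an assumed mapping, for illustration only; the actual placement is decided by SGLang's launcher), the TP % nnodes == 0 rule means a TP group is split evenly across the participating nodes:

```python
# Hypothetical sketch: spread a TP group evenly over the participating nodes
# (only valid when tp_size % nnodes == 0, as required above).
def tp_rank_layout(tp_size: int, nnodes: int):
    assert tp_size % nnodes == 0
    ranks_per_node = tp_size // nnodes
    return {node: list(range(node * ranks_per_node, (node + 1) * ranks_per_node))
            for node in range(nnodes)}

print(tp_rank_layout(16, 2))  # node 0 hosts TP ranks 0-7, node 1 hosts TP ranks 8-15
```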
DP Attention Optimization
TP in Llama 3
Before introducing the DP Attention optimization, let's first investigate how TP is implemented in Llama 3.
- `word_embedding`: vocab parallel;
- `positional_embedding`: replicate;
- `attn`: parallel on the head dim;
- `mlp`: parallel on column then row.
Llama 3 `attn` can be implemented roughly as follows (a simplified sketch: RoPE is omitted and plain MHA is used). Notice that every weight matrix is reduced by `tp` times, and the KV cache size is also reduced by `tp` times.
```python
def forward(self, hidden_states):
    # Illustrative sketch (RoPE omitted, plain MHA). Every projection is sharded
    # over the head dimension, so each TP rank holds num_heads / tp heads and
    # caches only those heads' K/V: weights and KV cache both shrink by tp times.
    qkv, _ = self.qkv_proj(hidden_states)  # column-parallel QKV projection
    q, k, v = qkv.split([self.q_size, self.kv_size, self.kv_size], dim=-1)
    attn_output = self.attn(q, k, v)       # attention over the local heads only
    output, _ = self.o_proj(attn_output)   # row-parallel; all-reduce inside
    return output
```
The MLP in Llama 3 is parallelized using column-then-row dispatching.
```python
self.gate_up_proj = MergedColumnParallelLinear(...)  # column-parallel (gate & up fused)
self.down_proj = RowParallelLinear(...)              # row-parallel (all-reduce on output)
```
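Continuing the module sketched above, here is a rough sketch of how the column-then-row order plays out at runtime: the column-parallel projection and the activation need no communication, and only the final row-parallel projection performs a single all-reduce (`act_fn` here denotes the fused SiLU-and-multiply activation; treat the exact names as illustrative, not SGLang's verbatim code):

```python
def forward(self, hidden_states):
    # Column-parallel: each rank computes only its slice of the [gate, up] activations.
    gate_up, _ = self.gate_up_proj(hidden_states)
    # Fused SiLU(gate) * up on the local slice; still no cross-rank communication.
    hidden = self.act_fn(gate_up)
    # Row-parallel: the partial outputs are summed with one all-reduce inside down_proj.
    output, _ = self.down_proj(hidden)
    return output
```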
Problems of TP in DeepSeek
However, the TP implementation in Llama 3 is not optimal for DeepSeek, because:

`mla_attn`:
- A latent of size `(1, kv_lora_rank + qk_rope_head_dim)` must be saved for each token. It cannot be split along the num_head dim, so TP cannot reduce the KV cache size;
- Some params (e.g. `q_a_proj`, `kv_a_proj_with_mqa`) cannot be parallelized, so they are duplicated on every rank.

```python
self.q_a_proj = ReplicatedLinear(...)
self.q_b_proj = ColumnParallelLinear(...)
self.kv_b_proj = ColumnParallelLinear(...)
self.o_proj = RowParallelLinear(...)
self.kv_a_proj_with_mqa = ReplicatedLinear(...)
```

`moe_mlp`:
- Expert parallelism (EP) is better than TP, since each expert is small.
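To make the KV-cache point concrete, here is a back-of-the-envelope comparison under TP = 8. The dimensions are DeepSeek-V3-like values used purely for illustration, counted in elements per token per layer:

```python
# Per-token, per-layer KV-cache element counts (illustrative DeepSeek-V3-like dims).
num_heads = 128
qk_head_dim = 128 + 64           # nope part + rope part
v_head_dim = 128
kv_lora_rank, qk_rope_head_dim = 512, 64
tp = 8

mha_cache = num_heads * (qk_head_dim + v_head_dim)   # full K/V: 40960 elements
mla_cache = kv_lora_rank + qk_rope_head_dim          # MLA latent:  576 elements

print(mha_cache // tp)  # 5120 -> an MHA-style cache shrinks by tp on each rank
print(mla_cache)        #  576 -> the MLA latent is replicated on every TP rank
```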
SGLang DP Attention for DeepSeek V3
DP Attention aims to solve the above problems[^3] in `mla_attn`. The parallel policy becomes:
- `word_embedding`: replicate;
- `positional_embedding`: replicate;
- `attn`: parallel on the batch dimension, i.e., each attention group handles an independent subset of requests (see the sketch below).

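A minimal sketch of the payoff, as my own illustration rather than SGLang's implementation: with DP Attention, each attention rank serves only its slice of the requests, so it stores the MLA latent only for those requests instead of for the whole batch.

```python
# Toy comparison (illustrative numbers): per-rank MLA cache elements for one batch.
batch_tokens = 8 * 4096          # e.g. 8 requests with 4096 cached tokens each
latent_per_token = 512 + 64      # kv_lora_rank + qk_rope_head_dim
attn_dp = 8

# Plain TP attention: the latent cannot be sharded, so every rank caches the full batch.
tp_cache_per_rank = batch_tokens * latent_per_token

# DP attention: each rank serves only batch / AttnDP requests and caches only those.
dp_cache_per_rank = (batch_tokens // attn_dp) * latent_per_token

print(tp_cache_per_rank // dp_cache_per_rank)  # 8 -> 8x smaller KV cache per rank
```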
Its configuration requirement is:
- 1 < AttnDP ≤ TP and TP % AttnDP == 0. This is because SGLang supports DP + TP attention.
```python
# Simplified sketch of the check (paraphrased; names follow this post's "AttnDP"):
assert (
    1 < attn_dp_size <= tp_size and tp_size % attn_dp_size == 0
), "AttnDP must be in (1, TP] and divide TP"
```
Some concrete configurations are shown below:
[Figures: concrete DP Attention configurations]
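As a worked example of such a configuration (the contiguous rank grouping below is my reading of the DP + TP attention split and should be treated as an assumption): with TP = 16 and AttnDP = 2, attention runs as 2 groups that each use TP = 8 internally, while the rest of the model still uses TP = 16 across all ranks; with TP = 8 and AttnDP = 8, attention becomes pure DP.

```python
# Hypothetical helper: derive the per-group attention TP size and the rank groups
# from (TP, AttnDP), under the constraint stated above.
def attn_groups(tp_size: int, attn_dp_size: int):
    assert 1 < attn_dp_size <= tp_size and tp_size % attn_dp_size == 0
    attn_tp_size = tp_size // attn_dp_size
    return [list(range(g * attn_tp_size, (g + 1) * attn_tp_size))
            for g in range(attn_dp_size)]

print(attn_groups(16, 2))  # ranks 0-7 and ranks 8-15: two groups doing TP-8 attention
print(attn_groups(8, 8))   # eight single-rank groups: pure DP attention
```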
[^1]: https://github.com/sgl-project/sglang, at tag v0.4.4_post4.

[^2]: When using DP Attention, "DP" has another meaning in SGLang. For clarity, we write "AttnDP" instead of "DP" when DP Attention is enabled (in fact, the canonical "DP" is always 1 in that situation). When we write "DP", we mean the canonical "DP", with DP Attention disabled.