(SPR) OSDI25 / 2502.04563 / WaferLLM

Original Repo; SPR Repo; arXiv.

Contribution of the Paper

  • The "Programming Model" of the Cerebras WSE device, PLMR:
    • P for Parallelism. The WSE has far more cores than a typical GPU.
    • L for Latency. Inter-core transfer latency is non-uniform. A simple model: the cores form a 2D grid, and the latency of transferring data between two cores is proportional to the Manhattan distance between them (and also depends on the routing path).
    • M for Memory. Each core has only a small local SRAM, so to use this memory efficiently, the partitioning of data across cores must be considered carefully.
    • R for Routing. The network-on-chip (NoC) has quite limited routing capability, which must be taken into account.
  • WaferLLM, which partitions or replicates the model weights along both the x-axis and y-axis of the WSE core grid, plus a KV-cache shifting mechanism that balances memory usage and compute overhead across cores.
  • Wafer-scale GEMM, which uses an improved "2-hop" shifting mechanism to reduce the critical-path length when implementing GEMM on the WSE.
  • Wafer-scale GEMV, which uses a K-tree algorithm to reduce the AllReduce overhead when implementing GEMV on the WSE.
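
As a toy illustration of the "L" in PLMR (not the paper's actual cost model; the constant `alpha` and the function names here are made up), latency proportional to Manhattan distance can be sketched as:

```python
# Toy model of the PLMR "L" property: cores form a 2D grid, and the
# cost of moving data between two cores grows with their Manhattan
# distance. `alpha` (per-hop latency) is an illustrative constant.

def manhattan_hops(src, dst):
    """Grid hops between cores `src` and `dst`, each an (x, y) pair."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def transfer_latency(src, dst, alpha=1.0):
    """Latency proportional to Manhattan distance (ignores routing-path effects)."""
    return alpha * manhattan_hops(src, dst)

# Neighboring cores are cheap to reach; opposite corners are expensive.
print(transfer_latency((0, 0), (0, 1)))  # 1.0
print(transfer_latency((0, 0), (9, 9)))  # 18.0
```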

Most Amazing Things

  • The simple yet distinctive hardware model of the WSE: a 2D grid of cores with non-uniform inter-core latency and limited routing capability, each core holding a small local SRAM.
  • The kernel optimizations for such special hardware. In the GEMM setting, the "shifting-interleaving" mechanism implements GEMM with a shorter critical path on a device with non-uniform latency. In the GEMV setting, the K-tree algorithm balances routing overhead against latency, and choosing \(K=2\) performs well: the latency is reduced from \(O((2\alpha + \beta)N)\) (Ring AllReduce) to \(O\left(\alpha N + \frac{K\sqrt[K]{N}}{2}\beta\right) = O\left(\alpha N + \sqrt{N}\,\beta\right)\).
  • The KV-cache shifting mechanism, which exploits the locality of the non-uniform latency (transfers between nearby cores are cheap).
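
The shift-based GEMM idea has a classic baseline in Cannon's algorithm, sketched below on a simulated p x p grid with one element per core. This is background only; WaferLLM's 2-hop interleaved shifting refines this pattern and is not reproduced here.

```python
# Cannon's algorithm on a simulated p x p core grid, one matrix element
# per core. After each local multiply-accumulate, A entries shift left
# along their row and B entries shift up along their column, so every
# transfer is between grid neighbors (cheap under a Manhattan-distance
# latency model).

def cannon_gemm(A, B):
    p = len(A)
    # Initial skew: row i of A rotates left by i; column j of B rotates up by j.
    a = [[A[i][(j + i) % p] for j in range(p)] for i in range(p)]
    b = [[B[(i + j) % p][j] for j in range(p)] for i in range(p)]
    c = [[0] * p for _ in range(p)]
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                c[i][j] += a[i][j] * b[i][j]  # local multiply on core (i, j)
        a = [[a[i][(j + 1) % p] for j in range(p)] for i in range(p)]  # shift left
        b = [[b[(i + 1) % p][j] for j in range(p)] for i in range(p)]  # shift up
    return c

print(cannon_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```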
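
Plugging illustrative numbers into the quoted latency bounds (symbols as above: \(\alpha\) per-hop latency, \(\beta\) bandwidth term, \(N\) cores; the values are arbitrary, chosen only to compare the two expressions):

```python
# Evaluate the quoted allreduce latency bounds with made-up constants.
# alpha = per-hop latency, beta = bandwidth term, N = number of cores.

def ring_allreduce_latency(N, alpha, beta):
    return (2 * alpha + beta) * N

def k_tree_latency(N, alpha, beta, K):
    return alpha * N + (N ** (1 / K)) * K / 2 * beta

N, alpha, beta = 1024, 1.0, 1.0
print(ring_allreduce_latency(N, alpha, beta))  # 3072.0
print(k_tree_latency(N, alpha, beta, K=2))     # 1056.0  (1024 + sqrt(1024))
```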

Reproduce Trail

  • The code repository is available; however, the code is written in CSL, Cerebras' domain-specific language.
  • Reading the code is of limited value, since the paper itself is well written.