(SPR) OSDI25 / 2502.04563 / WaferLLM

Original Repo; SPR Repo; arXiv.

Contribution of the Paper

  • The "Programming Model" of the Cerebras WSE device, PLMR:
    • P for Parallelism. The WSE has far more cores than a typical GPU.
    • L for Latency. Inter-core transfer latency is non-uniform. A simple model: the cores form a 2D grid, and the latency of transferring data between two cores is proportional to the Manhattan distance between them (and also depends on the routing path).
    • M for Memory. Each core has only a small local SRAM, so to use this memory efficiently, the partitioning of data across cores must be considered carefully.
    • R for Routing. The network-on-chip (NoC) has quite limited routing capability, which must be taken into account.
  • WaferLLM, which partitions or replicates the model weights along both the x-axis and y-axis of the WSE core grid, plus a KV-cache shifting mechanism that balances memory usage and compute overhead across cores.
  • Wafer-scale GEMM, which uses an improved "2-hop" shifting mechanism to reduce the critical-path length when implementing GEMM on the WSE.
  • Wafer-scale GEMV, which uses a K-tree algorithm to reduce the AllReduce overhead when implementing GEMV on the WSE.
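
As a toy illustration of the "L" in PLMR (not the paper's actual cost model; the constant `alpha` and the function names here are made up), latency proportional to Manhattan distance can be sketched as:

```python
# Toy model of the PLMR "L" property: cores form a 2D grid, and the
# cost of moving data between two cores grows with their Manhattan
# distance. `alpha` (per-hop latency) is an illustrative constant.

def manhattan_hops(src, dst):
    """Grid hops between cores `src` and `dst`, each an (x, y) pair."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def transfer_latency(src, dst, alpha=1.0):
    """Latency proportional to Manhattan distance (ignores routing-path effects)."""
    return alpha * manhattan_hops(src, dst)

# Neighboring cores are cheap to reach; opposite corners are expensive.
print(transfer_latency((0, 0), (0, 1)))  # 1.0
print(transfer_latency((0, 0), (9, 9)))  # 18.0
```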

Most Amazing Things

  • The simple yet distinctive hardware model of the WSE: a 2D grid of cores with non-uniform inter-core latency and limited routing capability, each core holding a small local SRAM.
  • The kernel optimizations for such special hardware. In the GEMM setting, the "shifting-interleaving" mechanism implements GEMM with a shorter critical path on a device with non-uniform latency. In the GEMV setting, the K-tree algorithm balances routing overhead against latency, and choosing \(K=2\) performs well: the latency is reduced from \(O((2\alpha + \beta)N)\) (Ring AllReduce) to \(O\left(\alpha N + \frac{K\sqrt[K]{N}}{2}\beta\right) = O\left(\alpha N + \sqrt{N}\,\beta\right)\).
  • The KV-cache shifting mechanism, which exploits the locality of the non-uniform latency (transfers between nearby cores are cheap).
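
The shift-based GEMM idea has a classic baseline in Cannon's algorithm, sketched below on a simulated p x p grid with one element per core. This is background only; WaferLLM's 2-hop interleaved shifting refines this pattern and is not reproduced here.

```python
# Cannon's algorithm on a simulated p x p core grid, one matrix element
# per core. After each local multiply-accumulate, A entries shift left
# along their row and B entries shift up along their column, so every
# transfer is between grid neighbors (cheap under a Manhattan-distance
# latency model).

def cannon_gemm(A, B):
    p = len(A)
    # Initial skew: row i of A rotates left by i; column j of B rotates up by j.
    a = [[A[i][(j + i) % p] for j in range(p)] for i in range(p)]
    b = [[B[(i + j) % p][j] for j in range(p)] for i in range(p)]
    c = [[0] * p for _ in range(p)]
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                c[i][j] += a[i][j] * b[i][j]  # local multiply on core (i, j)
        a = [[a[i][(j + 1) % p] for j in range(p)] for i in range(p)]  # shift left
        b = [[b[(i + 1) % p][j] for j in range(p)] for i in range(p)]  # shift up
    return c

print(cannon_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```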
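
Plugging illustrative numbers into the quoted latency bounds (symbols as above: \(\alpha\) per-hop latency, \(\beta\) bandwidth term, \(N\) cores; the values are arbitrary, chosen only to compare the two expressions):

```python
# Evaluate the quoted allreduce latency bounds with made-up constants.
# alpha = per-hop latency, beta = bandwidth term, N = number of cores.

def ring_allreduce_latency(N, alpha, beta):
    return (2 * alpha + beta) * N

def k_tree_latency(N, alpha, beta, K):
    return alpha * N + (N ** (1 / K)) * K / 2 * beta

N, alpha, beta = 1024, 1.0, 1.0
print(ring_allreduce_latency(N, alpha, beta))  # 3072.0
print(k_tree_latency(N, alpha, beta, K=2))     # 1056.0  (1024 + sqrt(1024))
```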

Reproduce Trail

  • The code repository is available; however, the code is written in CSL, Cerebras' domain-specific language.
  • Reading the code is of limited value, since the paper itself is well written.