I built a calculator to plot KV cache sizes for common open-source models. As we push massive context lengths into models, it's useful to remember the physics of the KV cache: its memory footprint grows linearly with context length. It's surprisingly easy for inference costs to become dominated by KV cache!
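The formula behind such a calculator is simple. A minimal sketch, where the config numbers are illustrative (a Llama-3-70B-like shape I picked for the example, not something from this post):

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = 2) -> int:
    """Per-request KV cache size: one K and one V tensor per layer,
    each of shape [seq_len, num_kv_heads, head_dim]."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative config (assumed, roughly Llama-3-70B-shaped):
# 80 layers, 8 KV heads, head_dim 128, BF16 (2 bytes/elem), 128k context.
print(kv_cache_bytes(80, 8, 128, 128_000) / 1e9)  # ~41.9 GB for a single request
```

Linear in seq_len, as advertised: a single long-context request can rival the weights themselves in memory.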
Some loose thoughts I had while playing with the tool:
MLA vs. GQA
DeepSeek's multi-head latent attention (MLA) is a cool technique for reducing KV storage. It mitigates the accuracy loss from more straightforward approaches like grouped-query attention (GQA), which simply reduces the number of KV heads and, according to DeepSeek, hurts accuracy. Yet gpt-oss somehow squeezes GQA into a small config (head_dim=64, 8 KV heads); its layers store 50% less information per token than DeepSeek's and don't seem to suffer for it. GQA is also easier to parallelize than MLA, since we can shard over the KV-heads dimension.
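To make the per-token storage concrete, here is a hedged sketch. The GQA formula is standard; the MLA one assumes DeepSeek-V3-style caching of a compressed KV latent plus a small decoupled RoPE key, and the latent_dim=512 / rope_dim=64 values are my reading of the published config, not something stated in this post:

```python
def gqa_cache_per_token_per_layer(n_kv_heads: int, head_dim: int,
                                  bytes_per_elem: int = 2) -> int:
    # GQA still caches full K and V vectors, just for fewer heads.
    return 2 * n_kv_heads * head_dim * bytes_per_elem

def mla_cache_per_token_per_layer(latent_dim: int, rope_dim: int,
                                  bytes_per_elem: int = 2) -> int:
    # MLA caches one compressed KV latent plus a shared RoPE key component;
    # full K and V are reconstructed by up-projection at attention time.
    return (latent_dim + rope_dim) * bytes_per_elem

gpt_oss_layer = gqa_cache_per_token_per_layer(n_kv_heads=8, head_dim=64)    # 2048 bytes
deepseek_layer = mla_cache_per_token_per_layer(latent_dim=512, rope_dim=64) # 1152 bytes
```

The per-model totals also depend on layer count, so per-layer numbers alone don't settle the comparison.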
With KV, more is different
Storing big KV caches, say >200k tokens, gets hairy. It sharply limits our ability to batch requests (since each worker dedicates so much of its memory to request-specific KV cache), and reading the cache can dominate the overall memory traffic of a forward pass. No wonder providers charge extra for tokens beyond this limit.
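A back-of-the-envelope sketch of why batching suffers; every number here is hypothetical, and the model ignores activations and allocator fragmentation:

```python
def max_batch_size(hbm_gb: float, weight_gb: float, kv_gb_per_request: float) -> int:
    # Whatever HBM the weights don't occupy is shared among
    # per-request KV caches; that leftover bounds the batch size.
    return int((hbm_gb - weight_gb) // kv_gb_per_request)

# Hypothetical: 80 GB GPU, 40 GB of (sharded) weights on this worker,
# and each 200k-token request needing ~10 GB of KV cache.
print(max_batch_size(80, 40, 10))  # only 4 concurrent requests
```

Halve the context and the same worker batches twice as many requests, which is exactly the pricing pressure the post describes.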
Information Density in LLMs
There has been great progress on compressing model weights, like the MXFP4 used by gpt-oss. Yet the KV cache is almost always left in BF16. Open models historically struggled to stay effective at long context; now that R1 and gpt-oss do, long-context optimizations may follow. MXFP8 looks promising as a drop-in replacement for BF16 in many cases.
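A sketch of the expected savings. As I understand the OCP MX formats, 8-bit elements share one 8-bit scale per block of 32 elements, so the effective footprint is a bit over half of BF16's:

```python
def mx_bits_per_elem(elem_bits: int = 8, scale_bits: int = 8,
                     block_size: int = 32) -> float:
    # MX formats amortize one shared scale across each block of elements.
    return elem_bits + scale_bits / block_size

bf16_bits = 16
print(mx_bits_per_elem() / bf16_bits)  # 0.515625 -> the KV cache shrinks by ~48%
```

For the 200k-token regimes above, nearly halving KV bytes roughly doubles how many requests fit per worker, which is why this feels like low-hanging fruit.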