1. Predictive delta compression enables larger effective context
Storing only the differences (deltas) between the real KV cache and a small model's prediction of it means VRAM holds a small delta buffer instead of the full cache. The full values are rebuilt on demand by running the predictor again and applying the stored delta, which makes it possible to fit huge contexts (e.g., 200-250k tokens) on a 24 GB GPU alongside a 27B model.
"Tiny deterministic model predicts the K/V cache, prediction is compared with reality, delta is stored in vram. … predicts the values again, applies the delta, and you have the full correct value while just storing the delta" — porridgeraisin
"The tradeoff gets better the bigger your primary model … The KV cache can consume a lot of expensive VRAM" — wongarsu
2. Practical limits and skepticism about compression benefits
Although the idea sounds attractive, recomputing the entire KV cache at each step still takes time quadratic in context length, and that cost often outweighs the VRAM savings. The approach therefore looks worthwhile only for very large models or high-throughput serving.
"You can use the original model to compress the kv cache and get ∞x compression, since the prediction is perfect. The cost is time, and I don't see how this could be worth it." — 0-_-0
"Even recomputing a 'draft' of the KV cache is still quadratic in context length… that's even worse." — zozbot234
3. Viewing speculation as a first‑class model primitive
The discussion shifts from viewing speculation as a one-off hack to treating it as a fundamental inference primitive that could be applied recursively, distilled, or aligned architecturally for even greater efficiency.
"If “speculative” approach works so well in different contexts why not make it first class and use everywhere, possibly recursively?" — mirekrusin
"Speculation is only worth it if you can profit from it. Not every context allows this or has a similar idea of what can be speculated." — saagarjha