Top 4 Themes from the discussion
| Theme | Summary | Illustrative quote |
|---|---|---|
| 1. Google’s limited cloud push for Gemma 4 | Many users are confused why Google isn’t offering a paid, hosted inference service for Gemma 4 despite its strong open‑source release. | “If you were to believe a lot of metrics Gemma 31B it’s much better than flash lite… I should be able to pay Google to use it and that should be at least a secretary, called action how I can do that but it’s missing from both the blog post entirely.” – mchusma |
| 2. Pricing & cost‑effectiveness of hosting small models | Hosting even a 27B‑parameter model can be expensive, and some argue it may not be worth Google’s effort to provide a commercial inference stack. | “If it helps, I mean it in a really literal sense… qwen3.6 27b is currently $3.20 per million tokens on openrouter … there’s no reason to spend money on it.” – Farmadupe |
| 3. Multi‑Token Prediction (MTP) / speculative decoding speed gains | The new “assistant” draft models enable dramatic throughput improvements, often 2‑3× faster with little quality loss. | “Multi token prediction is the same thing as speculative decoding… these small assistant models are specialised for the task and are much faster than general‑purpose drafts.” – adrian_b |
| 4. Model trade‑offs: Qwen vs. Gemma for tool‑calling & speed | Qwen still leads on tool‑calling reliability, while Gemma 4 offers higher speed on suitable hardware, but each has its niche. | “I find qwen unbeatable for tool‑calling… the speed was one of the reasons why I was running qwen and not gemma.” – disiplus |
All quotations are presented exactly as posted, with HTML entities corrected and attribution preserved.
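
For readers unfamiliar with the draft-and-verify idea behind theme 3, here is a minimal sketch of greedy speculative decoding. It is a toy under stated assumptions, not how Gemma 4 or any particular runtime implements it: `toy_target` and `toy_draft` are hypothetical stand-ins for a large model and a small assistant model, and each "model" is just a function from a token list to the next token.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]  # hypothetical "model": context -> next token


def greedy_decode(model: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(model(out))
    return out


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[int],
                       max_new: int, k: int = 4) -> Tuple[List[int], int]:
    """Greedy draft-and-verify. Returns (tokens, number of verification rounds)."""
    out = list(prompt)
    goal = len(prompt) + max_new
    rounds = 0
    while len(out) < goal:
        # 1. The cheap draft model proposes k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks the proposal. A real runtime scores all k+1
        #    positions in ONE batched forward pass; this toy calls the
        #    target per position but counts the whole check as one round.
        rounds += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok != expected:
                accepted.append(expected)  # first mismatch: keep the target's token
                break
            accepted.append(tok)
        else:
            accepted.append(target(out + accepted))  # all matched: free bonus token
        out.extend(accepted)
    return out[:goal], rounds


# Toy models (assumptions for illustration only).
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] * 3 + 1) % 17


def toy_draft(tokens: List[int]) -> int:
    guess = toy_target(tokens)
    # The draft is deliberately wrong whenever the last token is a multiple of 5.
    return (guess + 1) % 17 if tokens[-1] % 5 == 0 else guess


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_target, toy_draft, [1], max_new=16)
    print(tokens == greedy_decode(toy_target, [1], 16), rounds)
```

In this greedy variant every emitted token is one the target itself would have chosen, so the output matches target-only decoding; the speedup comes from needing fewer target rounds than tokens when the draft guesses well. Sampling-based variants use a probabilistic accept/reject rule instead of the exact-match check shown here.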