Top 4 Themes from the discussion
| Theme | Key Insight | Representative Quote |
|---|---|---|
| Local inference on vintage hardware | A 26 B Gemma‑4 model can be run on a recycled Xeon E5‑2620 v4 with DDR3 and 128 GB RAM, delivering usable speeds despite the age of the platform. | “Gives: … 11.94 tokens per second while it’s also being a binary cache and CI builder” – cafkafk |
| Speed expectations & practical limits | Community members flag the modest throughput (≈12 tps) as insufficient for interactive workloads, though it can handle batch‑oriented tasks. | “20 tokens per second for eval time is the killer here. It means you can’t use this to process any meaningful amount of text.” – ekianjo |
| Speculative decoding & optimizer knobs | Techniques such as MTP drafts, `--cpu-moe`, and appropriate thread counts unlock extra performance, but require careful tuning and realistic prompt sizes. | “From the prompt timings above, it seems like ‘prompt eval time’ is the equivalent to ‘processing time for input tokens’.” – Majromax |
| Energy, cost, and practicality concerns | Running old servers consumes noticeable power, making the economics questionable compared to cloud subscriptions, yet they remain attractive for hobbyist or low‑budget deployments. | “If you’ve got something consuming 100 watts average over your 24‑hour period, and your electricity costs 20 cents per kWh, you’re already spending almost as much as a Claude subscription.” – dangus |

All quotations are presented verbatim (HTML entities fixed) and attributed to the original HN users.
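
The arithmetic behind the energy and throughput figures quoted in the table can be checked with a minimal sketch. The function names and the 30-day month are assumptions made for illustration; the 100 W load, the 20 cents per kWh price, and the 11.94 tokens per second rate come from the quotes. The subscription price is not stated in the thread, so it is left out of the calculation.

```python
def monthly_energy_cost(avg_watts: float, price_per_kwh: float, days: float = 30.0) -> float:
    """Cost of a constant average load over `days` days (30-day month is an assumption)."""
    kwh = avg_watts / 1000.0 * 24.0 * days  # watts -> kW, times hours in the period
    return kwh * price_per_kwh


def generation_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Wall-clock time to generate `num_tokens` at a steady decode rate."""
    return num_tokens / tokens_per_second


# Figures quoted in the thread: 100 W average draw at $0.20 per kWh.
print(f"Monthly electricity: ${monthly_energy_cost(100, 0.20):.2f}")

# Reported decode speed of 11.94 tokens per second, for a 1000-token reply.
print(f"1000 tokens take about {generation_seconds(1000, 11.94):.0f} s")
```

Under these assumptions the server costs about $14.40 per month in electricity, and a 1000-token response takes roughly 84 seconds, which matches the thread's view that the setup suits batch work better than interactive use.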