Key themes from the discussion
- Local LLM performance on consumer hardware (see the throughput sketch after this list):
  > "I have a 2070 and can confirm it works amazingly fast." – ganelonhb
  > "I get 15‑20 tokens/sec out of that model with zero finagling or tweaking." – macNchz
- Technical debate around the GGUF format and model architecture (see the header-reading sketch below):
  > "I regret that the projection models ended up separate, and I too would have preferred for them to be in a single file." – Philpax
  > "AFAIK[0] they are (usually) so‑called 'special' tokens" – badsectoracula
- Community sentiment on model publishing and quantizers (see the VRAM estimate below):
  > "TBF Unsloth are doing a fantastic job of publishing quants of the major models quickly – they just don't have nearly the volume of 'weird' models as TheBloke did." – bashbjorn
  > "7b mistral is quite outdated. On a 12 GB 4070 you can run qwen 3.5 9b q4km or qwen 3.6 35b, the latter will be a lot smarter but also a lot slower due to ram offload." – mixtureoftakes