1. Local LLMs run efficiently on Apple silicon with massive RAM
"I got this running on a 128GB M5 the other day – pretty painless, model runs in about 80GB of RAM and it seemed to be very capable at writing code and tool execution." – simonw
3. Advanced reasoning / tool‑calling in models like DeepSeek‑V4 Flash is impressive, but slow prefill makes large contexts painful in agentic use
"I don't want to be a jerk but 31t/s prefill is basically unusable in an agentic situation. A mere 10k in context and you're sitting there for 5+ minutes before the first token is generated." – xinence
3. Cost‑performance and the “good enough” local model threshold
"An M5 Max MBP with 128G of RAM costs ~$5k. An Nvidia RTX 5090 with 32G RAM is $4‑5k, and RTX PRO 6000 with 96GB RAM $10k. Do you have any data on which is the best price/performance for local inference?" – kamranjon