1. Localization/ Translation UX Friction
"I don't understand these websites which force translation to my native language... where is the button for disabling it?" – w4yai > "‘codage de pointe’ sounds so weird and cringe in French." – w4yai
2. Benchmark Contamination & Moving Goalposts
"This feels very much like 'we are now moving the goal posts'." – 1a527dd5
"We also found evidence that models that have seen the problems during training are more likely to succeed, because they have additional information needed to pass the underspecified tests." – embedding‑shape
3. Benchmark Saturation & Goodhart’s Law
"Benchmarks essentially aren't, for practical concerns anyways." – embedding‑shape
"Is this saying a quarter of the questions and answers were wrong, this whole time?!" – vintagedave*
4. Push for Private / Custom Evaluations
"Spend a hour or an afternoon creating your own eval harness with problems or workloads from your private repos or personal projects." – dannyw
"I made Zork bench… it’s deterministic. Therefore it should be easy for an LLM to play and complete. Yet they don’t." – mnky9800n