1. “Did the models actually get better?”
Commenters are split between anecdotal reports of a real jump (especially around Opus 4.5/4.6) and METR data showing a flat, or even declining, trend when all models are lumped together; a toy illustration of the curve-fit claim follows the quotes below.
“I’m really seeing it as well.” – postflopclarity
“The study shows a flat line… if you exclude GPT‑5 it matches a logistic curve.” – wongarsu
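To make the logistic-curve claim concrete, here is a toy sketch of the comparison wongarsu is gesturing at: fit both a straight line and a logistic to a capability-over-time series and see which leaves smaller residuals. The data points are invented placeholders, not METR's numbers.

```python
# Toy comparison: does a capability-over-time series look linear or logistic?
# The (time, score) pairs below are invented for illustration only.
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, L, k, t0):
    # Standard logistic curve: L / (1 + exp(-k * (t - t0)))
    return L / (1.0 + np.exp(-k * (t - t0)))

t = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0])       # hypothetical release times
score = np.array([0.05, 0.09, 0.18, 0.33, 0.52, 0.66, 0.73])  # hypothetical scores

# Fit the logistic and a straight line, then compare sum-of-squares residuals.
params, _ = curve_fit(logistic, t, score, p0=[1.0, 2.0, 1.5])
sse_logistic = float(np.sum((score - logistic(t, *params)) ** 2))

slope, intercept = np.polyfit(t, score, 1)
sse_linear = float(np.sum((score - (slope * t + intercept)) ** 2))

print(f"logistic SSE: {sse_logistic:.4f}   linear SSE: {sse_linear:.4f}")
```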
2. The harness is the real game‑changer
Many respondents argue that the leap in productivity comes from better tooling, agentic loops, and context management rather than from gains in the core model's reasoning; a minimal sketch of such a loop follows the quotes below.
“We’ve gotten better in harnessing not the models’ actual reasoning.” – jwpapi
“The practical capability jump is huge when you combine models with tool use, planning loops, and persistent context.” – idorozin
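In practice, the "harness" these commenters describe is roughly this loop: keep a persistent message history, let the model request tools, execute them, and feed the results back. The sketch below is hypothetical; `call_model` and the `search` tool are made-up stand-ins, stubbed so the example runs on its own, and do not represent any particular vendor's API.

```python
# Minimal agentic loop: persistent context + tool use + planning steps.
# `call_model` is a hypothetical stand-in for a chat API, stubbed here.
import json

def call_model(messages: list[dict]) -> dict:
    """Hypothetical model call. A real harness would hit an LLM API here;
    this stub requests one tool call, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search", "args": {"query": "Opus 4.5 release notes"}}
    return {"answer": "Done: summarized the search results."}

TOOLS = {
    # Hypothetical tool: a real harness might wrap a search API or a shell.
    "search": lambda query: f"[stub results for {query!r}]",
}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]  # persistent context
    for _ in range(max_steps):
        reply = call_model(messages)
        if "answer" in reply:                       # model says it is finished
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])  # execute the tool call
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "tool", "content": result})
    return "Step limit reached without an answer."

print(run_agent("Summarize what changed in Opus 4.5"))
```

The point of the sketch is that every capability gain here (tool access, looping, memory of prior steps) lives in the harness code, not in the model weights.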
3. Trust, reliability, and accountability remain hard problems
Even with newer models, users still have to review and fix the output, and when something goes wrong it is unclear who takes the blame. The absence of a clear owner for errors fuels skepticism.
“The issue with LLM’s is trust… who is accountable?” – jygg4
“We need to convince customers that we have the right technology… accountability is not easy.” – marcuschong
These three themes—debate over true model progress, the primacy of tooling, and ongoing trust/accountability concerns—dominate the discussion.