Three prevailing themes in the discussion
| Theme | Key points | Representative quotes |
|---|---|---|
| LLMs are engineered to be polite and non‑critical | Users note that models are designed to avoid challenging assumptions, leading to “sycophancy” and a lack of true reasoning. | “I think it's related to syncophancy. LLM are trained to not question the basic assumptions being made.” – wisty<br>“Gemini is the only AI that seems to really push back and somewhat ignores what I say.” – nomel<br>“I think there's also an 'alignment blinkers' effect.” – HPsquared |
| Pattern‑matching vs. grounded reasoning (the Car Wash Test) | The test exposes a gap between surface‑level pattern matching and genuine world‑model reasoning. RAG‑based summarization can “fix” the answer, but true reasoning is still missing. | “The test highlights a key limitation in current AI: the difference between 'pattern matching' and 'true, grounded reasoning'.” – PaulHoule<br>“It seems like the search ai results are generally misunderstood, I also misunderstood them for the first weeks/months.” – mlazowik |
| Evaluation methodology matters | The human baseline is weak (respondents were not asked to reason), context is often omitted from prompts, and enabling reasoning mode dramatically improves model performance. | “The human baseline seems flawed.” – tantalor<br>“I asked GPT‑5.2 10x times with thinking enabled and it got it right every time.” – randomtoast<br>“Since the conclusion is that context is important, I expected you’d redo the experiment with context.” – wrs |
These themes capture the discussion's main concerns: how LLM design biases work against critical pushback, the model limitations the Car Wash Test exposes, and the need for robust evaluation protocols.