Project ideas from Hacker News discussions.

I failed to recreate the 1996 Space Jam website with Claude

📝 Discussion Summary

The discussion revolves around using Large Language Models (LLMs) for complex generative tasks, particularly recreating visual web designs from screenshots. Three prevalent themes emerged:

1. LLMs Struggle with Precise Spatial and Geometrical Reasoning

Many users noted the difficulty LLMs have in handling tasks that require accurate spatial arrangement, precise pixel measurement, or geometric reconstruction, even when using multimodal inputs.

  • Supporting Quote: One user summarized the core issue well: "LLMs don't have precise geometrical reasoning from images. Having an intuition of how the models work is actually a defining skill in 'prompt engineering'" (mcbuilder).
  • Supporting Quote: Another noted the failure mode in a related context: "Try getting a chatbot to make an ascii-art circle with a specific radius and you'll see what I mean." (dcanelhas).

2. LLM Overconfidence and User Trust Are Major Concerns

A significant portion of the conversation focused on the inherent overconfidence of LLMs and the difficulty users face in reliably vetting the output, especially for subtle errors that a junior developer might miss.

  • Supporting Quote: A participant observed the general tendency: "All AI's are overconfident. It's impressive what they can do, but it is at the same time extremely unimpressive what they can't do while passing it off as the best thing since sliced bread." (jacquesm).
  • Supporting Quote: The risk inherent in untrustworthy output was highlighted: "what if the LLM gets something wrong that the operator (a junior dev perhaps) doesn't even know it's wrong? that's the main issue: if it fails here, it will fail with other things, in not such obvious ways." (GeoAtreides).

3. Iterative Prompting and Tool Use Are Expected Over "One-Shot" Success

The discussion suggested that achieving good results often requires moving beyond simple, single-prompt requests toward guided, multi-step processes that involve the LLM writing its own testing or analysis tools.

  • Supporting Quote: One suggested approach to improve reliability involved mandating self-correction: "The right way to handle this is not to build it grids and whatnot... but to instruct it to build image processing tools of its own and to mandate their use in constructing the coordinates required..." (fnordpiglet).
  • Supporting Quote: Another user cautioned against judging capability based on initial attempts: "It's not fair to judge Claude based on a one shot like this... Maybe on try three it totally nails it." (thecr0w).

🚀 Project Ideas

Iterative Visual Grounding Agent (IVGA)

Summary

  • A tool designed to overcome LLMs' weak inherent spatial reasoning when interpreting images by letting the model programmatically interact with and measure the source image, following the approach experts suggested in the HN thread.
  • Provides robust, measurable coordinates for visual elements, turning a "one-shot inference" problem into an iterative, verifiable coding task.

Details

  • Target Audience: Developers and prompt engineers pushing the limits of multimodal LLMs for design tasks.
  • Core Feature: An agentic framework where the LLM writes code (e.g., using PIL, OpenCV, or similar libraries) to measure, analyze, and iterate on coordinates based on feedback derived from the original screenshot (see the sketch after the Notes below).
  • Tech Stack: Python; LangChain/LlamaIndex for agent orchestration; Pillow (PIL) or OpenCV for image processing; and a state-of-the-art VLM (e.g., Qwen-VL or a future Claude/Gemini release) for initial semantic grounding.
  • Difficulty: High
  • Monetization: Hobby

Notes

  • Users highlighted the need for the LLM to "build image processing tools of its own and to mandate their use in constructing the coordinates" (fnordpiglet).
  • The project directly addresses the difficulty LLMs have with precise geometry and measurement, which even simple ASCII art exposes (dcanelhas). This shifts the burden from "seeing" pixels to "calculating" based on coded tools.
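
As a rough illustration of the kind of measurement tool the agent might be asked to write for itself, the sketch below uses OpenCV to extract bounding boxes of visually distinct regions from a screenshot. The file name, the Otsu-threshold approach, and the minimum-area filter are illustrative assumptions, not part of the thread's proposal.

```python
# A measurement helper the agent could generate and then call on the
# source screenshot. Assumes OpenCV (cv2) is installed; "spacejam.png"
# and the dark-on-light threshold are illustrative choices.
import cv2

def measure_elements(screenshot_path: str, min_area: int = 100):
    """Return (x, y, w, h) boxes for visually distinct regions."""
    img = cv2.imread(screenshot_path)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    # Otsu thresholding separates foreground elements from the page background.
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    n, _, stats, _ = cv2.connectedComponentsWithStats(binary, connectivity=8)
    boxes = []
    for i in range(1, n):  # label 0 is the background component
        x, y, w, h, area = stats[i]
        if area >= min_area:
            boxes.append((int(x), int(y), int(w), int(h)))
    return boxes

if __name__ == "__main__":
    for box in measure_elements("spacejam.png"):
        print(box)  # concrete coordinates the LLM can copy into its CSS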

Historical Web Standard LLM Sandbox

Summary

  • A controlled, curated environment that allows users to query LLMs specifically about outdated or competing web standards (like CSS1 alternatives mentioned in the discussion) and immediately test the generated code against a known-good reference implementation.
  • Addresses foundational knowledge gaps in LLMs around older standards, caused by SEO-driven and recency-biased filtering of training data.

Details

  • Target Audience: Web historians, nostalgia enthusiasts, and developers wanting to understand pre-modern web constraints for historical accuracy or benchmarking.
  • Core Feature: A web sandbox that accepts a prompt specifying year/standard (e.g., "HTML 3.2, 1997") and renders the LLM's output in an isolated environment verified against historical parser behavior (see the prompt sketch after the Notes below).
  • Tech Stack: Vanilla JS/HTML/CSS frontend for rendering; Node.js or Python backend for managing sandbox instances; LLM API access with system prompts that lock the generation scope (e.g., "You are only allowed to use TABLE layout for positioning").
  • Difficulty: Medium
  • Monetization: Hobby

Notes

  • Directly addresses the frustration that LLMs lack knowledge of non-dominant specs because of "no SEO value" in modern scrapes (lagniappe).
  • Users expressed a desire to "Just spec the year and target browser and target standard" for better results (mxfh). This tool literalizes that request into a productive workflow.
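
To make the "just spec the year and target standard" idea concrete, here is a minimal sketch of how the backend might map a (standard, year) pair to a locked-down system prompt. `call_llm` is a stand-in for whichever model client the project adopts, and the prompt wording and standards table are assumptions, not vetted historical specs.

```python
# Sketch: lock LLM generation to a historical web standard via system prompt.
# `call_llm` is a placeholder for the chosen LLM client; the prompt text and
# the STANDARDS table below are illustrative assumptions.
STANDARDS = {
    ("HTML 3.2", 1997): (
        "You are a web developer in 1997. Use only HTML 3.2 as published "
        "by the W3C. No CSS, no JavaScript: layout must be done with "
        "TABLE, FONT, and CENTER tags and GIF spacer images."
    ),
}

def build_system_prompt(standard: str, year: int) -> str:
    try:
        return STANDARDS[(standard, year)]
    except KeyError:
        raise ValueError(f"No prompt profile for {standard} ({year})")

def generate_page(call_llm, task: str, standard: str, year: int) -> str:
    """Ask the model for a page, constrained to the requested era."""
    system = build_system_prompt(standard, year)
    html = call_llm(system=system, user=task)
    return html  # hand this to the isolated renderer / historical parser
```

The sandbox then only needs to render the returned markup and compare it against the reference implementation for that era.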

Unobvious Failure Mode Detector (UFMD)

Summary

  • A service that audits LLM-generated code not for syntactic correctness or basic functional behavior, but for subtle, hard-to-detect spatial, alignment, or mathematical errors that humans might miss ("unknown unknowns").
  • This tool aims to make LLM failures "obvious," addressing the core UX critique that current LLMs hide their mistakes well.

Details

  • Target Audience: Engineers, managers, and QA specialists who are wary of adopting LLM-generated code due to trust concerns over non-obvious errors (godelski, a4isms).
  • Core Feature: A post-generation analysis module that produces programmatic difference metrics (e.g., coordinate deviation heatmaps, visual diffs against a reference image, or numerical deviation reports from intended logic) and flags results that fall outside a configurable tolerance threshold (e.g., anything below a 99.9% match to the reference); a minimal diff sketch follows the Notes below.
  • Tech Stack: Python with ImageMagick or OpenCV for robust pixel/vector comparison; integration with an LLM/agent framework to feed results back into the measurement loop (sdenton4, sigseg1v); and a clear, highly readable visual reporting dashboard.
  • Difficulty: High
  • Monetization: Hobby

Notes

  • Targets the central issue: "It is hard to figure out when they're wrong" (godelski) and that good tool design "makes its failure modes obvious" (godelski).
  • Provides the automated checks that users wished the original experiment had, creating a critical layer of objective scrutiny above simple compilation or basic unit tests.
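
One way such a difference metric could work is sketched below: a pixel-level diff ratio between the rendered LLM output and the reference screenshot, checked against a configurable tolerance. Pillow and NumPy are assumed to be available; the file names, the per-channel noise allowance, and the 0.1% default tolerance are illustrative choices, not part of the discussion.

```python
# One candidate difference metric: the fraction of pixels that differ between
# the render of the LLM's output and the reference screenshot. Pillow and
# NumPy are assumed; file names and the 0.001 tolerance are illustrative.
import numpy as np
from PIL import Image

def pixel_diff_ratio(render_path: str, reference_path: str) -> float:
    render = Image.open(render_path).convert("RGB")
    reference = Image.open(reference_path).convert("RGB")
    # Compare at the reference's resolution so the arrays always align.
    render = render.resize(reference.size)
    a = np.asarray(render, dtype=np.int16)
    b = np.asarray(reference, dtype=np.int16)
    # A pixel "differs" if any channel deviates by more than a small amount.
    differing = (np.abs(a - b) > 8).any(axis=-1)
    return float(differing.mean())

def audit(render_path: str, reference_path: str, tolerance: float = 0.001) -> bool:
    """Return True if the render is within tolerance of the reference."""
    ratio = pixel_diff_ratio(render_path, reference_path)
    print(f"{ratio:.4%} of pixels differ (tolerance {tolerance:.4%})")
    return ratio <= tolerance
```

A failing audit would then surface a heatmap or report rather than a silent pass, which is exactly the "make failure modes obvious" behavior the thread asked for.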