Project ideas from Hacker News discussions.

iPhone 17 Pro Demonstrated Running a 400B LLM

📝 Discussion Summary

Theme 1 – Hardware breakthroughs
Recent iPhone‑level chips can actually host a 400B MoE model, something many thought impossible a year ago.

"A year ago this would have been considered impossible. The hardware is moving faster than anyone's software assumptions." — ashwinnair99

Theme 2 – Software innovation
Running such a model relies on clever engineering – MoE routing, flash‑attention, KV‑cache streaming, and on‑device quantization rather than special ASICs.

"This isn't a hardware feat, this is a software triumph. They crafted a large model so that it could run on consumer hardware (a phone)." — cogman10

Theme 3 – Speed and practicality concerns
Even when it works, the throughput is far from interactive; users call it “objectively slow” and note the 100× slowdown compared with server‑grade latency.

"It is objectively slow at around 100× slower than what most people consider usable." — Terretta

Theme 4 – Future implications & edge‑AI trends
Commentators see this as a stepping stone toward ubiquitous on‑device AI, but they stress that true viability will require lighter models, more RAM, or new silicon, not just bigger phones.

"I think the future is the model becoming lighter not the hardware becoming heavier." — RALaBarge


🚀 Project Ideas

[FlashMoE StreamingEngine]

Summary

  • Enables on-device inference of 300B+ parameter models by streaming quantized weights directly from flash storage with OS-level caching.
  • Cuts first‑token latency by up to 2× on low‑end hardware, turning a demo into usable interactive AI.

Details

  • Target Audience: Developers & hobbyists building local LLMs on laptops, phones, or edge devices
  • Core Feature: Real‑time mmap‑style streaming of expert layers from SSD/NVMe with smart preloading and eviction
  • Tech Stack: Rust + mmap; bindings for llama.cpp, MLX, HuggingFace Transformers; optional C++ core
  • Difficulty: Medium
  • Monetization: Revenue-ready (subscription)

Notes

  • Why HN commenters would love it: they repeatedly praised SSD streaming and “Trust the OS” caching as the breakthrough that makes large models practical.
  • Potential: Makes interactive inference on edge hardware viable, addressing the exact pain point of running 400B MoE models on phones.
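The "trust the OS" streaming approach can be sketched in a few lines: map the checkpoint file once and let the kernel's page cache keep hot experts resident. This is a minimal Python sketch under assumed conventions; the file layout, `EXPERT_BYTES` size, and helper names are hypothetical, and a real engine (as the idea proposes) would use Rust with llama.cpp/MLX bindings.

```python
import mmap
import os
import tempfile

EXPERT_BYTES = 4096  # hypothetical size of one quantized, packed expert block

def write_demo_checkpoint(path, n_experts):
    # Toy checkpoint layout: n_experts fixed-size expert blocks back to back.
    with open(path, "wb") as f:
        for i in range(n_experts):
            f.write(bytes([i % 256]) * EXPERT_BYTES)

class ExpertStreamer:
    """Maps the checkpoint once; the OS page cache decides which expert
    blocks stay resident -- the 'trust the OS' approach praised in the thread."""

    def __init__(self, path):
        self.f = open(path, "rb")
        self.mm = mmap.mmap(self.f.fileno(), 0, access=mmap.ACCESS_READ)

    def load_expert(self, idx):
        # Slicing the mmap faults in only the pages actually touched.
        off = idx * EXPERT_BYTES
        return self.mm[off:off + EXPERT_BYTES]

    def preload(self, indices):
        # Best-effort hint that these experts will be needed soon
        # (madvise is only available on Unix, Python 3.8+).
        if hasattr(self.mm, "madvise") and hasattr(mmap, "MADV_WILLNEED"):
            for idx in indices:
                self.mm.madvise(mmap.MADV_WILLNEED, idx * EXPERT_BYTES, EXPERT_BYTES)

    def close(self):
        self.mm.close()
        self.f.close()

path = os.path.join(tempfile.mkdtemp(), "experts.bin")
write_demo_checkpoint(path, n_experts=8)
streamer = ExpertStreamer(path)
streamer.preload([3, 4])          # hint the next likely experts
block = streamer.load_expert(3)
streamer.close()
```

Eviction comes for free here: under memory pressure the kernel simply drops cold pages, which is exactly the behavior the HN comments credited with making large models practical on flash storage.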

[AutoMoERouter CLI]

Summary

  • Takes a full‑size MoE checkpoint and auto‑generates a routing profile that only loads the experts actually needed per token.
  • Cuts RAM usage by 40–60% and speeds up inference on devices with tight memory limits.

Details

  • Target Audience: Researchers and engineers experimenting with large MoE models on consumer hardware
  • Core Feature: Automatic expert‑layer profiling, per‑token routing file generation, and optional quantization mapping
  • Tech Stack: Python + NumPy; CLI wrapper that outputs JSON routing maps compatible with llama.cpp and MLX
  • Difficulty: Low
  • Monetization: Hobby

Notes

  • Why HN commenters would love it: the discussion highlighted how “making expert selection more predictable also means making it less effective” and the need for smarter routing to save RAM.

  • Potential: Lowers the barrier for HN users to experiment with MoE demos without constantly swapping experts, fostering deeper community exploration.
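The profiling step can be sketched with a simple activation count: record which experts the router actually fires, keep the most-used ones, and emit a JSON routing map. The trace format and the fallback re-routing rule here are hypothetical illustrations, not the schema llama.cpp or MLX actually consume.

```python
import json
from collections import Counter

def profile_experts(token_expert_trace, keep_fraction=0.5):
    """Given a trace of (token_id, [expert ids fired]) pairs, keep the
    most-used experts and remap dropped experts onto the kept set."""
    counts = Counter()
    for _, experts in token_expert_trace:
        counts.update(experts)
    ranked = [e for e, _ in counts.most_common()]
    n_keep = max(1, int(len(ranked) * keep_fraction))
    kept = set(ranked[:n_keep])
    # Dropped experts fall back to the single most-used expert -- a crude
    # stand-in for the smarter re-routing the discussion called for.
    fallback = ranked[0]
    routing = {str(e): (e if e in kept else fallback) for e in ranked}
    return {"kept_experts": sorted(kept), "routing": routing}

# Hypothetical trace: token id -> experts the router selected for it.
trace = [(0, [2, 5]), (1, [2, 7]), (2, [5, 2]), (3, [1, 5])]
profile = profile_experts(trace, keep_fraction=0.5)
print(json.dumps(profile, indent=2))
```

This also makes the quoted trade-off concrete: forcing dropped experts onto a fallback makes routing predictable (and memory-friendly) at the cost of fidelity.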

[EdgeInfer Scheduler (SaaS)]

Summary

  • Provides a cloud‑based scheduler that runs heavyweight LLM inference on remote GPUs while streaming results back to the user’s device, enabling interactive latency on modest hardware.
  • Offers a fallback “local‑slow” mode that runs the same model locally when connectivity is unavailable.

Details

  • Target Audience: End‑users & developers who want interactive AI on modest devices without buying expensive hardware
  • Core Feature: Task queue, model streaming, context caching, and auto‑scaling compute pools for on‑demand inference
  • Tech Stack: Node.js backend, WebSockets, Docker/Kubernetes, Redis, gRPC for model serving, with gRPC‑streaming API
  • Difficulty: High
  • Monetization: Revenue-ready (tiered subscription, e.g., $9/mo basic, $49/mo pro)

Notes

  • Why HN commenters would love it: they expressed frustration about needing 12 GB RAM to run a MoE model and the desire for “real improvement will be when software engineers get into the training loop”.
  • Potential: Creates a marketplace for edge‑cloud AI, opening discussion on privacy, cost, and the feasibility of “on‑device” vs “cloud‑assisted” inference among HN readers.
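The core control flow, choosing a backend and streaming tokens back, can be sketched with asyncio. Both `remote_infer` and `local_infer` are stand-ins invented for this sketch; the real service would speak WebSockets/gRPC to GPU workers rather than fake tokens locally.

```python
import asyncio

async def remote_infer(prompt):
    # Stand-in for a GPU-backed worker: streams tokens back quickly.
    for tok in prompt.split():
        await asyncio.sleep(0)  # placeholder for network latency
        yield tok.upper()

async def local_infer(prompt):
    # Fallback "local-slow" path used when the remote pool is unreachable.
    for tok in prompt.split():
        await asyncio.sleep(0)
        yield tok

async def schedule(prompt, remote_ok=True):
    # Pick a backend and collect the streamed tokens as they arrive.
    backend = remote_infer if remote_ok else local_infer
    out = []
    async for tok in backend(prompt):
        out.append(tok)
    return out

print(asyncio.run(schedule("hello edge world")))                    # remote path
print(asyncio.run(schedule("hello edge world", remote_ok=False)))   # fallback
```

The design point is that the client consumes one token stream regardless of backend, so switching between cloud and local modes never changes the front-end contract.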

[PrefetchPrompt UI]

Summary

  • A developer‑focused SDK that adds learned KV‑cache prefetching to chatbot front‑ends, predicting the next expert layers to keep them resident in RAM.
  • Cuts first‑token latency from ~2 seconds to <0.5 seconds on devices with 8 GB RAM.

Details

  • Target Audience: Mobile app developers building AI‑enhanced chat experiences
  • Core Feature: On‑device inference engine integration with a TensorFlow Lite routing predictor that preloads likely experts
  • Tech Stack: Swift/Kotlin SDK, TensorFlow Lite model for routing, Rust inference core, optional cloud sync service
  • Difficulty: Medium
  • Monetization: Revenue-ready (per‑app license, e.g., $499 commercial license)

Notes

  • Why HN commenters would love it: the thread frequently mentioned “learned prefetching for what the next experts will be” and the desire to minimize expert‑layer swapping.
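One way to approximate "learned prefetching" without a trained model is a bigram count over observed expert sequences: given the expert used for the current token, preload the experts that most often followed it. This is a sketch standing in for the TensorFlow Lite predictor the idea proposes; the class and method names are invented for illustration.

```python
from collections import Counter, defaultdict

class ExpertPrefetcher:
    """Bigram model over observed expert sequences: given the expert fired
    for the current token, predict which experts to keep resident next."""

    def __init__(self):
        self.next_counts = defaultdict(Counter)

    def observe(self, expert_sequence):
        # Count which expert tends to follow which during real inference runs.
        for cur, nxt in zip(expert_sequence, expert_sequence[1:]):
            self.next_counts[cur][nxt] += 1

    def predict(self, current_expert, top_k=2):
        # Return up to top_k most likely successors, best first.
        return [e for e, _ in self.next_counts[current_expert].most_common(top_k)]

prefetcher = ExpertPrefetcher()
prefetcher.observe([1, 4, 1, 4, 2, 1, 4])  # toy trace of expert ids per token
print(prefetcher.predict(1))               # experts worth preloading after expert 1
```

A real SDK would feed these predictions into something like the mmap `madvise` hints above the inference loop, so the predicted experts are paged in before the router asks for them.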
