Project ideas from Hacker News discussions.

28M Hacker News comments as vector embedding search dataset

πŸ“ Discussion Summary (Click to expand)

Here are the three most prevalent themes from the Hacker News discussion:

1. Permanence of Online Content and the Desire for Deletion

The discussion is heavily centered on the lack of an easy account/comment deletion mechanism on Hacker News and the realization that published content is practically permanent due to data hoarding and replication.

  • Supporting Quote: "The words we type on this site diffuse rapidly onto innumerable independent devices where they are experimentation grist for herds of wild nerds around the globe. Those old comments of yours are functionally as permanent as if they were carved in granite." (delichon)
  • Supporting Quote: "If you request deletion of your Hacker News account, note that we reserve the right to refuse to (i) delete any of the submissions, favorites, or comments you posted on the Hacker News site or linked in your profile and/or (ii) remove their association with your Hacker News ID." (echelon)

2. The Use of User Content in Training AI/LLMs

A major undercurrent is the concern that user comments, posted without explicit consent for large-scale commercial use, are being ingested into massive AI training datasets and vector databases, often by companies affiliated with Y Combinator itself.

  • Supporting Quote: "It's also likely they've been used to train AI models." (qsort)
  • Supporting Quote: "I did not give so much to the public internet for the benefit of commercial AI models, simple as that. This breaks the relationship I had with the public internet, and like many others I will change my behaviour online to suit." (ehnto)

3. Questioning Platform Licensing and User Agency (GDPR/Legality)

Users debated the legal standing of HN's Terms of Service, particularly the non-commercial use restrictions imposed on users while YC and affiliated entities clearly benefit commercially from the accumulated data, as well as the applicability of privacy laws like the GDPR.

  • Supporting Quote: "Corporations have an unlimited right to bully and threaten to take down embarrassing content... but then if individuals do a much less egregious thing to try and take down their content they don’t even get paid for it’s immoral." (dangus)
  • Supporting Quote: "The law? I don't know, copyright law I guess? ... By licensing according to the terms of service, which is a binding contract, you are relinquishing those rights." (otterley answering GeoAtreides)

🚀 Project Ideas

Account Data Purge Service (ADPS)

Summary

  • A service that gives users a one-click way to find, contact, and request deletion of their public content (like HN comments) from third-party archives, directly addressing the permanence of published data and the lack of control users feel over it.
  • Core Value Proposition: Regaining digital autonomy and providing a feasible (if not guaranteed) path to content removal from derived datasets for privacy-conscious users.

Details

  • Target Audience: Privacy-conscious HN users who regret the permanence of their content and the lack of platform deletion tools.
  • Core Feature: Automated generation of deletion-request emails/API calls targeting known public HN archives (e.g., those hosted on Hugging Face, BigQuery, or personal sites like hn.fiodorov.es), coupled with a system to track the status of each request (see the sketch below).
  • Tech Stack: Backend: Python (Scrapy/BeautifulSoup for light site mapping, libraries for email automation). Frontend: simple web interface. Integration: mechanisms to contact known archive maintainers (where they offer contact endpoints).
  • Difficulty: Medium
  • Monetization: Hobby
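
A minimal sketch of the request generator, assuming a hand-maintained registry of archive contacts (the archive names and addresses below are placeholders, not verified endpoints):

```python
"""Generate GDPR-style deletion-request emails for known HN archives.

The registry below is hypothetical; a real service would maintain and
verify these contact endpoints.
"""
from email.message import EmailMessage

# Hypothetical archive registry: dataset name -> maintainer contact.
ARCHIVES = {
    "huggingface-hn-comments-dump": "privacy@example-archive.org",
    "hn.fiodorov.es": "maintainer@example.dev",
}

TEMPLATE = """\
Hello,

Under applicable data-protection law (e.g., GDPR Art. 17), I request
deletion of all comments and submissions associated with the Hacker News
username "{username}" from the dataset "{archive}".

Please confirm once the records have been removed.

Regards,
{username}
"""

def build_requests(username: str) -> list[EmailMessage]:
    """Create one deletion-request email per known archive."""
    messages = []
    for archive, contact in ARCHIVES.items():
        msg = EmailMessage()
        msg["To"] = contact
        msg["Subject"] = f"Data deletion request: HN user {username}"
        msg.set_content(TEMPLATE.format(username=username, archive=archive))
        messages.append(msg)
    return messages

if __name__ == "__main__":
    for msg in build_requests("throwaway123"):
        print(msg)  # a real service would hand off to an SMTP relay and log status
```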

Notes

  • Why HN commenters would love it: Directly addresses the frustration expressed by j4coh ("Oh to have had a delete account/comments option") and the legal/moral concerns raised by GeoAtreides and dangus regarding unauthorized use by third parties.
  • Potential for discussion or practical utility: This tool operationalizes the "Right to be Forgotten" for scraped public data, sparking discussion on data control vs. archival necessity.

Semantic Thread Association Tool (STAT)

Summary

  • A tool that uses advanced embedding models (like those mentioned by minimaxir and xfalcox) to map new comments/posts to semantically similar historical threads, regardless of exact keyword matches.
  • Core Value Proposition: Preventing redundant discussions and providing immediate historical context before a user submits a reply, reducing repeated intellectual labor.

Details

  • Target Audience: Active HN commenters (adverbly, delichon) who want to see whether their intended response, or the topic itself, has been hashed out before.
  • Core Feature: A browser extension that takes the text the user is typing (or the current thread's content) and queries a pre-computed index of all past HN comment threads/links (using high-quality embeddings such as EmbeddingGemma or bge-m3) to surface "Similar Threads"; see the sketch below.
  • Tech Stack: Backend: vector database (e.g., ClickHouse as mentioned by gkbrk, or Pinecone/Weaviate). Model Inference: state-of-the-art open embeddings (as discussed in the thread) for indexing. Frontend: JavaScript/browser extension.
  • Difficulty: High
  • Monetization: Hobby
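
A minimal sketch of the lookup path, using sentence-transformers with all-MiniLM-L6-v2 as a stand-in for bge-m3/EmbeddingGemma, and a tiny in-memory index in place of ClickHouse:

```python
"""Toy "Similar Threads" lookup: embed a draft comment and rank
historical threads by cosine similarity. The thread list is a toy
stand-in for the pre-computed HN index."""
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

# Stand-in for the pre-computed index of historical thread titles.
threads = [
    "Ask HN: Can I delete my old comments?",
    "Show HN: Semantic search over Hacker News",
    "28M Hacker News comments as a vector search dataset",
]
index = model.encode(threads, normalize_embeddings=True)

def similar_threads(draft: str, k: int = 2) -> list[tuple[float, str]]:
    """Return the k historical threads most similar to the draft text."""
    query = model.encode([draft], normalize_embeddings=True)
    scores = (index @ query.T).ravel()  # cosine similarity (normalized vectors)
    top = np.argsort(-scores)[:k]
    return [(float(scores[i]), threads[i]) for i in top]

if __name__ == "__main__":
    for score, title in similar_threads("Is there a way to remove my account?"):
        print(f"{score:.3f}  {title}")
```

In the real product, the same query would hit a server-side vector index over all historical threads, with the extension only shipping text and rendering results.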

Notes

  • Why HN commenters would love it: Directly implements the concept mentioned by adverbly: "It would actually be so interesting to have comment, replies and thread associations according to semantic meaning rather than direct links."
  • Potential for discussion or practical utility: This would generate massive utility by reducing meta-discussions about recurring topics (e.g., "Hanlon's razor" fatigue mentioned by JacobThreeThree).

Comparative Vector Model Benchmarker (CVMB)

Summary

  • A service that simplifies the complex task of selecting or benchmarking embedding models for specific corpora, allowing users to compare different models based on relevance, size, and licensing concerns.
  • Core Value Proposition: Providing clear, actionable metrics for "which vector search is worth the juice," addressing the confusion highlighted by SchwKatze and developers discussing model choices (minimaxir, dangoodmanUT).

Details

  • Target Audience: Developers and researchers building RAG/semantic-search applications, especially those concerned about model size and licensing (simonw, kaycebasques).
  • Core Feature: An interactive leaderboard/platform where users upload small reference datasets (or use standardized corpus subsets, such as the Gutenberg novels discussed by SteveJS) and run the latest open and closed embedding models head-to-head, measuring embedding dimensionality and footprint, retrieval accuracy on relevance tasks, and licensing compatibility (e.g., flagging Gemma's license). A sketch of the comparison loop follows below.
  • Tech Stack: Backend: Python ecosystem (Sentence Transformers, MTEB leaderboard integration). Frontend: data-visualization library (e.g., React/D3) to compare latency, model size, and MTEB scores.
  • Difficulty: Medium
  • Monetization: Hobby
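
A minimal sketch of that comparison loop, scoring candidate models with recall@1 on a tiny labeled retrieval set (the model names and data are illustrative, not a real benchmark):

```python
"""Compare embedding models on a toy retrieval task via recall@1."""
import numpy as np
from sentence_transformers import SentenceTransformer

CANDIDATES = ["all-MiniLM-L6-v2", "all-mpnet-base-v2"]  # swap in bge-m3, etc.

docs = [
    "How HN comment data ends up in AI training sets",
    "A recipe for sourdough bread",
    "Benchmarking open embedding models on MTEB",
]
# Each query is paired with the index of its one relevant document.
queries = [
    ("LLM training on scraped forum posts", 0),
    ("which embedding model scores best on benchmarks", 2),
]

def recall_at_1(model_name: str) -> float:
    """Fraction of queries whose top-ranked document is the labeled one."""
    model = SentenceTransformer(model_name)
    doc_vecs = model.encode(docs, normalize_embeddings=True)
    hits = 0
    for query, gold in queries:
        q = model.encode([query], normalize_embeddings=True)
        hits += int(np.argmax(doc_vecs @ q.T) == gold)
    return hits / len(queries)

if __name__ == "__main__":
    for name in CANDIDATES:
        print(f"{name}: recall@1 = {recall_at_1(name):.2f}")
```

A full service would add latency and on-disk size measurements per model and pull standardized scores from the MTEB leaderboard for context.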

Notes

  • Why HN commenters would love it: It directly facilitates the technical conversations happening towards the end of the thread regarding model quality ("BGE M3 on my retrieval tasks," questioning MiniLM's age) and practical considerations (client-side size).
  • Potential for discussion or practical utility: It creates a single source of truth for the rapidly evolving embedding landscape, useful for anyone trying to build their "own database" (edwardzcn).