Project ideas from Hacker News discussions.

A case study in PDF forensics: The Epstein PDFs

📝 Discussion Summary

Prevalent Themes in the Discussion

1. Year formatting in submission titles

Users debate whether to include the current year, noting the article is recent.

"We're in early February ~2025~ 2026" – michaelmcdonald

2. Redaction, removal, and archival concerns

Multiple participants worry that documents are being taken down or re‑redacted, urging independent preservation.

"Are they being removed or replaced with more heavily redacted documents?" – thatguy0900

"That's a lot of PeDoFiles! (But seriously, great work here!)" – tibbon

3. Technical deep‑dive on PDF processing, OCR, and metadata

Extensive discussion of how the files are being OCR’d, scanned, and stripped of metadata.

"Initially under “Epstein Files Transparency Act”… all datasets had .zip links… now it seems like most are back again." – embedding‑shape

4. Political blame and partisan speculation

Speculation about which parties are complicit, suppression of releases, and broader cultural implications.

"Democrats are complicit as well. Don't let them off the hook by making the mistake of thinking they're simply weak." – krapp


🚀 Project Ideas

[Automated PDF Sanitizer]

Summary

  • Strips EXIF data from embedded images and document‑level metadata in bulk, and scans publicly released PDFs for steganographic payloads.
  • Prevents inadvertent leakage of identifying information when authorities share scanned documents.
  • Generates clean, share‑ready copies for journalists, researchers, or activists.

Details

Key | Value
Target Audience | Researchers, journalists, FOIA requesters, activists
Core Feature | Bulk PDF processing with EXIF stripping, metadata analysis, and steganography detection
Tech Stack | Python (PyMuPDF, Pillow, Stegano), Docker for sandboxed execution
Difficulty | Medium
Monetization | Hobby

Notes

  • Directly addresses HN remarks about JPEGs containing EXIF data and concerns over metadata in released PDFs.
  • Could become a community‑maintained tool for open‑source leak‑handling workflows.

[Document Provenance & Archival Network]

Summary

  • Builds a decentralized, immutable repository for archiving leaked PDFs and tracking revisions.
  • Preserves versions so documents cannot be silently removed or altered after release.
  • Enables verification of authenticity via cryptographic hashes and change alerts.

Details

Key | Value
Target Audience | Archivists, historians, investigative journalists, FOIA researchers
Core Feature | Content‑addressed storage with version diffing and provenance metadata
Tech Stack | IPFS + PostgreSQL, Docker, React front‑end, automated hash verification scripts
Difficulty | High
Monetization | Revenue-ready: Subscription tier for institutions and NGOs

Notes

  • Responds to HN commenters hoping that “someone is independently archiving all documents”, and to the risk of files being taken down.
  • Provides a transparent audit trail that could be referenced in future inquiries.
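
The content‑addressed storage and change‑alert idea can be sketched in a few lines, assuming documents arrive as raw bytes keyed by their source URL. This is an in‑memory illustration only: the ProvenanceLog class and its record method are hypothetical names, and a real deployment would persist to IPFS and PostgreSQL as the tech stack suggests.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ProvenanceLog:
    """In-memory stand-in for a content-addressed archive with version history."""
    blobs: dict = field(default_factory=dict)    # sha256 hex digest -> content bytes
    history: dict = field(default_factory=dict)  # source URL -> ordered list of digests

    def record(self, url: str, content: bytes) -> tuple[str, bool]:
        """Store a fetched copy; return (digest, changed_since_last_fetch)."""
        digest = hashlib.sha256(content).hexdigest()
        self.blobs.setdefault(digest, content)  # identical content is stored once
        versions = self.history.setdefault(url, [])
        changed = bool(versions) and versions[-1] != digest
        if not versions or changed:
            versions.append(digest)
        return digest, changed


log = ProvenanceLog()
url = "https://example.org/dataset/doc-001.pdf"  # placeholder URL
first, changed_first = log.record(url, b"original release")
again, changed_again = log.record(url, b"original release")
later, changed_later = log.record(url, b"re-redacted release")
```

Because the key is the hash of the content, a silently replaced file shows up as a new digest under the same URL, while the earlier version stays retrievable from blobs.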

[Stylometry Matcher for Leak Authorship]

Summary

  • Offers a web UI to upload collections of documents and compare their writing styles against a large pool of suspects.
  • Generates similarity scores, heat‑maps, and candidate author lists to aid pattern‑finding in anonymous leaks.
  • Reduces manual effort required for the stylometric analysis discussed on HN.

Details

Key | Value
Target Audience | Researchers, investigative journalists, legal analysts
Core Feature | Upload‑based batch processing with visual similarity maps and author suggestions
Tech Stack | Python (sentence‑transformers, spaCy), Flask API, React UI, PostgreSQL backend
Difficulty | Medium
Monetization | Revenue-ready: Cloud API with pay‑per‑call pricing

Notes

  • Inspired by HN debates on stylometry, including comments by Der_Einzige and the desire to automate detection.
  • Could help verify claims that specific authors edited redacted documents.
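
As a rough illustration of how similarity scores could be produced, the sketch below ranks candidate authors by cosine similarity over character trigram counts. This is a deliberately simple stand‑in for the sentence‑transformers embeddings named in the tech stack, chosen because it needs only the standard library; the function names are hypothetical, and a high score is a lead to investigate, not evidence of authorship.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count character n-grams after lowercasing and collapsing whitespace."""
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def rank_candidates(unknown: str, candidates: dict, n: int = 3) -> list:
    """Return (name, score) pairs sorted from most to least similar."""
    target = char_ngrams(unknown, n)
    scores = {name: cosine(target, char_ngrams(sample, n)) for name, sample in candidates.items()}
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


ranked = rank_candidates(
    "the quick brown fox",
    {"author_a": "the quick brown fox", "author_b": "zzzz yyyy"},
)
```

Character n‑grams capture spelling and punctuation habits cheaply, but they are sensitive to topic and text length, so any real comparison would need much longer samples than this toy input.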

[Batch OCR Consistency Monitor]

Summary

  • Monitors OCR output of large PDF batches, flagging anomalies like misplaced symbols, “=” characters, or page‑skew inconsistencies.
  • Provides a dashboard of confidence scores and highlights pages where OCR diverges from expected patterns.
  • Saves analysts hours of manual verification when processing leaked document sets.

Details

Key | Value
Target Audience | Researchers, data engineers, archivists processing large PDF corpora
Core Feature | Real‑time consistency checking, anomaly flagging, and export of mismatched extracts
Tech Stack | Node.js (Puppeteer), Tesseract + AllenAI olmOCR‑2‑7B, Redis cache, Grafana UI
Difficulty | Medium
Monetization | Hobby

Notes

  • Addresses the “random ‘=’” issue highlighted in the thread, where embedding‑shape noted OCR mismatches.
  • Could integrate with the proposed PDF Sanitizer to streamline end‑to‑end leak analysis pipelines.
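
One way the anomaly flagging might start is a per‑page heuristic for stray “=” characters, sketched below under the assumption that OCR output is available as one text string per page. The pattern, the threshold, and the flag_pages name are all illustrative choices rather than anything specified in the thread, and legitimate text such as “a=b” would also match.

```python
import re

# An "=" wedged between letters, or trailing a letter at the end of a line.
STRAY_EQUALS = re.compile(
    r"(?<=[A-Za-z])=(?=[A-Za-z])|(?<=[A-Za-z])=[ \t]*$",
    re.MULTILINE,
)


def flag_pages(pages: list, max_per_1000_chars: float = 1.0) -> list:
    """Return (page_number, hit_count) for pages whose stray-'=' rate exceeds the threshold."""
    flagged = []
    for number, text in enumerate(pages, start=1):
        hits = len(STRAY_EQUALS.findall(text))
        rate = 1000 * hits / max(len(text), 1)  # guard against empty pages
        if rate > max_per_1000_chars:
            flagged.append((number, hits))
    return flagged


pages = [
    "Clean page of plain prose with no stray symbols at all.",
    "Please forward the docu=ment to the of=\nfice today.",
]
flagged = flag_pages(pages)
```

Normalizing by page length keeps long pages from being flagged for a single hit; the same loop could feed per‑page rates into a dashboard instead of applying a fixed cutoff.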
