Project ideas from Hacker News discussions.

Postgres data stored in Parquet on S3: LTAP architecture explained

📝 Discussion Summary (Click to expand)

Top 3 Themes from the discussion

  1. Separating hot vs. cold data
    Streaming replication gives full row history, but OLTP disks cannot store that history at scale. Most teams move “cold/archival” data to an OLAP store and keep only recent rows in the OLTP DB.

    “We ended up with 'hot' data in oltp and 'cold/archival' data in olap because the storage size of oltp has always been limited.” – eveningtree

  2. SCD Type 2 via CDC pipelines or time‑travel tables
    Historical rows are built either by capturing logical changes in a pipeline (windowed lag on timestamps) or by using warehouse features such as Iceberg/Delta time travel. This creates a reliable versioned view without forcing SCD 2 tables inside the source DB.

    “If safe keeper exposes the changes to the tables somehow, a type2 scd is just a windowed lag over the primary key sorted by the timestamp.” – nikita

  3. Skepticism about unified storage / LTAP architecture
    Many commenters doubt that a single storage engine can replace purpose‑built OLTP and OLAP systems, citing CDC overhead, timestamp mis‑alignment, and real‑world cost concerns (e.g., S3 pricing, replication tax).

    “I’m skeptical of the idea of unifying storage just because it sounds ‘elegant’ or ‘cool’.” – saisrirampur


🚀 Project Ideas

ChronoDB: Inline Temporal PostgreSQL Extension

Summary

  • A built‑in PostgreSQL extension that automatically captures row‑level changes, stores them as SCD‑type 2 timelines, and exposes them via native SQL time‑travel queries.
  • Eliminates the need for separate CDC pipelines or external history tables, keeping full history inside the OLTP engine.

Details

Key Value
Target Audience PostgreSQL DBAs & developers building audit‑heavy or versioned applications
Core Feature Automatic SCD‑type 2 versioning with valid_from/valid_to columns and AS OF time‑travel queries
Tech Stack PostgreSQL C extension, logical replication slots, embedded columnar cache, Rust for performance
Difficulty Medium
Monetization Revenue-ready: Subscription per active database node

Notes

  • HN users repeatedly lament the “copy‑to‑OLAP” step and CDC overhead – this solution removes that duplication while staying inside PostgreSQL.
  • Discussion potential: Could be adopted by companies like Nexitch (LTAP) to unify storage without moving data to separate warehouses.

DeltaLake Sync for PostgreSQL CDC

Summary

  • A lightweight sync service that ingests PostgreSQL WAL/CDC events and writes immutable Parquet files to an object store, building a Delta Lake/Iceberg‑compatible history layer.
  • Provides cheap, scalable storage for historical rows and built‑in time‑travel queries, while keeping hot data in PostgreSQL for OLTP.

Details

Key Value
Target Audience Data engineers & platform teams managing hybrid OLTP/OLAP workloads
Core Feature Real‑time CDC → Delta Lake tables with versioned snapshots and AS OF queries
Tech Stack Kafka/pg_recvlogical, Python/Lightning, Delta Lake libraries, MinIO/R2 backend
Difficulty Medium
Monetization Revenue-ready: Pay‑per‑TB stored + query compute minutes

Notes

  • Commenters like hasyimibhar suggested that version‑controlled DBs (e.g., Dolt) could replace SCD tables, but noted timestamp mismatches – this service addresses the timestamp‑accuracy problem by using event timestamps from the source.
  • HN interest likely in avoiding costly S3 bandwidth bills and using self‑hosted S3‑compatible stores for cheaper archival.

VaultLake: Cost‑Optimized Historical Archive for PostgreSQL

Summary

  • A self‑hosted archival service that streams PostgreSQL WAL to a columnar Parquet store on inexpensive object storage, tiering hot rows in‑memory for OLTP and cold rows on cheap storage, enabling fast analytical queries over full history.
  • Provides automatic SCD‑type 2 row versioning using windowed lag over primary keys, eliminating manual ETL pipelines.

Details

Key Value
Target Audience Enterprises with large historical datasets needing auditability and low storage costs
Core Feature Tiered storage: NVMe cache for recent writes, Parquet archive on S3‑compatible storage; query acceleration layer
Tech Stack PostgreSQL logical decoding, Rust, Parquet writer, MinIO/R2 or Backblaze B2
Difficulty High
Monetization Revenue-ready: Tiered pricing based on archive volume and query throughput

Notes

  • Users like nikita and zoltan_ emphasized the need for cheap, scalable storage and skepticism about S3 bandwidth; this project directly tackles those concerns with a plug‑and‑play archival layer.
  • Potential for HN discussion around replacing separate OLAP pipelines with a unified storage approach that still serves low‑latency OLTP reads.

Read Later