Postgres data stored in Parquet on S3: LTAP architecture explained

📝 Discussion Summary (Click to expand)

Top 3 Themes from the discussion

Separating hot vs. cold data
Streaming replication gives full row history, but OLTP disks cannot store that history at scale. Most teams move “cold/archival” data to an OLAP store and keep only recent rows in the OLTP DB.

“We ended up with 'hot' data in oltp and 'cold/archival' data in olap because the storage size of oltp has always been limited.” – eveningtree
SCD Type 2 via CDC pipelines or time‑travel tables
Historical rows are built either by capturing logical changes in a pipeline (windowed lag on timestamps) or by using warehouse features such as Iceberg/Delta time travel. This creates a reliable versioned view without forcing SCD 2 tables inside the source DB.

“If safe keeper exposes the changes to the tables somehow, a type2 scd is just a windowed lag over the primary key sorted by the timestamp.” – nikita
Skepticism about unified storage / LTAP architecture
Many commenters doubt that a single storage engine can replace purpose‑built OLTP and OLAP systems, citing CDC overhead, timestamp mis‑alignment, and real‑world cost concerns (e.g., S3 pricing, replication tax).

“I’m skeptical of the idea of unifying storage just because it sounds ‘elegant’ or ‘cool’.” – saisrirampur

🚀 Project Ideas

A built‑in PostgreSQL extension that automatically captures row‑level changes, stores them as SCD‑type 2 timelines, and exposes them via native SQL time‑travel queries.
Eliminates the need for separate CDC pipelines or external history tables, keeping full history inside the OLTP engine.

Key	Value
Target Audience	PostgreSQL DBAs & developers building audit‑heavy or versioned applications
Core Feature	Automatic SCD‑type 2 versioning with `valid_from`/`valid_to` columns and `AS OF` time‑travel queries
Tech Stack	PostgreSQL C extension, logical replication slots, embedded columnar cache, Rust for performance
Difficulty	Medium
Monetization	Revenue-ready: Subscription per active database node

HN users repeatedly lament the “copy‑to‑OLAP” step and CDC overhead – this solution removes that duplication while staying inside PostgreSQL.
Discussion potential: Could be adopted by companies like Nexitch (LTAP) to unify storage without moving data to separate warehouses.

A lightweight sync service that ingests PostgreSQL WAL/CDC events and writes immutable Parquet files to an object store, building a Delta Lake/Iceberg‑compatible history layer.
Provides cheap, scalable storage for historical rows and built‑in time‑travel queries, while keeping hot data in PostgreSQL for OLTP.

Key	Value
Target Audience	Data engineers & platform teams managing hybrid OLTP/OLAP workloads
Core Feature	Real‑time CDC → Delta Lake tables with versioned snapshots and `AS OF` queries
Tech Stack	Kafka/pg_recvlogical, Python/Lightning, Delta Lake libraries, MinIO/R2 backend
Difficulty	Medium
Monetization	Revenue-ready: Pay‑per‑TB stored + query compute minutes

Commenters like hasyimibhar suggested that version‑controlled DBs (e.g., Dolt) could replace SCD tables, but noted timestamp mismatches – this service addresses the timestamp‑accuracy problem by using event timestamps from the source.
HN interest likely in avoiding costly S3 bandwidth bills and using self‑hosted S3‑compatible stores for cheaper archival.

A self‑hosted archival service that streams PostgreSQL WAL to a columnar Parquet store on inexpensive object storage, tiering hot rows in‑memory for OLTP and cold rows on cheap storage, enabling fast analytical queries over full history.
Provides automatic SCD‑type 2 row versioning using windowed lag over primary keys, eliminating manual ETL pipelines.

Key	Value
Target Audience	Enterprises with large historical datasets needing auditability and low storage costs
Core Feature	Tiered storage: NVMe cache for recent writes, Parquet archive on S3‑compatible storage; query acceleration layer
Tech Stack	PostgreSQL logical decoding, Rust, Parquet writer, MinIO/R2 or Backblaze B2
Difficulty	High
Monetization	Revenue-ready: Tiered pricing based on archive volume and query throughput

Users like nikita and zoltan_ emphasized the need for cheap, scalable storage and skepticism about S3 bandwidth; this project directly tackles those concerns with a plug‑and‑play archival layer.
Potential for HN discussion around replacing separate OLAP pipelines with a unified storage approach that still serves low‑latency OLTP reads.