From Transactional Bottlenecks to Lightning-Fast Analytics

At XaasIO, our data platform architecture has evolved organically through the typical journey every scalable platform undergoes. Initially, we leaned heavily on Online Transaction Processing (OLTP) databases – primarily PostgreSQL – to manage:
- Customer-specific configurations
- Feature metadata
- Access control policies
- Operational data owned by microservices
- Data generated by microservices
This architecture worked fine while data access requirements were simple. However, as the platform expanded and use cases diversified – especially around the last category, data generated by microservices – we found it increasingly difficult to manage and extract value from this data. Engineers often had to query databases manually, write custom scripts, and stitch data together across stores, with no unified view.
As each team built its own microservices and databases optimized for its workloads, data fragmentation grew. While some decisions followed microservices best practices, others were driven by rapid development demands or just developer convenience.
Simultaneously, we were building a petabyte-scale data lake to store and analyze cybersecurity telemetry and other analytics data. However, the demand to join OLTP and data lake data in analytical workflows exposed the limitations of our setup. We needed a solution to seamlessly query across data sources without migrating everything to a single system.
Embracing Federated Query Engines
We explored federated query engines like Trino, Apache Drill, and Dremio. We chose Trino (formerly PrestoSQL) for its performance, scalability, ANSI SQL support, and rich connector ecosystem.
With Trino, we could:
- Query across multiple OLTP databases and the data lake
- Avoid heavy ETL pipelines
- Minimize data movement
But as data volume exploded, especially in JOIN-heavy queries between OLTP and the data lake, performance bottlenecks emerged.
Bottlenecks and Their Root Causes
1. PostgreSQL Contention
We used PostgreSQL for transactional and real-time read workloads. As analytical queries grew:
- Replication Lag became a problem due to heavy write loads
- Long-running Queries clashed with replication, causing query cancellations
- Write Latency increased with vacuuming, indexing, and maintenance overhead
2. Federated Query Limitations
Joining PostgreSQL with Hive-based Parquet/ORC data resulted in:
- Single-node Execution Bottlenecks (JDBC sequential reads from PostgreSQL)
- No Join Pushdown (forcing large datasets into Trino memory)
- Limited Parallelism (PostgreSQL connector lacked split generation)
- Latency in Multi-source Queries (caused dashboard delays)
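The lack of split generation matters because a single JDBC connection reads an entire PostgreSQL table serially through one Trino worker. A minimal sketch of the kind of range-based split generation the connector lacked, partitioning on a numeric primary key (the table and column names below are hypothetical):

```python
def generate_splits(min_id: int, max_id: int, num_splits: int) -> list[tuple[int, int]]:
    """Partition [min_id, max_id] into contiguous, non-overlapping ranges."""
    if num_splits < 1 or max_id < min_id:
        raise ValueError("invalid split request")
    total = max_id - min_id + 1
    size = -(-total // num_splits)  # ceiling division
    splits = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + size - 1, max_id)
        splits.append((lo, hi))
        lo = hi + 1
    return splits

# Each split becomes an independent query a worker could run in parallel, e.g.:
#   SELECT * FROM events WHERE id BETWEEN {lo} AND {hi}
```

Without this, every byte from PostgreSQL funnels through one connection, no matter how many Trino workers sit idle.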
3. Lack of a Unified Source of Truth
We considered two paths:
- Direct Lakehouse for Microservices: migrating microservices to read directly from the lakehouse carried high migration cost and risk
- ETL to the Data Lake: periodic batch pipelines introduced data freshness issues
Neither option supported the real-time consistency needed for operational and analytical parity.
The Shift to CDC + Hudi
We needed a hybrid approach that could:
- Offload hot data from PostgreSQL
- Ensure real-time consistency between OLTP and analytics
- Enable scalable, performant lakehouse querying
That’s when we adopted Change Data Capture (CDC) using Debezium and built a CDC pipeline into Apache Hudi.
Architecture Overview
Section 1: CDC Capture
- PostgreSQL – Source of truth
- Debezium – Captures changes via WAL replication and streams to Kafka
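For a concrete sense of what the capture side looks like, here is an illustrative Debezium PostgreSQL connector configuration. All hostnames, credentials, slot and table names are placeholders, not our production values; in practice this is registered as JSON via the Kafka Connect REST API.

```python
# Illustrative Debezium PostgreSQL connector config (placeholder values only).
debezium_connector_config = {
    "name": "pg-cdc-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                # logical decoding from the WAL
        "database.hostname": "pg-primary.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "platform",
        "slot.name": "cdc_slot",                  # replication slot on the primary
        "table.include.list": "public.customers,public.features",
        "topic.prefix": "pg",                     # topics become pg.public.<table>
    },
}
```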
Section 2: Event Decoration
- Kafka – Event buffer and stream broker
- DebeziumConsumer – Transforms events, enriches data, and re-publishes for ingestion
Section 3: Lake Ingestion
- Spark Streaming – Writes data to Hudi with upserts
- Data Lake (S3/HDFS) – Storage for raw and transformed ORC/Parquet files
- Trino – Query layer across the lakehouse
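In production this step is a Spark Structured Streaming job writing through the Hudi datasource, but the upsert semantics it relies on can be sketched in plain Python: records are keyed by a record key (`id` here, illustratively), duplicates are resolved by a precombine field where the latest event wins, and delete ops drop the key.

```python
def upsert_batch(table: dict, batch: list[dict]) -> dict:
    """Apply a micro-batch of decorated CDC records to table state.

    Mirrors Hudi-style upserts: latest _event_ts_ms wins per key
    (precombine), and delete ops remove the key entirely.
    """
    for rec in sorted(batch, key=lambda r: r["_event_ts_ms"]):
        key = rec["id"]
        existing = table.get(key)
        if existing and existing["_event_ts_ms"] > rec["_event_ts_ms"]:
            continue  # stale event; precombine keeps the newer row
        if rec["_op"] == "d":
            table.pop(key, None)
        else:
            table[key] = rec
    return table
```

This idempotent merge is what lets the pipeline replay Kafka offsets after a failure without producing duplicates in the lake.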
Why Apache Hudi?
We chose Hudi over Delta Lake and Iceberg due to:
- Native support for upserts and incremental ingestion
- Merge-on-Read (MOR) mode for low-latency updates
- Efficient storage with Bloom filters, bucketed layout
- Strong support for real-time analytics
- Incremental querying to avoid full-table scans
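Incremental querying is what lets downstream jobs avoid full-table scans: a consumer asks only for records committed after its last checkpoint. The real mechanism is Hudi's commit timeline, but the idea reduces to a sketch like this (the list-of-commits representation is purely illustrative):

```python
def incremental_read(commits: list[tuple[int, list[dict]]], since: int):
    """Return only records from commits strictly after checkpoint `since`,
    plus the new checkpoint to persist for the next run."""
    out = []
    latest = since
    for commit_ts, records in commits:
        if commit_ts > since:
            out.extend(records)
            latest = max(latest, commit_ts)
    return out, latest
```

A dashboard refresh that once scanned the whole table now touches only the commits added since its previous run.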
Lessons and Challenges
- PostgreSQL failover – Added snapshot support and sync workflows
- Large payloads – Introduced event-splitting logic
- Column skipping & deduplication – Enhanced beyond Debezium's defaults
- Latency tuning – Optimized for sub-minute sync across 100+ tables
- Custom framework – Built for connector management, schema evolution, and alerting
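Column skipping here means dropping columns whose values did not change between the before and after images of an update, which shrinks payloads for wide tables. A hedged sketch of that diff – key columns are always retained so downstream merges still work (the column names are illustrative):

```python
def skip_unchanged_columns(before: dict, after: dict, keys: set[str]) -> dict:
    """Keep key columns plus only the columns whose values actually changed."""
    return {
        col: val
        for col, val in after.items()
        if col in keys or before.get(col) != val
    }
```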
The Outcome: A Real-Time Analytical Backbone
✅ Offloaded PostgreSQL
Reduced read QPS by 10,000+, boosting OLTP reliability
✅ 10x Faster Dashboards
Trino + Hudi enabled fast, federated queries at scale
✅ Sub-minute Data Sync
OLTP and analytics now operate in near real-time harmony
✅ Asynchronous Cache Sync
Leveraging CDC events to keep Redis caches fresh and consistent without DB hits
✅ Reduced Infra Cost
Extended PostgreSQL lifespan and cut unnecessary scaling
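The asynchronous cache sync above applies the same CDC stream to cache keys, so reads hit the cache instead of PostgreSQL. A minimal sketch of the consumer logic, using a plain dict in place of a Redis client and a hypothetical `customer:{id}` key scheme:

```python
def apply_to_cache(cache: dict, event: dict) -> None:
    """Mirror a decorated CDC event into a key/value cache.

    In production this would be a Redis SET/DEL; a dict stands in here.
    Internal `_`-prefixed CDC metadata is stripped before caching.
    """
    key = f"customer:{event['id']}"
    if event["_op"] == "d":
        cache.pop(key, None)  # DEL on delete
    else:
        cache[key] = {k: v for k, v in event.items() if not k.startswith("_")}
```

Because the cache is updated from the change stream rather than on cache misses, entries stay fresh without ever issuing a read back to the database.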

