From Transactional Bottlenecks to Lightning-Fast Analytics

At XaasIO, our data platform architecture has evolved organically through the typical journey every scalable platform undergoes. Initially, we leaned heavily on Online Transaction Processing (OLTP) databases – primarily PostgreSQL – to manage:
- Customer-specific configurations
- Feature metadata
- Access control policies
- Operational data owned by microservices
- Data generated by microservices
This architecture worked fine while data access requirements were simple. However, as the platform expanded and use cases diversified – especially around the last category, data generated by microservices – we found it increasingly difficult to manage and extract value from this data. Engineers often had to query databases manually, write custom scripts, and stitch data together across stores, with no unified view.
As each team built its own microservices and databases optimized for its workloads, data fragmentation grew. While some decisions followed microservices best practices, others were driven by rapid development demands or just developer convenience.
Simultaneously, we were building a petabyte-scale data lake to store and analyze cybersecurity telemetry and other analytics data. However, the demand to join OLTP and data lake data in analytical workflows exposed the limitations of our setup. We needed a solution to seamlessly query across data sources without migrating everything to a single system.
Embracing Federated Query Engines
We explored federated query engines like Trino, Apache Drill, and Dremio. We chose Trino (formerly PrestoSQL) for its performance, scalability, ANSI SQL support, and rich connector ecosystem.
With Trino, we could:
- Query across multiple OLTP databases and the data lake
- Avoid heavy ETL pipelines
- Minimize data movement
But as data volume exploded, especially in JOIN-heavy queries between OLTP and the data lake, performance bottlenecks emerged.
Bottlenecks and Their Root Causes
1. PostgreSQL Contention
We used PostgreSQL for transactional and real-time read workloads. As analytical queries grew:
- Replication Lag became a problem due to heavy write loads
- Long-running Queries clashed with replication, causing query cancellations
- Write Latency increased with vacuuming, indexing, and maintenance overhead
2. Federated Query Limitations
Joining PostgreSQL with Hive-based Parquet/ORC data resulted in:
- Single-node Execution Bottlenecks (JDBC sequential reads from PostgreSQL)
- No Join Pushdown (forcing large datasets into Trino memory)
- Limited Parallelism (PostgreSQL connector lacked split generation)
- Latency in Multi-source Queries (caused dashboard delays)
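The lack of split generation matters because a single JDBC connection reads an entire PostgreSQL table serially through one Trino worker. A minimal sketch of the kind of range-based split generation the connector lacked, partitioning on a numeric primary key (the table and column names below are hypothetical):

```python
def generate_splits(min_id: int, max_id: int, num_splits: int) -> list[tuple[int, int]]:
    """Partition [min_id, max_id] into contiguous, non-overlapping ranges."""
    if num_splits < 1 or max_id < min_id:
        raise ValueError("invalid split request")
    total = max_id - min_id + 1
    size = -(-total // num_splits)  # ceiling division
    splits = []
    lo = min_id
    while lo <= max_id:
        hi = min(lo + size - 1, max_id)
        splits.append((lo, hi))
        lo = hi + 1
    return splits

# Each split becomes an independent query a worker could run in parallel, e.g.:
#   SELECT * FROM events WHERE id BETWEEN {lo} AND {hi}
```

Without this, every byte from PostgreSQL funnels through one connection, no matter how many Trino workers sit idle.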
3. Lack of a Unified Source of Truth
We considered two paths:
- Direct Lakehouse for Microservices: migrating microservices to read directly from the lakehouse carried high migration cost and risk
- ETL to the Data Lake: periodic batch pipelines introduced data freshness issues
Neither option supported the real-time consistency needed for operational and analytical parity.
The Shift to CDC + Hudi
We needed a hybrid approach that could:
- Offload hot data from PostgreSQL
- Ensure real-time consistency between OLTP and analytics
- Enable scalable, performant lakehouse querying
That’s when we adopted Change Data Capture (CDC) using Debezium and built a CDC pipeline into Apache Hudi.
Architecture Overview
Section 1: CDC Capture
- PostgreSQL – Source of truth
- Debezium – Captures changes via WAL replication and streams to Kafka
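For a concrete sense of what the capture side looks like, here is an illustrative Debezium PostgreSQL connector configuration. All hostnames, credentials, slot and table names are placeholders, not our production values; in practice this is registered as JSON via the Kafka Connect REST API.

```python
# Illustrative Debezium PostgreSQL connector config (placeholder values only).
debezium_connector_config = {
    "name": "pg-cdc-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "plugin.name": "pgoutput",                # logical decoding from the WAL
        "database.hostname": "pg-primary.internal",
        "database.port": "5432",
        "database.user": "cdc_user",
        "database.password": "********",
        "database.dbname": "platform",
        "slot.name": "cdc_slot",                  # replication slot on the primary
        "table.include.list": "public.customers,public.features",
        "topic.prefix": "pg",                     # topics become pg.public.<table>
    },
}
```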
Section 2: Event Decoration
- Kafka – Event buffer and stream broker
- DebeziumConsumer – Transforms events, enriches data, and re-publishes for ingestion
Section 3: Lake Ingestion
- Spark Streaming – Writes data to Hudi with upserts
- Data Lake (S3/HDFS) – Storage for raw and transformed ORC/Parquet files
- Trino – Query layer across the lakehouse
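In production this step is a Spark Structured Streaming job writing through the Hudi datasource, but the upsert semantics it relies on can be sketched in plain Python: records are keyed by a record key (`id` here, illustratively), duplicates are resolved by a precombine field where the latest event wins, and delete ops drop the key.

```python
def upsert_batch(table: dict, batch: list[dict]) -> dict:
    """Apply a micro-batch of decorated CDC records to table state.

    Mirrors Hudi-style upserts: latest _event_ts_ms wins per key
    (precombine), and delete ops remove the key entirely.
    """
    for rec in sorted(batch, key=lambda r: r["_event_ts_ms"]):
        key = rec["id"]
        existing = table.get(key)
        if existing and existing["_event_ts_ms"] > rec["_event_ts_ms"]:
            continue  # stale event; precombine keeps the newer row
        if rec["_op"] == "d":
            table.pop(key, None)
        else:
            table[key] = rec
    return table
```

This idempotent merge is what lets the pipeline replay Kafka offsets after a failure without producing duplicates in the lake.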
Why Apache Hudi?
We chose Hudi over Delta Lake and Iceberg due to:
- Native support for upserts and incremental ingestion
- Merge-on-Read (MOR) mode for low-latency updates
- Efficient storage with Bloom filters, bucketed layout
- Strong support for real-time analytics
- Incremental querying to avoid full-table scans
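Incremental querying is what lets downstream jobs avoid full-table scans: a consumer asks only for records committed after its last checkpoint. The real mechanism is Hudi's commit timeline, but the idea reduces to a sketch like this (the list-of-commits representation is purely illustrative):

```python
def incremental_read(commits: list[tuple[int, list[dict]]], since: int):
    """Return only records from commits strictly after checkpoint `since`,
    plus the new checkpoint to persist for the next run."""
    out = []
    latest = since
    for commit_ts, records in commits:
        if commit_ts > since:
            out.extend(records)
            latest = max(latest, commit_ts)
    return out, latest
```

A dashboard refresh that once scanned the whole table now touches only the commits added since its previous run.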
Lessons and Challenges
- PostgreSQL failover – Added snapshot support and sync workflows
- Large payloads – Introduced event-splitting logic
- Column skipping & deduplication – Enhanced beyond Debezium's defaults
- Latency tuning – Optimized for sub-minute sync across 100+ tables
- Custom framework – Built for connector management, schema evolution, and alerting
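Column skipping here means dropping columns whose values did not change between the before and after images of an update, which shrinks payloads for wide tables. A hedged sketch of that diff – key columns are always retained so downstream merges still work (the column names are illustrative):

```python
def skip_unchanged_columns(before: dict, after: dict, keys: set[str]) -> dict:
    """Keep key columns plus only the columns whose values actually changed."""
    return {
        col: val
        for col, val in after.items()
        if col in keys or before.get(col) != val
    }
```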
The Outcome: A Real-Time Analytical Backbone
✅ Offloaded PostgreSQL
Reduced read QPS by 10,000+, boosting OLTP reliability
✅ 10x Faster Dashboards
Trino + Hudi enabled fast, federated queries at scale
✅ Sub-minute Data Sync
OLTP and analytics now operate in near real-time harmony
✅ Asynchronous Cache Sync
Leveraging CDC events to keep Redis caches fresh and consistent without DB hits
✅ Reduced Infra Cost
Extended PostgreSQL lifespan and cut unnecessary scaling
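The asynchronous cache sync above applies the same CDC stream to cache keys, so reads hit the cache instead of PostgreSQL. A minimal sketch of the consumer logic, using a plain dict in place of a Redis client and a hypothetical `customer:{id}` key scheme:

```python
def apply_to_cache(cache: dict, event: dict) -> None:
    """Mirror a decorated CDC event into a key/value cache.

    In production this would be a Redis SET/DEL; a dict stands in here.
    Internal `_`-prefixed CDC metadata is stripped before caching.
    """
    key = f"customer:{event['id']}"
    if event["_op"] == "d":
        cache.pop(key, None)  # DEL on delete
    else:
        cache[key] = {k: v for k, v in event.items() if not k.startswith("_")}
```

Because the cache is updated from the change stream rather than on cache misses, entries stay fresh without ever issuing a read back to the database.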

