
From Transactional Bottlenecks to Lightning-Fast Analytics

At XaasIO, our data platform architecture has evolved organically, following the journey typical of every scalable platform. Initially, we leaned heavily on Online Transaction Processing (OLTP) databases – primarily PostgreSQL – to manage:

  • Customer-specific configurations
  • Feature metadata
  • Access control policies
  • Microservices-related data
  • Data generated by microservices

This architecture worked fine while data access requirements were simple. However, as the platform expanded and use cases diversified (especially the last category, data generated by microservices), we found it increasingly difficult to manage and extract value from this data. Engineers often had to query databases by hand, write custom scripts, and stitch data together across stores, with no unified view.

As each team built its own microservices and databases optimized for its workloads, data fragmentation grew. While some decisions followed microservices best practices, others were driven by rapid development demands or just developer convenience.

Simultaneously, we were building a petabyte-scale data lake to store and analyze cybersecurity telemetry and other analytics data. However, the demand to join OLTP and data lake data in analytical workflows exposed the limitations of our setup. We needed a solution to seamlessly query across data sources without migrating everything to a single system.

Embracing Federated Query Engines

We explored federated query engines like Trino, Apache Drill, and Dremio. We chose Trino (formerly PrestoSQL) for its performance, scalability, ANSI SQL support, and rich connector ecosystem.

With Trino, we could:

  • Query across multiple OLTP databases and the data lake
  • Avoid heavy ETL pipelines
  • Minimize data movement
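
A federated Trino query looks like a single SQL statement spanning catalogs. The catalog, schema, and table names below are illustrative, not XaasIO's actual ones; the point is that one statement can join PostgreSQL rows with Hive-managed lake data without any ETL step.

```python
# Hypothetical federated Trino query joining a PostgreSQL catalog with a
# Hive catalog over the data lake. With a Trino client (e.g. the `trino`
# Python package), this statement is submitted as-is; Trino resolves each
# catalog to its connector and fans the work out to workers.
federated_sql = """
SELECT c.tenant_id,
       c.plan_name,
       count(*) AS event_count
FROM postgresql.public.tenant_config AS c
JOIN hive.telemetry.security_events AS e
  ON e.tenant_id = c.tenant_id
WHERE e.event_date >= DATE '2024-01-01'
GROUP BY c.tenant_id, c.plan_name
"""

print(federated_sql.strip())
```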

But as data volume exploded, especially in JOIN-heavy queries between OLTP and the data lake, performance bottlenecks emerged.

Bottlenecks and Their Root Causes

1. PostgreSQL Contention

We used PostgreSQL for transactional and real-time read workloads. As analytical queries grew:

  • Replication Lag became a problem due to heavy write loads
  • Long-running Queries clashed with replication, causing query cancellations
  • Write Latency increased with vacuuming, indexing, and maintenance overhead
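
One way to watch the replication lag described above: on the primary, PostgreSQL's `pg_stat_replication` view exposes per-replica WAL positions (PostgreSQL 10+). The query is shown as a string here as a monitoring sketch, not part of XaasIO's stack.

```python
# Per-replica replay lag in bytes, measured on the primary.
# pg_wal_lsn_diff() and pg_current_wal_lsn() are standard PostgreSQL
# functions; run this against the primary with any SQL client.
lag_sql = """
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication
"""

print(lag_sql.strip())
```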

2. Federated Query Limitations

Joining PostgreSQL with Hive-based Parquet/ORC data resulted in:

  • Single-node Execution Bottlenecks (JDBC sequential reads from PostgreSQL)
  • No Join Pushdown (forcing large datasets into Trino memory)
  • Limited Parallelism (PostgreSQL connector lacked split generation)
  • Latency in Multi-source Queries (caused dashboard delays)
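
The missing join pushdown can be illustrated with a toy simulation (this is not Trino internals): without pushdown, every PostgreSQL row crosses the wire into engine memory before the join filter applies; with pushdown, only matching rows travel.

```python
# Toy model: 1,000 OLTP rows, of which only 3 tenants are referenced
# by the analytical side of the join.
oltp_rows = [{"tenant_id": i, "plan": "pro" if i % 10 == 0 else "free"}
             for i in range(1_000)]
lake_keys = {0, 10, 20}  # tenants referenced by the lake-side query

def join_without_pushdown(rows, keys):
    fetched = list(rows)  # the full table crosses the wire
    joined = [r for r in fetched if r["tenant_id"] in keys]
    return len(fetched), joined

def join_with_pushdown(rows, keys):
    fetched = [r for r in rows if r["tenant_id"] in keys]  # filtered at source
    return len(fetched), fetched

moved_all, joined = join_without_pushdown(oltp_rows, lake_keys)
moved_few, _ = join_with_pushdown(oltp_rows, lake_keys)
print(moved_all, moved_few)  # 1000 3
```

The gap between 1,000 rows moved and 3 rows moved is exactly the memory and latency cost the bullet points above describe, multiplied by real table sizes.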

3. Lack of a Unified Source of Truth

We considered two paths:

  • Direct Lakehouse for Microservices: Pointing microservices straight at the lakehouse carried a high, risky migration cost
  • ETL to Data Lake: Batch ETL pipelines introduced data freshness issues

Neither option supported the real-time consistency needed for operational and analytical parity.

The Shift to CDC + Hudi

We needed a hybrid approach that could:

  • Offload hot data from PostgreSQL
  • Ensure real-time consistency between OLTP and analytics
  • Enable scalable, performant lakehouse querying

That’s when we adopted Change Data Capture (CDC) using Debezium and built a CDC pipeline into Apache Hudi.

Architecture Overview

Section 1: CDC Capture

  • PostgreSQL – Source of truth
  • Debezium – Captures changes via WAL replication and streams to Kafka

Section 2: Event Decoration

  • Kafka – Event buffer and stream broker
  • DebeziumConsumer – Transforms events, enriches data, and re-publishes for ingestion
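
The decoration step can be sketched in a few lines. The envelope fields (`before`/`after`/`op`/`ts_ms`/`source`) follow Debezium's change-event convention; the payload values, field names, and the enrichment itself are illustrative, not XaasIO's actual schema.

```python
import json

# Simplified Debezium-style change event for an UPDATE on tenant_config.
raw_event = {
    "before": {"id": 42, "plan": "free"},
    "after": {"id": 42, "plan": "pro"},
    "source": {"table": "tenant_config", "lsn": 123456},
    "op": "u",
    "ts_ms": 1700000000000,
}

def decorate(event, region="us-east-1"):
    """Sketch of the DebeziumConsumer step: flatten the 'after' image and
    enrich it before re-publishing for lake ingestion."""
    record = dict(event["after"] or {})
    record["_op"] = event["op"]
    record["_source_table"] = event["source"]["table"]
    record["_event_ts_ms"] = event["ts_ms"]
    record["_region"] = region  # example enrichment field
    return record

decorated = decorate(raw_event)
print(json.dumps(decorated, sort_keys=True))
```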

Section 3: Lake Ingestion

  • Spark Streaming – Writes data to Hudi with upserts
  • Data Lake (S3/HDFS) – Stores raw and transformed ORC/Parquet files
  • Trino – Query layer across the lakehouse
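
Hudi's upsert semantics, in miniature: records are keyed by a record key and deduplicated by a precombine field (here `ts`), so the latest version of each row wins. This is the behavior Spark Streaming relies on when writing CDC batches to a Merge-on-Read table; the snippet is a pure-Python analogue, not Hudi code.

```python
table = {}  # record_key -> row, standing in for the Hudi table

def upsert(batch):
    """Keep the row with the newest precombine value (ts) per key."""
    for row in batch:
        key = row["id"]
        current = table.get(key)
        if current is None or row["ts"] >= current["ts"]:
            table[key] = row

upsert([{"id": 1, "plan": "free", "ts": 100},
        {"id": 2, "plan": "pro",  "ts": 100}])
upsert([{"id": 1, "plan": "pro",  "ts": 200},   # newer update wins
        {"id": 1, "plan": "free", "ts": 150}])  # stale event is dropped

print(table[1]["plan"])  # pro
```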

Why Apache Hudi?

We chose Hudi over Delta Lake and Iceberg due to:

  • Native support for upserts and incremental ingestion
  • Merge-on-Read (MOR) mode for low-latency updates
  • Efficient storage with Bloom filters, bucketed layout
  • Strong support for real-time analytics
  • Incremental querying to avoid full-table scans
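
Incremental querying works because each Hudi commit stamps records with a commit time, and a reader asks only for records after its last checkpoint instead of rescanning the table (in Hudi this is the incremental query type, e.g. `hoodie.datasource.query.type=incremental`). A pure-Python analogue of the filter:

```python
# Each record carries the commit time of the Hudi commit that wrote it.
records = [
    {"id": 1, "commit_time": "20240101080000"},
    {"id": 2, "commit_time": "20240101090000"},
    {"id": 3, "commit_time": "20240101100000"},
]

def incremental_read(recs, begin_time):
    """Return only records committed after the reader's checkpoint."""
    return [r for r in recs if r["commit_time"] > begin_time]

delta = incremental_read(records, "20240101080000")
print([r["id"] for r in delta])  # [2, 3]
```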

Lessons and Challenges

  • PostgreSQL failover: Added snapshot support and sync workflows
  • Large payloads: Introduced splitting logic for oversized events
  • Column skipping & deduplication: Enhanced beyond Debezium's defaults
  • Latency tuning: Optimized for sub-minute sync across 100+ tables
  • Custom framework: Built for connector management, schema evolution, and alerting
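
The splitting logic for large payloads can be sketched as chunking an oversized event so each piece fits a broker message-size limit, with enough metadata on each part to reassemble downstream. The chunk size and field names here are illustrative, not the production values.

```python
MAX_CHUNK = 16  # bytes; tiny on purpose for the example

def split_payload(key, payload: bytes, max_chunk=MAX_CHUNK):
    """Split an oversized payload into ordered, self-describing parts."""
    chunks = [payload[i:i + max_chunk]
              for i in range(0, len(payload), max_chunk)]
    total = len(chunks)
    return [{"key": key, "part": i, "of": total, "data": c}
            for i, c in enumerate(chunks)]

def reassemble(parts):
    """Downstream consumer: restore the original payload from its parts."""
    ordered = sorted(parts, key=lambda p: p["part"])
    return b"".join(p["data"] for p in ordered)

parts = split_payload("tenant_42", b"a" * 40)
print(len(parts), reassemble(parts) == b"a" * 40)  # 3 True
```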

The Outcome: A Real-Time Analytical Backbone

Offloaded PostgreSQL

Reduced read load on PostgreSQL by more than 10,000 QPS, boosting OLTP reliability

10x Faster Dashboards

Trino + Hudi enabled fast, federated queries at scale

Sub-minute Data Sync

OLTP and analytics now operate in near real-time harmony

Asynchronous Cache Sync

Leveraging CDC events to keep Redis caches fresh and consistent without DB hits
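
This pattern can be sketched with a dict standing in for Redis: update events refresh the cached value and delete events evict it, so readers never have to hit the database for freshness. The key scheme and event shape are illustrative; the `op` codes follow Debezium's convention (`c` create, `u` update, `d` delete, `r` snapshot read).

```python
cache = {}  # stand-in for Redis

def apply_cdc_to_cache(event):
    """Project a Debezium-style change event onto the cache."""
    table = event["source"]["table"]
    if event["op"] in ("c", "u", "r"):
        row = event["after"]
        cache[f"{table}:{row['id']}"] = row
    elif event["op"] == "d":
        row = event["before"]
        cache.pop(f"{table}:{row['id']}", None)

apply_cdc_to_cache({"source": {"table": "tenant_config"}, "op": "u",
                    "after": {"id": 7, "plan": "pro"}, "before": None})
apply_cdc_to_cache({"source": {"table": "tenant_config"}, "op": "c",
                    "after": {"id": 8, "plan": "free"}, "before": None})
apply_cdc_to_cache({"source": {"table": "tenant_config"}, "op": "d",
                    "after": None, "before": {"id": 7, "plan": "pro"}})

print(sorted(cache))  # ['tenant_config:8']
```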

Reduced Infra Cost

Extended PostgreSQL lifespan and cut unnecessary scaling
