Have you ever felt that traditional data solutions are like using a rocket launcher to light a candle? Yes, we have been there too. At Akua, we have been through the journey of complex data architectures - from data warehouses powered by Pentaho (a complete data integration and Business Intelligence suite including ETL, reporting and visualization) to near real-time data lakes built on:
Apache Spark: A distributed processing engine that enables processing large volumes of data in memory
Amazon EMR (Elastic MapReduce): AWS managed service for running frameworks such as Spark, Hive, and others
Delta Lake: A storage layer that brings ACID transactionality to data lakes
Amazon Kinesis: Service for real-time data ingestion and processing
Amazon S3: Scalable object storage
While these are powerful tools and in many cases make sense to implement, they often felt like taking a Formula 1 car to the grocery store - impressive but impractical for our current needs.
The challenge: finding our "levitating train"
As Mathias Parodi, our Head of Engineering, often says, we needed a solution that “levitates like a train”: something that would effortlessly glide through our data needs while keeping its feet on the ground. Our requirements were crystal clear:
Maintenance should be a piece of cake, not a nightmare
Perfect fit for our current scale and 2-3 year horizon
Economical but lightning fast
Delivering immediate value to the business
The solution: embracing simplicity with power
The foundation: PostgreSQL as our data warehouse
Instead of jumping on the latest data fad bandwagon, we took a step back and looked at traditional PostgreSQL with fresh eyes. This battle-tested database became our data warehouse, starting with a “raw data” layer (data stored exactly as it appears in the source operational databases). It is fed by real-time replication via AWS DMS (Database Migration Service, which replicates data in near real time between different databases) for our relational sources, and by DynamoDB Streams (a service that captures real-time changes to DynamoDB tables) together with AWS Lambda (a serverless service that runs code in response to events) for our NoSQL data.
This approach gave us:
Real-time replication with latency less than 1 second
Zero data loss thanks to DMS checkpoint mechanisms
In-flight transformations using DMS mapping capabilities
Serverless processing that automatically scales with load
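To make the NoSQL path concrete, here is a minimal sketch of the kind of Lambda handler that can land DynamoDB stream records in the raw layer as JSONB. It is a sketch rather than production code: the raw_dynamodb table, the payload column, and the id key are hypothetical names, and it assumes psycopg2 is available to the function (for example via a Lambda layer).

```python
import json
import os

import psycopg2  # assumed to be packaged as a Lambda layer
from boto3.dynamodb.types import TypeDeserializer

# Hypothetical target table: raw_dynamodb(pk text primary key, payload jsonb)
deserializer = TypeDeserializer()


def handler(event, context):
    """Replicate DynamoDB stream records into a PostgreSQL 'raw data' table."""
    conn = psycopg2.connect(os.environ["PG_DSN"])  # e.g. "host=... dbname=warehouse ..."
    with conn, conn.cursor() as cur:
        for record in event["Records"]:
            if record["eventName"] == "REMOVE":
                continue  # deletes could instead be handled with a soft-delete flag
            image = record["dynamodb"]["NewImage"]
            # Convert the DynamoDB wire format ({"S": "..."}, {"N": "..."}) to plain Python types
            item = {k: deserializer.deserialize(v) for k, v in image.items()}
            cur.execute(
                """
                INSERT INTO raw_dynamodb (pk, payload)
                VALUES (%s, %s::jsonb)
                ON CONFLICT (pk) DO UPDATE SET payload = EXCLUDED.payload
                """,
                (item["id"], json.dumps(item, default=str)),
            )
    conn.close()
    return {"processed": len(event["Records"])}
```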
Handling semi-structured data like a champ
One of our biggest successes? Leveraging PostgreSQL’s JSONB capabilities (a binary data type that stores JSON documents in an optimized way) to handle DynamoDB data without breaking a sweat. Advantages of the JSONB type include:
Compressed and efficient storage
GIN indexing for fast searches within JSON
Native operators for querying and manipulating JSON data
Full support for standard SQL queries
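In practice the pattern boils down to a GIN index plus containment queries. Here is an illustrative sketch against the hypothetical raw_dynamodb table from the previous example; the schema and field names (status, merchant_id) are assumptions, not our actual model.

```python
import psycopg2

# Hypothetical raw table holding replicated DynamoDB items as JSONB
DDL = """
CREATE TABLE IF NOT EXISTS raw_dynamodb (
    pk      text PRIMARY KEY,
    payload jsonb NOT NULL
);
-- GIN index so containment (@>) and key-existence (?) predicates stay fast
CREATE INDEX IF NOT EXISTS raw_dynamodb_payload_gin
    ON raw_dynamodb USING GIN (payload);
"""

QUERY = """
SELECT payload->>'merchant_id' AS merchant_id,
       count(*)                AS payments
FROM   raw_dynamodb
WHERE  payload @> '{"status": "APPROVED"}'   -- served by the GIN index
GROUP  BY 1
ORDER  BY payments DESC
LIMIT  10;
"""

with psycopg2.connect("dbname=warehouse") as conn, conn.cursor() as cur:
    cur.execute(DDL)
    cur.execute(QUERY)
    for merchant_id, payments in cur.fetchall():
        print(merchant_id, payments)
```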
The performance? Simply mind-blowing. In our tests, we got:
Queries with predicates on JSON fields in less than 50ms
Aggregates over millions of records in seconds
Efficient joins between structured and semi-structured data
This approach gave us the flexibility of NoSQL with the reliability of a traditional warehouse.
Infrastructure as Code: The Magic of IDP
Remember our Internal Development Platform (IDP)? Our platform team put on their data engineering hats and did some magic. In just two weeks, we had a fully automated real-time data replication system. No manual interventions, just good old automation.
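The IDP itself deserves its own post, but to give a flavor of the kind of resource it provisions, here is a minimal sketch of a DMS replication task defined with AWS CDK in Python. The choice of CDK, the stack name, and the endpoint parameters are assumptions for illustration, not a description of our actual templates.

```python
from aws_cdk import Stack
from aws_cdk import aws_dms as dms
from constructs import Construct


class RawReplicationStack(Stack):
    """One CDC replication task per source database, provisioned from code."""

    def __init__(self, scope: Construct, construct_id: str, *,
                 source_endpoint_arn: str, target_endpoint_arn: str,
                 replication_instance_arn: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        dms.CfnReplicationTask(
            self, "PaymentsRawTask",
            migration_type="full-load-and-cdc",        # initial load, then ongoing change capture
            replication_instance_arn=replication_instance_arn,
            source_endpoint_arn=source_endpoint_arn,   # the operational database
            target_endpoint_arn=target_endpoint_arn,   # the PostgreSQL "raw data" layer
            table_mappings='{"rules": [{"rule-type": "selection", "rule-id": "1", '
                           '"rule-name": "1", "object-locator": '
                           '{"schema-name": "public", "table-name": "%"}, '
                           '"rule-action": "include"}]}',
        )
```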
AI-powered data modeling
This is where things get interesting. We combined our carefully crafted entity-relationship diagrams with AI to design a future-proof data mart of facts and dimensions. But we didn’t stop there – we built an AI-powered system that automatically detects and adapts to new tables. It’s like having a data model that evolves on its own and grows with your business.
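The modeling pipeline itself is beyond the scope of this post, but the detection half is conceptually simple. The sketch below (with hypothetical schema and registry table names) diffs the raw layer against a registry of already-modeled tables; the AI step that proposes the fact or dimension mapping is only hinted at in a comment.

```python
import psycopg2

REGISTRY_SQL = "SELECT table_name FROM dm_meta.modeled_tables"  # hypothetical registry
DISCOVERY_SQL = """
SELECT table_name
FROM   information_schema.tables
WHERE  table_schema = 'raw'          -- hypothetical raw-layer schema
  AND  table_type   = 'BASE TABLE'
"""


def find_unmodeled_tables(dsn: str) -> list[str]:
    """Return raw-layer tables that have no fact/dimension model yet."""
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        cur.execute(REGISTRY_SQL)
        modeled = {row[0] for row in cur.fetchall()}
        cur.execute(DISCOVERY_SQL)
        discovered = {row[0] for row in cur.fetchall()}
    return sorted(discovered - modeled)


if __name__ == "__main__":
    for table in find_unmodeled_tables("dbname=warehouse"):
        # In the real system this is where the table's DDL and sample rows would be
        # fed to an LLM prompt that proposes a fact or dimension mapping for review.
        print(f"new table needs modeling: {table}")
```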
Orchestration: Keep it simple, keep it real-time
While tools like Airflow are great, we chose a different path. Using n8n (an open-source automation platform that lets you build complex workflows through a visual interface), we built our orchestration layer. n8n gives us:
Over 200 pre-built integrations with popular services
Ability to run custom code in JavaScript
Visual interface for designing and debugging flows
Webhooks and scheduled triggers
Integrated queuing system for asynchronous processes
Error handling and automatic retries
With these capabilities, we built an orchestration system that includes:
Real-time data synchronization
Slack notifications for data quality issues
Automated monitoring and alerts
Integrated data quality controls
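As an illustration of how these pieces talk to each other, a data-quality check can hand its result to an n8n workflow through a webhook trigger, which then takes care of the Slack alert. The URL and payload fields below are hypothetical placeholders.

```python
import requests

# Hypothetical n8n webhook URL; in n8n this is the trigger node of a workflow
# that posts to Slack and logs the incident.
N8N_WEBHOOK = "https://n8n.internal.example.com/webhook/data-quality-alert"


def report_quality_issue(table: str, check: str, failed_rows: int) -> None:
    """Hand a failed data-quality check to the n8n workflow for alerting."""
    resp = requests.post(
        N8N_WEBHOOK,
        json={"table": table, "check": check, "failed_rows": failed_rows},
        timeout=10,
    )
    resp.raise_for_status()  # let the caller decide how to handle a dead webhook


if __name__ == "__main__":
    report_quality_issue("fact_payments", "non_null_merchant_id", failed_rows=42)
```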
The result: a data platform that just works
Our final architecture delivers:
A clean and well-structured data mart with:
Fact tables for payments
Dimension tables for customers, merchants, payment instruments
Complete denormalization for ultra-fast queries
Zero ETL integration with Amazon Redshift (a cloud data warehouse service optimized for analytics), letting it do what it does best - ultra-fast columnar queries without the overhead of joins. Redshift provides us with:
Columnar storage that dramatically reduces I/O on analytical queries
Massively parallel processing (MPP) for distributing queries
Automatic compression based on data type
Separate scaling of compute and storage
Ability to query data directly in S3 (Redshift Spectrum)
Automatic query optimization and maintenance
With Zero ETL functionality, we achieve:
Automatic replication from PostgreSQL to Redshift
Near real-time synchronization (less than 2 minutes latency)
No need for additional ETL pipelines
Consistency guaranteed between source and destination
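Once the data lands in Redshift, querying it is plain SQL. Below is a sketch using the Redshift Data API via boto3; the workgroup, database, and fact-table columns are illustrative assumptions, and the query deliberately avoids joins because the mart is denormalized.

```python
import time

import boto3

client = boto3.client("redshift-data")

# Hypothetical workgroup/database names; fact_payments arrives via the Zero ETL integration
SQL = """
SELECT merchant_name, SUM(amount) AS total_amount
FROM   fact_payments                               -- denormalized, so no joins needed
WHERE  payment_date >= DATEADD(day, -7, GETDATE())
GROUP  BY merchant_name
ORDER  BY total_amount DESC
LIMIT  10;
"""

stmt = client.execute_statement(
    WorkgroupName="analytics",      # Redshift Serverless workgroup (assumption)
    Database="analytics_db",
    Sql=SQL,
)

# Poll until the statement finishes, then fetch the result set
while True:
    desc = client.describe_statement(Id=stmt["Id"])
    if desc["Status"] in ("FINISHED", "FAILED", "ABORTED"):
        break
    time.sleep(1)

if desc["Status"] == "FINISHED":
    for record in client.get_statement_result(Id=stmt["Id"])["Records"]:
        print([list(field.values())[0] for field in record])
else:
    print("query failed:", desc.get("Error"))
```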
Visualization: Metabase for the win
After years of struggling with “world-class” (read: complicated and expensive) BI tools, we found our perfect match in Metabase, an open source Business Intelligence platform designed to be easy to use without sacrificing power. Why?
Metabase gives us enterprise features without the traditional complexity:
A query builder that lets you create analyses without knowing SQL
Ability to write direct SQL queries when needed
Smart cache system for frequent queries
Granular access control at table and row level
SSO and LDAP authentication
Full API for automation and integration
Embedded Analytics via SDK
One-click dashboard creation
AI-powered features that are constantly improving
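Our automation against Metabase goes through its REST API. The sketch below authenticates, then runs a saved question and gets its rows back as JSON; the host and card ID are placeholders, not real endpoints.

```python
import os

import requests

METABASE = "https://metabase.internal.example.com"   # hypothetical host
CARD_ID = 42                                          # hypothetical saved question id

# Authenticate and get a session token (Metabase's standard session endpoint)
session = requests.post(
    f"{METABASE}/api/session",
    json={"username": os.environ["MB_USER"], "password": os.environ["MB_PASSWORD"]},
    timeout=10,
).json()

headers = {"X-Metabase-Session": session["id"]}

# Run a saved question and fetch its result rows as JSON
rows = requests.post(
    f"{METABASE}/api/card/{CARD_ID}/query/json",
    headers=headers,
    timeout=30,
).json()

for row in rows[:5]:
    print(row)
```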
The impact: from zero to live in 30 days
In just one month, we built:
A self-managed data warehouse
Automatic ingestion for new data sources
Self-evolving data mart
Sub-100ms response times on millions of records
The secret? Simplicity
This wasn’t just another technology implementation – it was rethinking how modern data platforms should work. By choosing simplicity over complexity, automation over manual processes, and practical solutions over trendy technology, we built something that truly serves our needs.
The result? A data platform as agile as a startup needs to be, but robust enough to handle enterprise-scale data. It's living proof that sometimes the best solutions aren't the most complex - they're the ones that fit your needs perfectly while leaving room to grow.
Special thanks to the amazing team at Akua, most notably Luispe Toloy and German Yepes and our extended family of consultants who helped shape this elegant, efficient and remarkably simple solution. This is what happens when experience meets innovation, and we couldn’t be more proud of the result.