For years, data architecture discussions often felt like choosing between two imperfect options, forcing a difficult compromise. On one side, the traditional data warehouse offered structure, reliability, and powerful analytics. However, it was often rigid, expensive, and tended to create isolated data silos. On the other side, the data lake provided flexibility, cost-effectiveness, and vast storage, but frequently became a chaotic, unreliable “data swamp.” This tension created a persistent challenge for leaders striving to build a data strategy that was both agile and trustworthy. This dilemma mirrors a broader business challenge: how to balance stability with innovation.
The good news is that this debate is rapidly becoming obsolete. A new architectural paradigm, the Lakehouse, has emerged to resolve this conflict by combining the best features of both worlds. As Databricks defines it, a lakehouse is “a new, open architecture that combines the best elements of data lakes and data warehouses.” At the heart of this evolution are open table formats—a crucial technology that brings reliability, performance, and governance to the vast, low-cost storage of a data lake. This article will serve as your essential guide to this new landscape. We will explore what these formats are, how they enable the revolutionary separation of storage and compute, how the leading contenders compare, and most importantly, how you can choose the right one for your organization.
Now that we’ve set the stage, let’s take a closer look at why the separation of storage and compute is so revolutionary.
Data Lake ≠ Lakehouse
Despite the similarity of their names, they are not the same thing. A Data Lake is a centralized repository designed to store raw, unstructured, semi-structured, and structured data at scale, offering flexibility in data ingestion and storage without immediate schema enforcement. In contrast, a Lakehouse combines the flexible storage of a Data Lake with the structured querying capabilities and robust data governance typically found in traditional Data Warehouses. This unified architecture enables both high-performance analytics and data science workloads, merging the scalability of data lakes with the transactional consistency and schema enforcement of data warehouses.
Why Separating Storage and Compute Changes Everything
To truly appreciate the transformation the Lakehouse represents, it’s helpful to understand the architecture it replaces. Traditional data warehouses were built on a coupled architecture. This means the systems that store data and the systems that process it were inextricably linked. Imagine a classic restaurant where the pantry (storage) is physically inseparable from the kitchen (compute). If you need a bigger pantry, you must build a bigger kitchen, and vice versa. This model works at a small scale, but it imposes severe limitations. Scaling one component requires scaling the other, leading to exorbitant costs and operational inflexibility, especially when your data or query volumes grow unpredictably.
The Lakehouse architecture fundamentally breaks this dependency. It embraces a new model where storage and compute are decoupled. Your data resides in vast, low-cost object stores like Amazon S3 or Google Cloud Storage. Meanwhile, powerful query engines like Spark, Trino, or Flink can be spun up on demand to process that data. This is akin to having a massive, centralized pantry that multiple, specialized kitchens can access as needed—a pop-up bistro for quick queries, or a large commercial kitchen for heavy-duty analytics. The benefits are profound. As the glossary from Secoda notes, “Separating compute and storage can offer several benefits to businesses, including cost-efficiency, scalability, and flexibility.” You pay only for the compute resources you use, and you can scale your storage and processing power independently, aligning costs directly with value.
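To make the decoupling concrete, here is a minimal sketch of "compute on demand": an ephemeral PySpark session reading Parquet files that live permanently in object storage. The bucket path and column names are hypothetical, and the snippet assumes the S3 connector (hadoop-aws) is available on the cluster.

```python
from pyspark.sql import SparkSession

# Compute: an ephemeral Spark session that can be created and torn down at will.
spark = SparkSession.builder.appName("decoupled-compute-demo").getOrCreate()

# Storage: Parquet files sitting in a (hypothetical) S3 bucket, independent of any engine.
events = spark.read.parquet("s3a://analytics-bucket/events/")

# Run an aggregation, then release the compute; the data stays exactly where it is.
events.groupBy("event_type").count().show()
spark.stop()
```

A Trino or Flink cluster could scan the very same files independently, which is the essence of the decoupled model.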
However, this separation creates a new and critical problem: if the data is just a collection of files in a storage bucket, how does a query engine know which files constitute the current, correct version of a table? How does it handle simultaneous reads and writes without corrupting the data? This was the crucial bridge that was missing, the challenge that prevented data lakes from truly supporting mission-critical analytics.
So, how do we manage data in a decoupled system? That’s where open table formats come in.
What Are Open Table Formats, Really?
At their core, open table formats are a metadata layer that brings structure, reliability, and performance to raw data files stored in a data lake. They act as a sophisticated management system, transforming a simple collection of Parquet or ORC files into reliable, high-performance tables that can be queried like a traditional database. They are, in essence, the missing piece that elevates a data lake into a true Lakehouse.
Perhaps the best analogy is to think of them as a digital card catalog for your data lake library. A library without a catalog is just a warehouse of books; finding anything is a chaotic, manual process. The card catalog, however, doesn’t store the books themselves. Instead, it tracks what books are available, where they are located on the shelves, who has checked them out, and the history of different editions. Similarly, open table formats don’t store the data; they store metadata about the data files, enabling powerful capabilities that were once exclusive to data warehouses.
These capabilities are built on a few core benefits:
- ACID Transactions: This is the feature that guarantees data reliability. ACID (Atomicity, Consistency, Isolation, Durability) ensures that every transaction is an “all or nothing” operation. As Databricks documentation explains, this prevents data corruption from failed writes or concurrent operations, ensuring that users querying the data always see a consistent and correct version.
- Time Travel: Think of this as the ultimate undo button for your data. These formats maintain a versioned history of the table, allowing you to query the data as it existed at a specific point in time. This is invaluable for auditing, debugging data pipelines, or recovering from accidental updates or deletes (see the short sketch after this list).
- Schema Evolution: This capability allows your data structure to change without breaking everything. As business needs evolve, you may need to add new columns, rename existing ones, or change data types. Schema evolution provides the flexibility to make these changes safely without rewriting the entire dataset or breaking downstream pipelines.
- Openness: This is perhaps the most strategic benefit. Because these formats are open source, they prevent vendor lock-in. You are not tied to a single vendor’s proprietary format, allowing you to use a diverse ecosystem of query engines and tools on the same data.
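To ground these benefits, here is a minimal sketch using Delta Lake with PySpark (the same ideas exist in Iceberg and Hudi with different syntax). It assumes the delta-spark package is installed; the table path and columns are purely illustrative.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Build a Spark session with the Delta Lake extensions enabled (delta-spark package).
builder = (SparkSession.builder.appName("delta-benefits-sketch")
           .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
           .config("spark.sql.catalog.spark_catalog",
                   "org.apache.spark.sql.delta.catalog.DeltaCatalog"))
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/tmp/demo/orders"  # hypothetical table location

# ACID write: the transaction log turns this into an all-or-nothing operation.
spark.createDataFrame([(1, "EUR")], ["order_id", "currency"]) \
    .write.format("delta").mode("overwrite").save(path)

# Schema evolution: append rows carrying a new column without rewriting existing data.
spark.createDataFrame([(2, "USD", 42.0)], ["order_id", "currency", "amount"]) \
    .write.format("delta").option("mergeSchema", "true").mode("append").save(path)

# Time travel: read the table exactly as it existed at version 0.
spark.read.format("delta").option("versionAsOf", 0).load(path).show()
```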
Now that we understand the what and why, let’s meet the key players in the open table format landscape.
Meet the Contenders
The open table format space is dominated by three major players, each with a distinct origin story and a unique set of strengths. Delta Lake emerged from Databricks to optimize Spark workloads. Apache Iceberg was born at Netflix (and is now heavily backed by companies like Apple and AWS) to solve correctness problems at a massive scale. Apache Hudi was created at Uber to handle high-throughput streaming data ingestion.
Understanding their differences is key to making a strategic choice.
| Feature / Axis | Delta Lake | Apache Iceberg | Apache Hudi |
|---|---|---|---|
| Primary Backer | Databricks | Broad (AWS, Apple, Netflix, Snowflake) | Broad (Uber, Onehouse) |
| Core Philosophy | Simplicity & Deep Spark Integration | Engine Agnosticism & Correctness | Streaming & Incremental Processing |
| ACID Support | Yes | Yes | Yes |
| Time Travel | Yes | Yes | Yes |
| Schema Evolution | Yes | Yes | Yes |
| Partition Evolution | Limited | Yes | Yes |
| View Support | Yes | Yes | Yes |
| Write Modes | Copy-on-Write, Merge-on-Read (deletion vectors) | Copy-on-Write, Merge-on-Read | Copy-on-Write, Merge-on-Read |
| Key Integrations | Spark, Databricks Ecosystem | Spark, Flink, Trino, Snowflake, Dremio | Spark, Flink, Presto |
| Community Momentum | Strong (large user base) | Very Strong (fastest growing) | Strong (specialized community) |
Delta Lake: The Mature Choice for the Spark Ecosystem
Delta Lake is the most mature of the three formats and is the default choice within the Databricks ecosystem. Its greatest strength is its seamless integration with Apache Spark, making it incredibly easy to adopt for teams already proficient in Spark. Delta Lake provides robust ACID (Atomicity, Consistency, Isolation, Durability) transactions through its transaction log, ensuring reliable writes and reads. Additionally, it supports efficient data versioning and time travel queries, enabling users to access historical snapshots of data effortlessly. Delta Lake also handles schema evolution gracefully, allowing changes to schemas without disrupting existing data pipelines. For organizations deeply invested in Databricks, Delta Lake offers a polished, unified experience with optimized performance through features like Z-order indexing and data skipping, which significantly accelerate query performance on large datasets.
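As a small, hedged illustration of those optimizations, this sketch compacts and Z-orders an existing Delta table with the delta-spark Python API (available since Delta Lake 2.0); it reuses the Delta-enabled `spark` session from the earlier sketch, and the path and column are hypothetical.

```python
from delta.tables import DeltaTable

# Bind to an existing Delta table by path (the hypothetical table from the earlier sketch).
orders = DeltaTable.forPath(spark, "/tmp/demo/orders")

# Compact small files and co-locate rows by currency so data skipping can prune
# files when queries filter on that column.
orders.optimize().executeZOrderBy("currency")

# Inspect the table history to confirm the OPTIMIZE commit.
orders.history().select("version", "operation").show()
```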
Storage optimization is key
Ordering and partitioning are critical for optimizing performance in open table formats. Partitioning organizes data by key fields (like date or region), reducing the amount of data scanned during queries, while ordering ensures data is sorted within files, enhancing features like data skipping, indexing, and compression. Together, they significantly boost query speed, lower costs, and make operations like upserts and incremental reads far more efficient—especially at scale.
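As a rough sketch of those two levers, here is a partitioned, sorted write with PySpark; it reuses the hypothetical events data and `spark` session from earlier, and the column names are invented.

```python
# Assuming the hypothetical events data read earlier from object storage.
events = spark.read.parquet("s3a://analytics-bucket/events/")

(events
    .repartition("event_date")                      # shuffle so each date lands together
    .sortWithinPartitions("event_date", "user_id")  # order rows inside each output file
    .write
    .format("delta")
    .partitionBy("event_date")                      # enables partition pruning on event_date
    .mode("overwrite")
    .save("/tmp/demo/events_partitioned"))
```

The same layout principles, partition pruning plus in-file ordering, apply to Iceberg and Hudi tables as well.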
Apache Iceberg: The Engine-Agnostic Unifier
Apache Iceberg was designed from the ground up to be engine-agnostic, focusing on absolute correctness and predictable performance, regardless of data scale. Its core innovation is tracking data at the individual file level rather than just the directory level. This approach solves long-standing issues with partition evolution, allowing tables to evolve schemas and partitions smoothly without expensive rewrites or downtime. Iceberg maintains a comprehensive metadata layer (table metadata stored as JSON, manifest files stored as Apache Avro) that provides fast query planning and schema validation. Furthermore, Iceberg enables advanced snapshot isolation, incremental data scans, and time-travel queries, significantly enhancing its usability for complex analytics scenarios. Its neutrality and robustness have led to extensive support across the industry, with major players like AWS, Snowflake, Google, and Dremio embracing it as a first-class format. As industry analyst Kai Waehner notes, Iceberg is rapidly becoming the standard “open table format for Lakehouse AND Data Streaming.”
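The following is a brief, hedged sketch of Iceberg’s hidden partitioning and in-place partition evolution via Spark SQL. It assumes a Spark session configured with an Iceberg catalog named `demo` and the Iceberg SQL extensions; the table and columns are invented for illustration.

```python
# Hidden partitioning: partition by a transform of a column, not by an extra column.
spark.sql("""
    CREATE TABLE demo.analytics.events (
        event_id BIGINT,
        user_id  BIGINT,
        event_ts TIMESTAMP
    )
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Partition evolution: change the layout for future writes without rewriting old data.
# (Requires the Iceberg Spark SQL extensions to be enabled on the session.)
spark.sql("ALTER TABLE demo.analytics.events ADD PARTITION FIELD bucket(16, user_id)")
```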
Apache Hudi: The Streaming Specialist
Apache Hudi (short for “Hadoop Upserts Deletes and Incrementals”) excels where others were initially less focused: high-throughput, low-latency data ingestion. It was specifically built to manage the demands of Change Data Capture (CDC) and streaming data directly from databases and event logs into data lakes. Hudi offers advanced capabilities like incremental queries, allowing efficient retrieval of recent changes without scanning entire datasets. It supports record-level indexing through its built-in indexing mechanisms, greatly improving performance for updates and deletes on large-scale datasets. Hudi also provides sophisticated file-sizing and compaction optimizations to maintain optimal performance and storage efficiency. With its emphasis on streaming use cases, it integrates seamlessly with streaming technologies like Apache Flink and Apache Kafka, making it an excellent choice for real-time analytics and fast data availability requirements.
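As a hedged sketch of the upsert-centric workflow Hudi was built for, here is a PySpark write using the hudi-spark bundle; the table name, path, and fields are placeholders.

```python
# Upsert a batch of change records into a Hudi table keyed by order_id.
# Assumes the hudi-spark bundle is on the Spark classpath.
hudi_options = {
    "hoodie.table.name": "orders_cdc",
    "hoodie.datasource.write.recordkey.field": "order_id",
    "hoodie.datasource.write.precombine.field": "updated_at",    # latest version of a key wins
    "hoodie.datasource.write.partitionpath.field": "order_date",
    "hoodie.datasource.write.operation": "upsert",
}

changes = spark.createDataFrame(
    [(1, "2024-01-01", "2024-01-02T10:00:00", "SHIPPED")],
    ["order_id", "order_date", "updated_at", "status"],
)

(changes.write
    .format("hudi")
    .options(**hudi_options)
    .mode("append")            # append mode + upsert operation updates existing keys
    .save("/tmp/demo/orders_cdc"))
```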
Apache Paimon: The Hybrid Analytical and Transactional Format
I know, this one wasn’t on the list, but this recent newcomer deserves an entry in this article. Although it could be seen as quite a niche player, in my opinion it solves some conceptual questions. Apache Paimon (formerly known as Flink Table Store) is designed explicitly to bridge analytical and transactional workloads within a unified table format. Its defining characteristic is native support for hybrid transactional/analytical processing (HTAP), enabling it to handle both real-time updates and large-scale batch analytics efficiently. Paimon utilizes an optimized columnar storage format, offering excellent compression and query performance. It supports rich metadata management, schema evolution, and powerful indexing strategies such as Bloom filters and sorted runs, significantly accelerating point lookups and range queries. Another notable feature is its support for incremental snapshots and continuous compaction, which keep data fresh and query latency consistently low. Apache Paimon integrates deeply with Apache Flink for streaming workloads and Apache Spark for batch analytics, making it attractive for organizations that require versatility and robust transactional capabilities within a single, cohesive platform.
How to Choose the Right Format for Your Business
There is no single “best” open table format. The optimal choice is entirely dependent on your organization’s context, priorities, and technical landscape. To navigate this decision, you shouldn’t ask “Which format is better?” but rather “Which format is the best fit for us?” This framework provides a series of strategic questions to guide your evaluation.
Key Questions for Your Team:
- What is your primary workload? Are you focused on large-scale batch analytics and reporting, where consistency is paramount? Or is your priority real-time data ingestion from operational databases via Change Data Capture (CDC)? Or perhaps a mix of both?
- What is your existing (or target) ecosystem? Is your organization heavily invested in the Databricks platform and Apache Spark? Or are you building a multi-engine strategy that requires interoperability between tools like Trino, Flink, and Snowflake? Is your cloud strategy centered on a specific provider like AWS?
- What is your team’s expertise? Does your team have deep, specialized knowledge in Spark? Or is it composed of general SQL analysts who need a simple, reliable interface to the data?
- What is your highest priority? Are you optimizing for simplicity and ease of use within a unified platform? Or is your primary goal to maintain maximum flexibility and avoid vendor lock-in? Or are you pushing the boundaries with cutting-edge streaming capabilities?
Use Case Mapping: A Practical Guide
Based on the answers to those questions, you can map your needs to the most suitable format:
- If you are… heavily invested in the Databricks platform and your primary workloads are batch and streaming analytics within that ecosystem, …then consider Delta Lake. Its tight integration provides a seamless and powerful user experience.
- If you are… building a future-proof, multi-engine data architecture and want to avoid vendor lock-in while ensuring long-term interoperability, …then consider Apache Iceberg. Its broad industry support and engine-agnostic design make it the safest bet for flexibility.
- If you are… focused on real-time CDC, incremental data processing, and building streaming pipelines from operational databases into the lake, …then consider Apache Hudi. Its specialized features for write-heavy, low-latency workloads give it a distinct advantage in this domain.
- If you… just don’t know, …then my advice is to go for Apache Iceberg, as it is becoming more and more popular (to the point of almost being considered a de facto standard).
Looking ahead, what does the future hold for these open table formats and the lakehouse architecture?
Will One Format Rule Them All?
The current landscape is often framed as the “format wars,” but the reality is more nuanced. While there is healthy competition, each format has carved out a strong position, and we are likely heading towards a multi-format world for the foreseeable future. However, some clear trends are emerging.
Apache Iceberg’s momentum is undeniable. Its design philosophy of neutrality and correctness has resonated across the industry, leading to widespread adoption from nearly every major data vendor outside of Databricks. This broad coalition makes Iceberg the leading candidate to become the de facto open standard for interoperability—the “lingua franca” of the lakehouse that allows different engines to communicate with the same data.
At the same time, Delta Lake’s position is deeply entrenched. With a massive existing user base via Databricks and a rich, mature feature set, it will remain a dominant force, particularly within its native ecosystem. Databricks is also pushing for broader adoption with the open-sourcing of the entire Delta Lake specification.
Apache Hudi will continue to thrive as a powerful specialist. For the high-value, high-complexity use cases around real-time data ingestion and CDC, its unique capabilities give it a durable competitive advantage.
Ultimately, the future may not be about one format winning but about peaceful coexistence and interoperability. The emergence of tools like Apache XTable (formerly OneTable), which can translate between formats, suggests a future where organizations can leverage the best format for a specific job without being locked into a single choice.
In the end, the power is in your hands to choose the right path for your data strategy.
Final remarks: Your Data, Your Rules
We’ve moved beyond the false choice between the rigid data warehouse and the chaotic data lake. The modern Lakehouse, enabled by the separation of storage and compute, offers a path to a more flexible, scalable, and cost-effective data architecture. Open table formats like Delta Lake, Apache Iceberg, and Apache Hudi are the critical technology that makes this possible, bringing reliability and performance to your data lake. The choice between them is not merely technical but deeply strategic, impacting your organization’s agility, costs, and future options.
The growth of this market is a testament to its value. The data lake market was valued at 5.80 billion USD in 2022 and is projected to grow significantly, reaching 34.07 billion USD by 2030 (Fortune Business Insights). The benefits are tangible. A 2024 survey from Dremio on the state of the data lakehouse found that over half (56%) of organizations expect to save more than 50% on their total cost of ownership for analytics by moving to this architecture. The opportunity is clear.
Start the conversation internally. Use the framework in this article to evaluate your workloads, your ecosystem, and your strategic priorities. By doing so, you can take the first, most important step toward building a flexible, future-proof data architecture that truly serves the needs of your business. Your data, your rules.