Open Table Formats: Iceberg, Delta Lake, and Hudi — The War Nobody Told Your Data Team About

Somewhere around 2016, the data engineering world collectively realized that storing data as flat files in S3 was a great idea but came with a catastrophic flaw: you couldn't update a row. You could append. You could overwrite entire partitions. But GDPR showed up, users wanted their data deleted, and "sorry, we'd need to rewrite 50 TB of Parquet" wasn't an acceptable answer.

Three separate engineering teams at Netflix, Databricks, and Uber reached the same conclusion at roughly the same time: what if we added a metadata layer on top of object storage that gave you ACID transactions, schema evolution, time travel, and row-level updates — without a server? What if the "database" was just a set of files and a spec for reading them?

The result was Apache Iceberg, Delta Lake, and Apache Hudi — three open table formats that spent several years trying to kill each other and are now converging in ways nobody predicted. This is the story of what they actually are, how they work internally, and how to pick one without regretting it two years from now.

Why This Matters: The Hive Problem

Before open table formats, the standard approach to analytical data on object storage was Hive tables — a metadata store that tracked which Parquet files belonged to which partition. It worked, barely, until it didn't. Hive tables had no concept of atomic transactions: two writers could corrupt each other's output. Schema changes were painful and often required rewriting the entire dataset. There was no time travel — once you overwrote a partition, the old data was gone. And reading a table required listing all the files in all the partitions, which on S3 means expensive and slow list API calls that scale linearly with the number of partitions.

Open table formats solve all of these problems with a metadata layer — a set of manifest files that tell readers exactly which data files constitute a snapshot of the table, without directory listing. This is the core architectural insight: tracking files via metadata rather than inferring them from directory structure. It sounds simple. The consequences are profound.

Apache Iceberg: The Spec-First Format

Netflix open-sourced Iceberg in 2018. The defining characteristic of Iceberg isn't any particular feature; it's that Iceberg is a specification first, implementation second. The spec defines exactly how metadata must be structured, and any engine that implements the spec can read any Iceberg table written by any other engine. This portability is what's driven Iceberg's adoption in 2025: it's genuinely engine-neutral.

Metadata Architecture

Every Iceberg table has a three-level metadata hierarchy:

Table metadata file: the entry point, a JSON file that points to the current snapshot and all historical snapshots. Each commit writes a new metadata file, and the catalog pointer is swapped to it atomically.
Manifest list: one per snapshot, lists all the manifest files that together describe the snapshot's data files, with a partition summary for each manifest.
Manifest files: each manifest file lists a subset of the data (Parquet) files, along with column-level statistics: min/max per column, null count, row count. These statistics are what enable file-level pruning without reading any data files.

flowchart TD
    Catalog["Iceberg Catalog\n(Glue / Nessie / REST)\nCurrent metadata pointer"] --> MetaJSON

    subgraph Metadata["Metadata Layer (JSON + Avro files in S3)"]
        MetaJSON["table-metadata-v3.json\nSchema, partition spec, snapshots list"]
        ManifestList["manifest-list-snap-001.avro\nOne entry per manifest file"]
        Manifest1["manifest-file-A.avro\nData files + column stats\n(partition group A)"]
        Manifest2["manifest-file-B.avro\nData files + column stats\n(partition group B)"]
        MetaJSON --> ManifestList
        ManifestList --> Manifest1
        ManifestList --> Manifest2
    end

    subgraph Data["Data Layer (Parquet files)"]
        P1["data-000001.parquet"]
        P2["data-000002.parquet"]
        P3["data-000003.parquet"]
        P4["data-000004.parquet"]
        Manifest1 --> P1 & P2
        Manifest2 --> P3 & P4
    end

    subgraph Deletes["Delete Files (v2 spec)"]
        PD["position-delete-file.parquet\nfile path + row offset"]
        ED["equality-delete-file.parquet\ncolumn value = deleted row id"]
        Manifest1 -.->|"associated"| PD
        Manifest2 -.->|"associated"| ED
    end

Iceberg's three-level metadata hierarchy. Queries read only the manifests whose partition summaries overlap the query predicate, then use per-file column statistics to skip data files that cannot match. Delete files (spec v2) enable merge-on-read without rewriting data.

The critical optimization here is that readers can prune at every level. If your query filters on event_date = '2025-06-01', Iceberg uses the partition summaries in the manifest list to skip entire manifests, then checks per-file column statistics to skip individual Parquet files. For a well-partitioned table, a query touching one day of data might read 0.01% of the physical files. This is called manifest-level pruning, and it's why Iceberg queries are often dramatically faster than equivalent Hive queries on the same data.
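The two pruning levels can be sketched in a few lines of Python. This is a toy model, not Iceberg's reader: the `Manifest` and `DataFile` classes, the `plan_files` function, and the file names are all made up for illustration, and it assumes the table is partitioned on the same `event_date` column the query filters on.

```python
from dataclasses import dataclass, field


@dataclass
class DataFile:
    path: str
    # Per-column (min, max) bounds, as a manifest entry would record them.
    bounds: dict[str, tuple[str, str]]


@dataclass
class Manifest:
    path: str
    # Range of partition values covered by this manifest's files
    # (hypothetical stand-in for the manifest list's partition summary).
    partition_range: tuple[str, str]
    files: list[DataFile] = field(default_factory=list)


def overlaps(bounds: tuple[str, str], value: str) -> bool:
    """True if value could appear somewhere with these (min, max) bounds."""
    low, high = bounds
    return low <= value <= high


def plan_files(manifests: list[Manifest], column: str, value: str) -> list[str]:
    """Return the data files that might contain rows where column == value."""
    selected = []
    for manifest in manifests:
        # Level 1: skip whole manifests using the partition summary.
        if not overlaps(manifest.partition_range, value):
            continue
        for data_file in manifest.files:
            # Level 2: skip individual files using column statistics.
            if overlaps(data_file.bounds[column], value):
                selected.append(data_file.path)
    return selected


manifests = [
    Manifest("manifest-A.avro", ("2025-05-01", "2025-05-31"), [
        DataFile("data-000001.parquet", {"event_date": ("2025-05-01", "2025-05-15")}),
        DataFile("data-000002.parquet", {"event_date": ("2025-05-16", "2025-05-31")}),
    ]),
    Manifest("manifest-B.avro", ("2025-06-01", "2025-06-30"), [
        DataFile("data-000003.parquet", {"event_date": ("2025-06-01", "2025-06-01")}),
        DataFile("data-000004.parquet", {"event_date": ("2025-06-02", "2025-06-30")}),
    ]),
]

print(plan_files(manifests, "event_date", "2025-06-01"))
```

Only one of the four files survives planning here, and no Parquet file was opened to find that out. ISO-formatted date strings are used so that plain string comparison orders them correctly.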

Concurrency Control: Optimistic and Atomic

Iceberg's concurrency model is optimistic: multiple writers can prepare commits simultaneously, and the commit itself is a single atomic CAS (compare-and-swap) operation on the catalog entry, in effect "change the current metadata pointer from version N to version N+1." If two writers try to commit simultaneously, one wins and one retries. On S3 with a catalog like AWS Glue or Project Nessie, this is both safe and lock-free. On S3 without a catalog service (using the legacy file-system-based catalog), the commit depends on storage-level atomicity, so it requires careful configuration, though it remains safer than Hive.
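To make the commit protocol concrete, here is a minimal sketch of optimistic commits against a compare-and-swap pointer. The `InMemoryCatalog` class and `commit_with_retry` function are hypothetical, not any real catalog's API, and the internal lock exists only to make the toy swap atomic (a real catalog gets that guarantee from its backing store).

```python
import threading


class InMemoryCatalog:
    """Toy stand-in for a catalog holding one metadata pointer per table."""

    def __init__(self, initial_version: int = 0):
        self._version = initial_version
        self._lock = threading.Lock()  # makes the swap itself atomic

    def current_version(self) -> int:
        return self._version

    def compare_and_swap(self, expected: int, new: int) -> bool:
        """Set the pointer to `new` only if it still equals `expected`."""
        with self._lock:
            if self._version != expected:
                return False  # someone else committed first
            self._version = new
            return True


def commit_with_retry(catalog: InMemoryCatalog, max_attempts: int = 5) -> int:
    """Optimistic commit: prepare against a base version, then try to swap."""
    for _ in range(max_attempts):
        base = catalog.current_version()
        # A real writer would write data files and new metadata here, and
        # re-validate its changes against the refreshed base on each retry.
        if catalog.compare_and_swap(base, base + 1):
            return base + 1
    raise RuntimeError("commit failed after retries")


catalog = InMemoryCatalog()
writers = [
    threading.Thread(target=commit_with_retry, args=(catalog, 50))
    for _ in range(8)
]
for writer in writers:
    writer.start()
for writer in writers:
    writer.join()

print(catalog.current_version())
```

Every writer eventually lands exactly one commit, and a writer holding a stale base version is rejected rather than silently overwriting someone else's work.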

Iceberg Spec v2 added row-level deletes through delete files rather than data file rewrites. When you delete a row, Iceberg writes a small "delete file" that records which rows to ignore during reads. This is a merge-on-read approach: readers apply the deletes at query time, which is slightly more CPU work but vastly cheaper than rewriting a 10 GB Parquet file to remove three rows. Spec v3 (in progress as of 2025) replaces position delete files with deletion vectors — compact bitmaps representing deleted row positions — borrowed from Delta Lake's design.
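A rough sketch of what merge-on-read means for the reader, under simplifying assumptions: one data file, rows addressed by position, and a plain Python integer standing in for the compressed bitmap a real deletion vector would use. The function names are invented for this example.

```python
def read_with_position_deletes(rows, deleted_positions):
    """Yield rows from one data file, skipping positions marked as deleted."""
    for position, row in enumerate(rows):
        if position not in deleted_positions:
            yield row


def to_deletion_vector(deleted_positions):
    """Pack deleted positions into an integer used as a bitmap."""
    vector = 0
    for position in deleted_positions:
        vector |= 1 << position
    return vector


def read_with_deletion_vector(rows, vector):
    """Same read path, but testing one bit per row instead of a set lookup."""
    for position, row in enumerate(rows):
        if not (vector >> position) & 1:
            yield row


data_file = ["alice", "bob", "carol", "dave"]   # never rewritten
deletes = {1, 3}                                # the small "delete file"

print(list(read_with_position_deletes(data_file, deletes)))
print(list(read_with_deletion_vector(data_file, to_deletion_vector(deletes))))
```

Both reads return the same surviving rows. The data file itself is untouched, which is the whole point: the delete costs a few bytes of metadata now and a little filtering work on every read until compaction rewrites the file.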

Delta Lake: The Databricks Native

Delta Lake was open-sourced by Databricks in 2019. If Iceberg is spec-first, Delta is implementation-first: it was built to work brilliantly with Spark, and it does. The trade-off is that Delta's design decisions reflect Spark's architecture in ways that occasionally show when using other engines.

Transaction Log Architecture

Delta Lake's metadata is a transaction log stored in _delta_log/ at the table root. Each committed transaction writes a JSON file named for its version number, zero-padded to 20 digits: 00000000000000000000.json, 00000000000000000001.json, and so on. These files record what was added, what was removed, and what statistics were gathered. Every 10 commits (by default), Delta writes a Parquet checkpoint file that consolidates all preceding JSON log entries, because reading the checkpoint is much faster than replaying thousands of individual JSON files.

This design is elegantly simple and gives Delta native time travel: to read the table at any point in history, just replay the log up to that transaction. The downside is the JSON log files themselves — on high-throughput tables with thousands of small transactions, the log can grow into tens of thousands of files, making log replay slow and increasing the metadata overhead. Aggressive checkpointing and log compaction are operational necessities for busy Delta tables.
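Log replay is simple enough to sketch directly. The version below is a simplified model, not Delta's reader: it only understands `add` and `remove` actions, keeps the log in a dictionary instead of files, and treats a checkpoint as a (version, live files) pair. The `replay` function and file names are illustrative.

```python
import json


def replay(log_entries, up_to_version=None, checkpoint=None):
    """Rebuild the set of live data files from add/remove actions.

    log_entries maps a commit version to its list of JSON action lines.
    checkpoint is an optional (version, live_files) pair to start from.
    """
    start, live = 0, set()
    # Only use the checkpoint if it is not newer than the requested version.
    if checkpoint and (up_to_version is None or checkpoint[0] <= up_to_version):
        start, live = checkpoint[0] + 1, set(checkpoint[1])
    for version in sorted(log_entries):
        if version < start:
            continue  # already folded into the checkpoint
        if up_to_version is not None and version > up_to_version:
            break     # time travel: stop at the requested version
        for line in log_entries[version]:
            action = json.loads(line)
            if "add" in action:
                live.add(action["add"]["path"])
            elif "remove" in action:
                live.discard(action["remove"]["path"])
    return live


log = {
    0: ['{"add": {"path": "part-a.parquet"}}', '{"add": {"path": "part-b.parquet"}}'],
    1: ['{"remove": {"path": "part-a.parquet"}}', '{"add": {"path": "part-c.parquet"}}'],
    2: ['{"add": {"path": "part-d.parquet"}}'],
}
checkpoint_v1 = (1, {"part-b.parquet", "part-c.parquet"})

print(sorted(replay(log)))                            # current table state
print(sorted(replay(log, up_to_version=0)))           # time travel to version 0
print(sorted(replay(log, checkpoint=checkpoint_v1)))  # skip versions 0 and 1
```

The checkpoint path reads one entry instead of three to reach the same answer, which is the saving that matters once the log holds thousands of commits.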

The S3 Locking Problem

Here's the thing nobody mentions until you're in production: Delta Lake on S3 requires a locking service for safe concurrent writes from multiple clusters. S3 doesn't provide atomic rename operations (unlike HDFS), so Delta uses an external lock table (DynamoDB, in the open-source S3 multi-cluster setup) to serialize concurrent writers. Without this, two simultaneous writes can corrupt the log.

The hidden operational cost: Every Delta write on S3 touches DynamoDB for lock acquisition and release. At 100 writes/second, that's 200 DynamoDB operations/second per Delta table — plus the write units for the log entries themselves. On very high-throughput tables, DynamoDB costs can rival storage costs. Iceberg avoids this with its CAS-based catalog model; Hudi uses its own optimistic concurrency mechanism. Plan your cost model accordingly.

Delta Lake's strength is Spark integration depth. Auto-optimize, auto-compaction, Z-Order clustering, data skipping via column statistics, VACUUM for old snapshot cleanup — all of these are polished and battle-tested. If your entire stack is Databricks, Delta Lake is the obvious choice. You get excellent performance, great tooling, and a team with billions of investment dollars maintaining it.

Apache Hudi: Updates as a First-Class Citizen

Hudi (Hadoop Upserts Deletes and Incrementals) was built at Uber in 2016, open-sourced in 2017, and entered the Apache Incubator in 2019. While Iceberg and Delta started from the "how do we make big reads fast?" angle, Hudi started from "how do we update individual records efficiently?" because Uber had 10 billion trip records and needed GDPR deletes to complete in minutes, not days.

Copy-on-Write vs Merge-on-Read

Hudi gives you an explicit choice between two write modes:

Copy-on-Write (CoW): When you update or delete a row, Hudi rewrites the entire Parquet file containing that row. Reads are fast — there are no delta files to merge. Writes are expensive because file rewrites are expensive. Good for tables with infrequent updates and read-heavy workloads.
Merge-on-Read (MoR): Updates are written to small Avro delta log files alongside the base Parquet files. Reads merge the base files with the delta logs at query time. Writes are fast; reads are slightly slower and more complex. Good for tables with frequent updates and near-real-time ingestion requirements.
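The MoR read path can be sketched as a keyed merge, assuming each record carries a record key and that log records are applied in commit order. This is an illustration rather than Hudi's implementation: the `trip_id` key, the `_deleted` flag, and both function names are hypothetical.

```python
def merge_on_read(base_rows, log_records, key="trip_id"):
    """Merge a base file with delta log records; the latest record per key wins."""
    merged = {row[key]: row for row in base_rows}
    for record in log_records:  # applied in commit order
        if record.get("_deleted"):
            merged.pop(record[key], None)
        else:
            merged[record[key]] = {k: v for k, v in record.items() if k != "_deleted"}
    return sorted(merged.values(), key=lambda row: row[key])


def compact(base_rows, log_records, key="trip_id"):
    """Fold the log into a new base file so later reads skip the merge."""
    return merge_on_read(base_rows, log_records, key), []


base_file = [{"trip_id": 1, "fare": 10.0}, {"trip_id": 2, "fare": 20.0}]
delta_log = [
    {"trip_id": 2, "fare": 25.0},        # update
    {"trip_id": 3, "fare": 7.5},         # insert
    {"trip_id": 1, "_deleted": True},    # delete
]

print(merge_on_read(base_file, delta_log))

new_base, new_log = compact(base_file, delta_log)
print(new_base == merge_on_read(new_base, new_log))
```

The write side only ever appended three small records, and the cost moved to the reader. Compaction is what pays that cost back: it produces a new base file with an empty log, which is effectively the CoW result computed later and in bulk.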

The MoR approach is where Hudi genuinely shines over competitors. For CDC (Change Data Capture) ingestion from transactional databases — think Debezium writing Kafka events, Hudi consuming them in micro-batches — MoR enables near-real-time landing of updates with manageable write amplification. Uber, Amazon, Walmart, and Robinhood all run Hudi at scale for exactly this pattern.

Hudi 1.0 (released December 2024) added native Iceberg output: a Hudi table can now expose itself as an Iceberg table via catalog integration, giving you Hudi's write performance with Iceberg's read ecosystem compatibility. This is a significant architectural convergence that's worth tracking.

Head-to-Head Comparison

Dimension	Apache Iceberg	Delta Lake	Apache Hudi
Origin	Netflix (2018)	Databricks (2019)	Uber (2016, OSS 2017)
Metadata format	JSON + Avro manifests	JSON log + Parquet checkpoints	Timeline (JSON) + Avro logs
Concurrency on S3	CAS via catalog, lock-free	Requires external lock (DynamoDB)	Optimistic concurrency control (OCC)
Row-level deletes	Delete files (spec v2) / DVs (v3)	Deletion vectors (Delta 3.0+)	Native MoR delta logs
Read performance	Excellent (manifest pruning)	Excellent (Z-Order, data skipping)	Good CoW; MoR adds merge cost
Write performance	Good (copy-on-write by default)	Good (auto-optimize helps)	Excellent for updates (MoR)
Engine support	Spark, Flink, Trino, DuckDB, Snowflake, BigQuery	Spark, Flink, Trino (via Delta connector)	Spark, Flink (Trino support improving)
Schema evolution	Full: add/rename/reorder/widen	Add columns, limited other changes	Add columns, evolution with caveats
Governance / catalog	REST catalog, Nessie, Unity, Polaris	Unity Catalog (Databricks), HMS	Hive Metastore (HMS)
Industry momentum 2025	Winning new adoptions; AWS/GCP/Snowflake native	Dominant at existing Databricks customers	Strong in CDC/streaming use cases

How They Fit Into a Modern Data Stack

flowchart LR
    subgraph Sources["Data Sources"]
        DB["Transactional DB\n(Postgres/MySQL)"]
        Events["Event Streams\n(Kafka)"]
        APIs["SaaS APIs\n(Salesforce, etc.)"]
    end

    subgraph Ingest["Ingestion Layer"]
        Debezium["Debezium CDC"]
        Flink["Apache Flink\nStreaming"]
        Airbyte["Airbyte / Fivetran\nBatch"]
    end

    subgraph Lake["Data Lake (Object Storage)"]
        Bronze["Bronze Layer\nRaw / append-only\nHudi MoR or Iceberg"]
        Silver["Silver Layer\nCleaned + conformed\nIceberg or Delta"]
        Gold["Gold Layer\nAggregate / serving\nIceberg or Delta"]
    end

    subgraph Compute["Query / Transform"]
        Spark["Apache Spark"]
        Trino["Trino / Athena"]
        dbt["dbt (incremental models)"]
        Airflow["Airflow DAGs\n(orchestration)"]
    end

    subgraph Serve["Serving / Consumers"]
        BI["BI Tools\n(Power BI, Tableau)"]
        DS["Data Science\n(Notebooks)"]
        Apps["Operational\nApplications"]
    end

    DB --> Debezium --> Flink --> Bronze
    Events --> Flink
    APIs --> Airbyte --> Bronze
    Bronze --> Silver --> Gold
    Spark & Trino & dbt --> Silver & Gold
    Airflow -.->|"orchestrates"| dbt & Spark
    Gold --> BI & DS & Apps

Open table formats as the persistence layer across a medallion architecture. Hudi's MoR excels at the Bronze ingest layer for CDC workloads; Iceberg and Delta are typically preferred for Silver/Gold where read performance dominates.

Real-Time and Streaming: The Flink Story

One of the more exciting developments in 2024–2025 is first-class Flink support for all three formats. Flink can write to Iceberg, Delta, and Hudi tables in mini-batch mode (every 1–5 minutes), effectively bringing near-real-time data into analytical tables without a separate Lambda architecture.

The streaming write story differs between formats:

Iceberg + Flink: Flink uses Iceberg's streaming writer API that batches records into row groups and commits snapshots atomically. Kafka topic → Iceberg table with 60-second latency is a standard production pattern at companies like Netflix and Apple. The Iceberg spec's multi-table transaction support (in progress) will enable atomic cross-table commits — important for maintaining referential consistency in streaming ETL.
Delta + Flink: The Flink-Delta connector is maintained by Delta's open-source contributors. It works well but has historically lagged behind the Spark connector in feature parity. In 2025, the connector has caught up significantly. It's still worth validating specific features (change data feed, schema evolution) before committing to this stack.
Hudi + Flink: This is arguably Hudi's strongest story. Hudi's MoR mode was designed for streaming upserts, and Flink + Hudi is a mature, production-tested combination used by Uber and others. The Hudi Flink writer handles late-arriving records, deduplication, and partition management automatically — things you'd otherwise build yourself.

Confluent's Tableflow service (launched 2024) automatically mirrors Kafka topics to Iceberg tables, managed entirely in the Confluent Cloud platform. For teams already on Confluent, this eliminates the operational burden of managing Flink streaming jobs just to land events in object storage.

Cloud Platform Integration

AWS

S3 Tables (2024): Native Iceberg tables managed in S3, accessed via standard Iceberg REST API. Includes automatic compaction, snapshot management, optimistic concurrency — no DynamoDB lock service needed.
Glue Data Catalog: Iceberg catalog for cross-service access (Athena, EMR, Glue ETL). Delta support via Glue connector with DynamoDB locking.
Athena: Native Iceberg reads with partition pruning and time travel. Write support via CTAS and INSERT INTO.
EMR: All three formats supported; Iceberg and Delta have first-class connectors in EMR 6.x+.

Azure / Microsoft Fabric

ADLS Gen2: Underlying storage for all formats. Delta Lake is the native format for Microsoft Fabric (OneLake).
Microsoft Fabric: Lakehouse uses Delta Parquet natively. Iceberg support via shortcuts to external ADLS paths.
Azure Databricks: Full Delta Lake + Unity Catalog stack; best-in-class Delta experience on Azure.
Azure Synapse Analytics: Delta and Parquet support via Spark pools; Iceberg support in external tables.

GCP

BigQuery managed Iceberg (GA 2024): Iceberg tables natively in BigQuery, read via BigQuery SQL and external Iceberg engines. Includes auto-compaction and lifecycle management.
Dataproc: Spark on GCS; all three formats supported. Iceberg Catalog via BigQuery Metastore.
Dataplex: Data governance layer with Iceberg table discovery and lineage tracking.
Cloud Storage: Standard GCS buckets as data lake storage for all formats.

Databricks & Snowflake

Databricks + Unity Catalog: Delta Lake as primary format; Iceberg read/write support via Delta UniForm (a Delta table that exposes Iceberg metadata). Tabular acquisition ($1B+, 2024) brought Iceberg catalog leadership in-house.
Snowflake Iceberg Tables (GA 2024): Iceberg tables with Snowflake as external catalog. Customer-managed storage in S3/GCS/ADLS; Snowflake provides the compute and catalog. Cost model differs from managed Snowflake tables.
Snowflake + dbt: Iceberg tables as dbt incremental model targets. Requires configuring the Iceberg table properties in dbt model configs.

dbt Integration: Incremental Models on Open Table Formats

dbt's incremental model strategy maps directly to the open table format's merge/upsert capability. On Iceberg and Delta Lake, you can use the merge incremental strategy to upsert new and changed rows efficiently:

-- models/silver/customer_profiles.sql
-- write.target-file-size-bytes below is 134217728 bytes (128 MB).
{{
  config(
    materialized='incremental',
    incremental_strategy='merge',
    unique_key='customer_id',
    file_format='iceberg',
    partition_by=[{'field': 'updated_date', 'data_type': 'date'}],
    properties={
      "write.target-file-size-bytes": "134217728",
      "write.parquet.compression-codec": "zstd"
    }
  )
}}

select
    customer_id,
    email,
    country,
    cast(updated_at as date) as updated_date,
    updated_at
from {{ source('bronze', 'customers_raw') }}

{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

The merge strategy on Iceberg generates a proper MERGE INTO SQL statement that the Iceberg engine executes atomically. This is dramatically better than the append + deduplication pattern that pre-Iceberg dbt users had to implement manually. For Hudi, dbt support comes through the dbt-spark adapter's hudi file format rather than a dedicated adapter, and it's still less widely used than dbt-spark with Iceberg/Delta.

dbt + Iceberg small files problem: dbt generates one file per dbt run partition unless you configure compaction. After 30 days of daily runs, a table partitioned by date will have 30 small files per date partition — exactly the small files problem Iceberg is designed to solve, recreated by your orchestration pattern. Set up a compaction job (via CALL iceberg.system.rewrite_data_files() in Spark SQL or use Flink's background compaction) to coalesce files after dbt runs. Aim for 128 MB–512 MB Parquet files; smaller is a query performance tax.

Common Problems and How to Avoid Them

1. Small Files Proliferation

This is the most common operational problem with all three formats. Streaming writers, frequent micro-batch commits, and partition-per-day strategies all generate small files. A table with 100,000 files of 1 MB each reads slower than the same data in 100 files of 1 GB each, because metadata enumeration overhead dominates.

Fix: Schedule regular compaction. For Iceberg: CALL iceberg.system.rewrite_data_files(table => 'db.table', strategy => 'sort', sort_order => 'zorder(user_id, event_date)'). For Delta: enable spark.databricks.delta.autoCompact.enabled = true. For Hudi: configure hoodie.compact.inline=true for MoR tables. Aim to run compaction after every 10–20 streaming commits.
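Under the hood, a compaction job mostly has to decide which small files to rewrite together. Here is a simplified bin-packing sketch of that planning step. It is not the algorithm any of the three formats actually ships; `plan_compaction` and the file names are made up, and the 128 MB target is taken from the sizing guidance in this article.

```python
MB = 1024 * 1024


def plan_compaction(file_sizes, target_bytes=128 * MB):
    """Group small files into rewrite tasks of at most target_bytes each."""
    groups, current, current_size = [], [], 0
    # Largest first, so big files are not stranded behind many tiny ones.
    ordered = sorted(file_sizes.items(), key=lambda item: item[1], reverse=True)
    for path, size in ordered:
        if size >= target_bytes:
            continue  # already large enough, leave it alone
        if current and current_size + size > target_bytes:
            groups.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a single file on its own gains nothing, so drop those groups.
    return [group for group in groups if len(group) > 1]


# Thirty daily runs, each leaving one 8 MB file in the same partition.
daily_files = {f"run-{day:02d}.parquet": 8 * MB for day in range(1, 31)}
daily_files["already-big.parquet"] = 256 * MB

plan = plan_compaction(daily_files)
print([len(group) for group in plan])
```

Thirty 8 MB files collapse into two rewrite tasks (sixteen files filling one 128 MB output, the remaining fourteen going into a second), and the file that already exceeds the target is never touched.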

2. Too Many Snapshots / Log Growth

Every commit creates a new snapshot (Iceberg/Hudi) or log entry (Delta). Without cleanup, metadata can grow to millions of files and metadata reads become the bottleneck. This is particularly severe on Iceberg tables with frequent streaming writes — after a week of 5-minute commits, you have 2,016 snapshots and their associated manifest files.

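Fix: Run expire_snapshots regularly. For Iceberg: CALL iceberg.system.expire_snapshots(table => 'db.table', older_than => TIMESTAMP '2025-05-01 00:00:00'). For Delta: VACUUM delta.`/path/to/table` RETAIN 168 HOURS removes data files that are no longer referenced within the retention window, while old log entries are cleaned up at checkpoint time according to the table's log retention setting. Keep at least 7 days of history for time travel; expire beyond that.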
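The snapshot arithmetic and the retention rule are easy to check with a short sketch. The helper names here are invented, and `keep_last` is this example's own safety parameter for always retaining the newest snapshots, not a claim about any engine's exact option name.

```python
from datetime import datetime, timedelta


def snapshots_per_week(commit_interval_minutes):
    """How many snapshots a steady commit cadence produces in seven days."""
    return 7 * 24 * 60 // commit_interval_minutes


def select_expired(snapshot_times, now, retain=timedelta(days=7), keep_last=1):
    """Return snapshots older than the retention window, newest ones protected."""
    ordered = sorted(snapshot_times)
    protected = set(ordered[-keep_last:]) if keep_last else set()
    cutoff = now - retain
    return [ts for ts in ordered if ts < cutoff and ts not in protected]


print(snapshots_per_week(5))

# Two weeks of 5-minute commits, ending just before "now".
now = datetime(2025, 5, 8)
start = now - timedelta(days=14)
history = [start + timedelta(minutes=5 * i) for i in range(2 * snapshots_per_week(5))]

expired = select_expired(history, now)
print(len(history), len(expired), len(history) - len(expired))
```

With a 7-day window, half of the two-week history is eligible for expiry and the other half stays available for time travel. The `keep_last` guard matters on quiet tables: without it, a table that has not been written to for a week could have its only snapshot selected.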

3. Partition Evolution Nightmares

Changing partition strategy mid-table is painful in all formats. If you start with daily partitioning and later need hourly, you have two options: rewrite the entire table (expensive) or live with mixed partition granularity (confusing for query planners). Iceberg's partition evolution (spec-level feature) handles this more gracefully than Delta or Hudi — it tracks which partition spec each data file was written with, so the reader knows how to apply each file's partition information. But it still generates operational complexity.
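The idea of tracking a partition spec per data file can be sketched as follows. This is a toy model under stated assumptions: spec 0 is daily and spec 1 is hourly, transforms are simple string truncations of an ISO timestamp, and the `TRANSFORMS` table, `DataFile` class, and `files_for_timestamp` function are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical transforms keyed by partition spec id.
TRANSFORMS = {
    0: lambda ts: ts[:10],   # daily:  "2025-06-01T14:30:00" -> "2025-06-01"
    1: lambda ts: ts[:13],   # hourly: "2025-06-01T14:30:00" -> "2025-06-01T14"
}


@dataclass
class DataFile:
    path: str
    spec_id: int           # which partition spec the file was written with
    partition_value: str   # the value that spec's transform produced


def files_for_timestamp(files, ts):
    """Select files whose partition could contain ts, using each file's own spec."""
    return [
        f.path for f in files
        if TRANSFORMS[f.spec_id](ts) == f.partition_value
    ]


files = [
    DataFile("old-daily-0531.parquet", 0, "2025-05-31"),
    DataFile("old-daily-0601.parquet", 0, "2025-06-01"),
    DataFile("new-hourly-14.parquet", 1, "2025-06-01T14"),
    DataFile("new-hourly-15.parquet", 1, "2025-06-01T15"),
]

print(files_for_timestamp(files, "2025-06-01T14:30:00"))
```

No old file was rewritten when the granularity changed. The price is visible too: the planner has to evaluate the predicate once per spec, and the daily file can only be pruned at day granularity, which is the mixed-granularity complexity the paragraph above warns about.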

Fix: Think carefully about partition strategy before writing the first row. For event data: partition by event_date (day granularity), not by timestamp. For dimension tables: don't partition at all unless the table exceeds 100 GB. Wrong partitioning decisions are expensive to undo.

4. Catalog Sprawl

Iceberg especially suffers from catalog fragmentation: you might have the same table registered in Glue, a Hive Metastore, and a REST catalog simultaneously — and they can drift. Table metadata updates through one catalog path won't be visible through another until resync.

Fix: Pick one catalog per environment and treat it as authoritative. In 2025, the REST catalog spec (Apache Polaris / Project Nessie / Databricks Unity Catalog) is becoming the standard interface. Federate access through a single catalog endpoint rather than registering tables in multiple systems.

Data Governance and Data Quality

Open table formats enable governance capabilities that were difficult or impossible with flat Hive tables:

Column-level lineage: Because every transaction is logged with file-level and column-level statistics, lineage tools (OpenLineage, Apache Atlas, DataHub) can track which source data contributed to which output table and which columns were involved. This is table-format-agnostic but works best with Iceberg's rich metadata.
Table versioning as audit trail: Iceberg's snapshot history gives you a full audit log of who wrote what data when. For regulated industries (financial services, healthcare), this is directly usable as a data change audit trail without additional tooling.
Schema enforcement: All three formats support schema enforcement on write — rejecting records that don't match the registered schema. This catches data quality issues at ingestion time rather than at query time. Combine with Great Expectations or dbt tests for column-level quality checks after each write.
GDPR right-to-erasure: Row-level deletes (Iceberg delete files, Delta deletion vectors, Hudi MoR logs) make GDPR-compliant deletion feasible at scale. Issue a DELETE statement targeting the customer's rows, then compact to physically remove the data. Track deletion jobs in your compliance system.

flowchart LR
    subgraph Governance["Data Governance Layer"]
        direction TB
        Catalog["Iceberg REST Catalog\n(Apache Polaris / Nessie)\nSchema registry + access control"]
        Lineage["OpenLineage Collector\nColumn-level lineage events"]
        DQ["Data Quality\n(Great Expectations / dbt tests)\nPost-write validation"]
        Audit["Snapshot History\nAudit trail: who wrote what, when"]
    end

    subgraph Platform["Data Platform"]
        Writers["Flink / Spark / dbt\nWrite transactions"] --> IceTable["Iceberg Table\n(manifests + Parquet)"]
        IceTable --> Readers["Athena / Trino / BI Tools\nRead via catalog"]
    end

    Writers -->|"lineage events"| Lineage
    Writers -->|"register schema"| Catalog
    IceTable -->|"post-write check"| DQ
    IceTable -->|"snapshot log"| Audit
    Catalog -->|"access policy"| Readers

Governance as a cross-cutting concern around the table format layer. The catalog enforces access control; lineage is emitted by writers; data quality runs post-commit; the snapshot log provides the audit trail.

The 2025 Industry Landscape: Who's Winning?

The honest answer is: Iceberg is winning for new projects, Delta Lake is deeply entrenched at existing Databricks customers, and Hudi maintains a strong position in CDC/streaming use cases.

The biggest recent signal was Databricks acquiring Tabular (the company founded by Iceberg's original authors at Netflix) in 2024 for over $1 billion. The stated goal was to bring Iceberg and Delta closer together through Delta UniForm, a format compatibility layer where a Delta table transparently exposes Iceberg metadata. Whether this convergence genuinely arrives or remains a marketing story is the most interesting question in the data engineering space right now.

AWS's launch of S3 Tables (native Iceberg with automatic management) is a strong signal that Amazon is betting on Iceberg as the default open format. GCP's BigQuery managed Iceberg, Snowflake's Iceberg Tables GA, and Confluent Tableflow all point in the same direction. For new cloud-native data stacks started in 2025, Iceberg is the default unless you have a specific reason to choose otherwise.

That said, "picking the winner" is somewhat academic if you're a Databricks-heavy shop. Delta Lake's performance, tooling, and Unity Catalog integration are genuinely excellent. The switching cost of migrating a 50-table Delta lakehouse to Iceberg is significant, and Delta UniForm is a plausible path to ecosystem interoperability without a rewrite.

How to Choose

New project on AWS, GCP, or multi-cloud: Apache Iceberg. Native support from every major cloud, engine, and tool. The REST catalog spec is maturing. You won't regret it.
Existing Databricks investment: Delta Lake. The tooling (auto-optimize, Z-Order, Unity Catalog) is excellent. Consider UniForm for cross-engine reads if Trino or non-Databricks Spark access is needed.
CDC / near-real-time upserts are your primary pattern: Apache Hudi MoR, especially with Flink. The architecture was designed for this. After Hudi 1.0, you can also expose Iceberg metadata for read compatibility.
Already on Snowflake: Snowflake Iceberg Tables give you the open format benefits (customer-managed storage, cross-engine access) while keeping Snowflake's SQL engine and governance. Useful for cost control on large cold datasets.

The worst mistake isn't picking the "wrong" format — it's not picking any format and defaulting back to raw Hive tables because the decision felt too hard. Raw Hive tables are why people end up with GDPR deletion backlogs, stale partition data, and incremental refresh pipelines held together with shell scripts. Any of the three formats is a vast improvement over the status quo.

Pick one. Understand its internals well enough to set up compaction and snapshot expiry properly. Run it for six months. Then you'll have informed opinions about where it falls short for your specific workload — and those opinions will be worth more than any comparison table including this one.