State of Data Engineering 2024: Open Catalogs, DuckDB Everywhere, and AI Infiltrates the Stack

2024 was the year that open won. Unity Catalog went open-source. Databricks acquired Tabular (the company founded by the original Iceberg creators from Netflix) and immediately committed to supporting Iceberg alongside Delta. Apache Iceberg's REST Catalog spec became the closest thing to an industry standard for interoperable table format metadata. After years of watching vendor lock-in consolidate around proprietary catalog implementations, 2024 felt like a genuine shift toward openness in data infrastructure.

It was also the year AI-assisted tooling went from "cool demo" to "actually changing how I work." Not because LLMs replaced data engineers — they didn't — but because AI-assisted SQL generation, dbt YAML generation, and pipeline debugging became genuinely useful parts of the daily workflow for engineers who adopted them.

The Tabular Acquisition: Databricks Bets on Iceberg (Too)

In June 2024, Databricks acquired Tabular for a reported price of between $1 billion and $2 billion, an extraordinary valuation for a company that had only raised $60M. Tabular was founded by Ryan Blue, Daniel Weeks, and Jason Reid, who had worked together at Netflix, where Blue and Weeks created Apache Iceberg. Their product was a managed Iceberg catalog and lakehouse service: essentially a neutral, open-source-first alternative to Databricks' and Snowflake's proprietary catalog offerings.

The strategic logic was clear: Databricks had bet on Delta Lake, but the market was increasingly choosing Iceberg for multi-engine scenarios. Rather than fight that trend, Databricks acquired the company most associated with Iceberg's success and committed to first-class Iceberg support in Unity Catalog, Databricks Runtime, and Delta UniForm (a compatibility layer that lets Delta tables be read as Iceberg).

Delta UniForm: Starting in Databricks Runtime 13.3, you can enable Iceberg metadata alongside Delta metadata on the same table. Trino, Presto, or Spark can read the same table as Iceberg while Databricks treats it as Delta. It's not zero-overhead — there's a metadata sync cost on each commit — but it's a pragmatic solution to the format fragmentation problem for multi-engine shops.
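To make that concrete, here is a minimal sketch of what turning UniForm on for an existing table looks like. It only builds the ALTER TABLE statement; it does not talk to a cluster. The two property names follow Databricks' UniForm documentation as I understand it, while the helper function and the table name main.sales.orders are hypothetical. Prerequisites (column mapping, minimum runtime) vary by version, so treat this as an assumption to verify against the docs for your runtime.

```python
def uniform_enable_sql(table_name: str) -> str:
    """Build an ALTER TABLE statement that asks Delta to also write Iceberg metadata.

    The helper is illustrative only; the property names are taken from the
    UniForm documentation and should be checked against your runtime version.
    """
    properties = {
        "delta.enableIcebergCompatV2": "true",
        "delta.universalFormat.enabledFormats": "iceberg",
    }
    # Render each property as 'key' = 'value', one per line.
    props_sql = ",\n  ".join(
        f"'{key}' = '{value}'" for key, value in properties.items()
    )
    return f"ALTER TABLE {table_name} SET TBLPROPERTIES (\n  {props_sql}\n)"


# Hypothetical three-level table name.
print(uniform_enable_sql("main.sales.orders"))
```

You would run the resulting statement from a notebook or SQL editor. After that, each Delta commit also produces Iceberg metadata, which is where the sync cost mentioned above comes from.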

Unity Catalog Goes Open-Source

June 2024: Databricks open-sourced Unity Catalog under the Apache 2.0 license, announcing it at the Data + AI Summit. The open-source UC exposes a REST API for catalog operations, making it possible to use UC as a governance layer for Spark, Trino, DuckDB, and other engines without a Databricks license.

This was significant for two reasons. First, it meant the Unity Catalog data model (metastore → catalog → schema → table, with ABAC and column-level security) could become an industry standard rather than a vendor moat. Second, it enabled hybrid architectures: a company could run open-source UC for catalog metadata while using Databricks for heavy Spark workloads and Trino for interactive queries — all governed by the same catalog.
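The three-level namespace is simple enough to sketch. The snippet below parses a catalog.schema.table name and builds the URL a client might use to look the table up on an open-source UC server. The TableRef class and table_url helper are hypothetical, and the /api/2.1/unity-catalog/tables/ path and localhost:8080 address are my assumptions about a default local server, so check them against the API spec for the version you run. Nothing here makes a network call.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class TableRef:
    """A fully qualified table name in the catalog -> schema -> table hierarchy."""

    catalog: str
    schema: str
    table: str

    @classmethod
    def parse(cls, full_name: str) -> "TableRef":
        parts = full_name.split(".")
        # Require exactly three non-empty parts.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected catalog.schema.table, got {full_name!r}")
        return cls(*parts)

    @property
    def full_name(self) -> str:
        return f"{self.catalog}.{self.schema}.{self.table}"


def table_url(base_url: str, ref: TableRef) -> str:
    # Assumed endpoint layout; confirm against your server's API spec.
    path = f"/api/2.1/unity-catalog/tables/{quote(ref.full_name)}"
    return base_url.rstrip("/") + path


ref = TableRef.parse("main.sales.orders")
print(table_url("http://localhost:8080", ref))
```

The point of the sketch is that every engine resolves the same fully qualified name against the same service, which is what makes the hybrid setups described above possible.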

The open-source UC at launch was behind the managed service in features (no lineage capture, limited ABAC). But the direction was clearly toward convergence.

DuckDB 1.0: Graduation Day for In-Process Analytics

June 2024: DuckDB released version 1.0.0. The stability guarantee that came with 1.0 was important for production adoption. DuckDB's usage patterns by the end of 2024:

  • Local development: Standard practice for dbt development — run transformations against a sample dataset locally with zero cloud costs before pushing to Snowflake/BigQuery
  • Embedded analytics: Applications embedding DuckDB to query Parquet files on S3 directly, without a separate data warehouse
  • MotherDuck: Managed DuckDB with serverless scaling and notebook interface. Serious option for teams with <500GB of data who don't need Snowflake's scale
  • ETL pipelines: dbt-duckdb as a lightweight transformation engine for pipelines that don't need distributed compute
  • Data science: DuckDB as a SQL engine over pandas, Polars, and PyArrow data held in memory, often dramatically faster than pandas for aggregation workloads

DuckDB didn't replace cloud data warehouses in 2024. But it lowered the minimum complexity you need for analytics workloads below a certain scale. If your data fits in a few hundred gigabytes, there's a serious question about whether you need Snowflake at all.

AI-Assisted SQL and the "Natural Language to Data" Pipeline

Every major BI and data catalog tool shipped AI features in 2024. Tableau Pulse, Looker Conversational Analytics, Power BI's Copilot, Snowflake Cortex Analyst, and Gemini in BigQuery (formerly Duet AI) all took slightly different approaches to the same problem: making data accessible to non-technical users through natural language.

The honest assessment at year end: AI-assisted SQL was genuinely useful for simple analytical queries (aggregations, filters, basic joins) and significantly less reliable for complex multi-table queries, time-series analysis, or anything requiring domain-specific business logic. The failure mode was subtle — the generated SQL often looked plausible but was semantically wrong in ways that were hard for non-technical users to catch. Semantic layers (dbt Semantic Layer, Cube.js) were critical as a guardrail: if you forced LLM-generated queries to go through a validated metric definition layer, the accuracy improved dramatically.
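Here is a toy version of that guardrail, to show why it helps. The model is never allowed to write SQL. It emits only a structured request (a metric name plus dimensions), and the SQL is compiled from a vetted definition. Everything in the sketch is hypothetical: the Metric class, the METRICS registry, and the fct_orders table are stand-ins, and this is not the dbt Semantic Layer or Cube API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str        # vetted aggregate, written once by a data engineer
    table: str
    dimensions: frozenset  # columns this metric may be grouped by


# Hypothetical registry of validated metric definitions.
METRICS = {
    "revenue": Metric(
        name="revenue",
        expression="SUM(amount)",
        table="fct_orders",
        dimensions=frozenset({"order_date", "region"}),
    ),
}


def compile_query(metric_name: str, group_by: list) -> str:
    """Compile SQL from a vetted metric; reject anything outside the registry."""
    metric = METRICS.get(metric_name)
    if metric is None:
        raise ValueError(f"unknown metric: {metric_name!r}")
    unknown = [dim for dim in group_by if dim not in metric.dimensions]
    if unknown:
        raise ValueError(f"dimensions not allowed for {metric_name!r}: {unknown}")
    select_cols = ", ".join([*group_by, f"{metric.expression} AS {metric.name}"])
    sql = f"SELECT {select_cols} FROM {metric.table}"
    if group_by:
        sql += " GROUP BY " + ", ".join(group_by)
    return sql


# The LLM produces only this structured request, never raw SQL.
llm_request = {"metric": "revenue", "group_by": ["region"]}
print(compile_query(llm_request["metric"], llm_request["group_by"]))
```

A request for a metric or dimension that isn't in the registry fails loudly instead of producing plausible-looking but wrong SQL, which is the failure mode described above.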

Streaming Lakehouse: The Architecture Matures

The "streaming lakehouse" pattern — combining a streaming ingestion layer (Kafka, Kinesis, Flink) with an open table format that supports upserts and time travel — became a more mature and widely deployed architecture in 2024. Apache Flink's SQL surface and connector ecosystem improved substantially. Confluent's Tableflow feature (streaming Kafka topic data directly into Iceberg tables) removed the custom Flink job between Kafka and storage for many use cases.

The practical stack that emerged for streaming-first architectures:

  • Kafka (MSK, Confluent, Redpanda) for streaming ingestion
  • Flink SQL for stateful stream processing and CDC processing
  • Iceberg or Delta (with MERGE INTO / upsert support) as the storage layer
  • dbt for batch transformation on top of the landing tables
  • Unity Catalog or AWS Glue for catalog and governance
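The upsert requirement on the storage layer is the piece that trips people up, so here is a plain-Python model of what a MERGE INTO does with a batch of CDC events. It is a sketch of the semantics only, not how Flink, Iceberg, or Delta implement it. The event shape (op, id, ts, row) and the apply_cdc_batch function are hypothetical.

```python
def apply_cdc_batch(table: dict, events: list) -> dict:
    """Apply change events to a keyed table, mimicking MERGE INTO upsert semantics.

    Returns a new dict and leaves the input untouched, loosely analogous to a
    table format writing a new snapshot rather than mutating the old one.
    """
    result = dict(table)
    # Apply events in commit order so the latest change per key wins.
    for event in sorted(events, key=lambda e: e["ts"]):
        key = event["id"]
        if event["op"] == "delete":
            result.pop(key, None)  # deleting a missing key is a no-op
        else:
            # "insert" and "update" are both treated as an upsert.
            result[key] = event["row"]
    return result


current = {1: {"status": "new"}, 2: {"status": "new"}}
events = [
    {"op": "update", "id": 1, "ts": 2, "row": {"status": "shipped"}},
    {"op": "insert", "id": 3, "ts": 1, "row": {"status": "new"}},
    {"op": "delete", "id": 2, "ts": 3, "row": None},
]
print(apply_cdc_batch(current, events))
```

Because the old snapshot is left in place, you can still read the table as it was before the batch, which is the time travel half of the pattern.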

The Consolidation That Didn't Happen (Yet)

Everyone expected the data tooling market to consolidate in 2024 after the 2022–2023 funding boom. It happened partially but not as fast as expected. A few mid-tier companies shut down or got acqui-hired; several raised down rounds. But the major platforms (Snowflake, Databricks, dbt Labs, Fivetran) continued investing, and new entrants in adjacent spaces (streaming, AI pipelines, data observability) kept raising.

The categories that showed signs of consolidation: data observability (Monte Carlo and a few large players are winning), orchestration (Airflow + Dagster + Prefect splitting the market more cleanly), and reverse ETL (Census and Hightouch dominating). The categories that remained fragmented: vector databases (still 15+ serious players), data catalogs (lots of enterprise products, no clear winner), and the semantic/metrics layer (dbt Semantic Layer gaining ground but not dominant).

Going into 2025, data engineering looked like a mature industry with some genuinely exciting technical progress (open catalogs, streaming lakehouse, AI-assisted tooling) layered on top of a stable foundation (cloud warehouses, dbt, Airflow variants). The exciting part: agent-driven data engineering was starting to seem less like science fiction and more like a 2025 project.