
State of Data Engineering 2023: The AI Earthquake and the Format Wars

2023 pulled data engineering in two directions at once. On one side, the long-running story of cloud data warehouses, dbt, and the modern data stack continued to mature, with real consolidation in the tools landscape. On the other, the ChatGPT aftershock arrived in full force: vector databases went from obscure academic tooling to a funded category with 15+ vendors, and every data team was suddenly expected to have an opinion on RAG architectures.

Meanwhile, the open table format war between Apache Iceberg and Delta Lake reached a critical moment — and the DuckDB revolution quietly changed how a generation of engineers thought about local-first analytics. It was a year of genuine technical progress obscured by enormous amounts of hype. Let's separate the two.

The ChatGPT Aftershock Hits Data Engineering

ChatGPT launched at the end of 2022, but the actual impact on data engineering practice hit in 2023. Three changes were immediate and real:

1. Text-to-SQL became viable. Every BI vendor (Looker, Tableau, Power BI, Metabase) announced or shipped natural language query features in 2023. Some were LLM wrappers over their existing SQL engines; some were more sophisticated. The quality was inconsistent but improving. For the first time, non-technical users could sometimes get answers from a data warehouse without waiting for an analyst. The "will LLMs replace data analysts?" debate started in earnest.

2. Vector databases exploded. The core use case driving vector DB adoption was RAG (Retrieval-Augmented Generation), which grounds LLM responses in enterprise knowledge. To build RAG systems, you needed to store and query text embeddings efficiently. Vector DBs (Pinecone, Weaviate, Qdrant, Milvus, Chroma) went from "niche ML tooling" to "everyone's evaluating something" in about six months. Pinecone raised $100M. Weaviate raised $50M. Qdrant (open-source) grew to millions of downloads. pgvector went from an obscure Postgres extension to a serious contender.

3. Data pipelines for AI became a real workload type. Embedding generation, chunking strategies, semantic cache invalidation, model evaluation data management — these weren't things data engineers had to care about before 2023. By the end of 2023, many data teams owned the data infrastructure for their organization's LLM applications. The ML engineer and data engineer roles started blurring.
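To make "store and query text embeddings" from item 2 concrete, here is a minimal sketch of the lookup a vector database performs: rank stored vectors by cosine similarity to a query vector. The document IDs and 3-dimensional vectors are invented for illustration. Real embeddings come from an embedding model and have hundreds or thousands of dimensions, and production systems typically use approximate nearest-neighbor indexes rather than this brute-force scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, index, k=2):
    """Brute-force search: score every stored vector, return the k best."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy "embeddings" keyed by document ID.
index = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.0],
    "api-auth": [0.0, 0.1, 0.9],
}

results = top_k([0.8, 0.2, 0.0], index, k=2)
for doc_id, score in results:
    print(doc_id, round(score, 3))
```

In a RAG system, the text behind the top-ranked IDs would then be passed to the LLM as context.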

The RAG data engineering problem: Building a vector store is easy. Keeping it fresh isn't. When source documents change, embeddings need to be regenerated for the changed chunks, deleted for removed content, and added for new content. This is an incremental data pipeline problem — exactly the kind of thing data engineers know how to solve. The fact that the payload is embeddings rather than revenue figures doesn't change the underlying pipeline challenge.
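One way to frame that freshness problem is as a diff between the current source chunks and what the vector store already holds. The sketch below assumes each chunk has a stable ID and that a content hash is stored alongside each embedding. Both are assumptions for illustration, as are the function names. It only plans the work; calling the embedding model and the vector store is left out.

```python
import hashlib


def chunk_hash(text):
    """Stable fingerprint of a chunk's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_sync(source_chunks, stored_hashes):
    """Compare current source chunks with what the vector store holds.

    source_chunks: chunk_id -> current text
    stored_hashes: chunk_id -> content hash recorded at last embedding time
    Returns the chunk IDs to embed and add, re-embed, and delete.
    """
    to_add, to_update = [], []
    for chunk_id, text in source_chunks.items():
        if chunk_id not in stored_hashes:
            to_add.append(chunk_id)          # new content
        elif stored_hashes[chunk_id] != chunk_hash(text):
            to_update.append(chunk_id)       # changed content
    # Anything in the store that no longer exists in the source is stale.
    to_delete = [cid for cid in stored_hashes if cid not in source_chunks]
    return {
        "add": sorted(to_add),
        "update": sorted(to_update),
        "delete": sorted(to_delete),
    }


# Hypothetical state: what was embedded last run vs. what the source says now.
stored = {
    "doc1#0": chunk_hash("Refunds take 5 days."),
    "doc1#1": chunk_hash("Contact support."),
    "doc2#0": chunk_hash("Old pricing page."),
}
source = {
    "doc1#0": "Refunds take 3 days.",   # edited
    "doc1#1": "Contact support.",       # unchanged
    "doc3#0": "New onboarding guide.",  # new
}

plan = plan_sync(source, stored)
print(plan)
```

Only the chunks in "add" and "update" need a call to the embedding model, which is where most of the cost sits. Unchanged chunks are skipped entirely.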

Databricks Acquires MosaicML: Shots Fired

In June 2023, Databricks announced its acquisition of MosaicML for $1.3 billion. MosaicML was known for efficient LLM training: its MPT models could be fine-tuned or pre-trained on commodity clusters at a fraction of the cost of typical GPT-scale training. The message was clear: Databricks was positioning itself as the platform where enterprises would train and deploy their own LLMs, not just use OpenAI's APIs.

This acquisition catalyzed the "build vs buy" conversation at large enterprises. If you could fine-tune a capable open model on your own data, on your own Databricks cluster, without sending data to a third-party API — that was compelling for regulated industries. The Databricks-Microsoft (Azure OpenAI) competition got real in the second half of 2023.

Microsoft Fabric GA: Another Platform to Think About

November 2023: Microsoft Fabric reached General Availability. Fabric is Microsoft's attempt at a unified analytics platform — combining Azure Data Factory, Azure Synapse, Power BI, and Purview into a single product with a shared data storage layer (OneLake, built on ADLS Gen2 + Delta format).

The reaction in the data engineering community was... cautious. Fabric's promise of a unified experience was appealing. But the reality of migrating existing Azure data stacks to Fabric was complex, the product had rough edges at GA, and the licensing model (Fabric SKUs replacing individual service billing) was confusing. The teams most enthusiastic about Fabric were those heavily invested in Power BI — for them, the integration was genuinely compelling.

The Open Table Format War: Iceberg vs Delta Gets Hot

The Iceberg vs Delta debate intensified in 2023, with real stakes:

Factor | Apache Iceberg | Delta Lake
--- | --- | ---
Backing | Apache (Apple, Netflix, Tabular origins) | Databricks (open-source, Linux Foundation)
Catalog support | REST, Glue, Hive, Nessie, Polaris | Primarily Databricks / Unity Catalog
Engine support | Spark, Flink, Trino, Dremio, Snowflake | Spark, Flink, Trino (via connector), Databricks
Multi-cloud neutrality | Strong | Better in 2023 with open-source moves
Merge performance | Improving (copy-on-write vs MOR) | Strong with Databricks Photon

The practical outcome: if you're on Databricks, Delta is the natural choice and Unity Catalog works seamlessly with it. If you're on Snowflake, Trino, or a multi-engine setup, Iceberg is better supported. Teams trying to be "engine agnostic" mostly picked Iceberg. The format war didn't end in 2023 — but the consensus was forming that Iceberg had better open-ecosystem momentum.

DuckDB: The Quiet Revolution

If you had to pick one technology that "won" 2023 based on developer enthusiasm-to-cost ratio, it was DuckDB. DuckDB is an in-process OLAP database: it runs in your Python process, reads Parquet files from S3, and can process gigabytes of data locally without a cluster. Installing it is one command: pip install duckdb.

The implications were significant:

  • Local development of data transformations became realistic on real data samples without cloud costs
  • dbt-duckdb enabled running dbt models locally against real-sized samples without Snowflake credits
  • MotherDuck (managed DuckDB in the cloud) raised $52M — the first serious "serverless analytics for small data" product
  • The "right-sizing" conversation started: do you really need a Snowflake warehouse for 50GB of data? DuckDB says no.

dbt Semantic Layer: MetricFlow Goes Mainstream

dbt Labs acquired Transform (the MetricFlow company) in early 2023, and over the rest of the year the dbt Semantic Layer became real. The pitch: define your metrics once (revenue, DAU, conversion rate) in YAML, and any downstream tool (BI, notebooks, LLM queries) can query the correct definition without copy-pasting SQL. A semantic layer sits between the data model and the consumption layer, enforcing consistent business definitions regardless of which tool is asking.
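The "define once, query anywhere" idea can be sketched in a few lines. This is not MetricFlow's YAML syntax or the dbt Semantic Layer API. The Metric class, the METRICS registry, and compile_query are hypothetical names used only to show the principle that every consumer asks for a metric by name and gets SQL built from a single shared definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metric:
    name: str
    expression: str  # SQL aggregate that defines the metric
    table: str       # table the aggregate runs against


# Single source of truth for business definitions (hypothetical examples).
METRICS = {
    "revenue": Metric("revenue", "SUM(amount)", "orders"),
    "order_count": Metric("order_count", "COUNT(*)", "orders"),
}


def compile_query(metric_name, group_by=()):
    """Build SQL for a metric from its shared definition."""
    metric = METRICS[metric_name]
    dims = list(group_by)
    select_cols = dims + [f"{metric.expression} AS {metric.name}"]
    sql = f"SELECT {', '.join(select_cols)} FROM {metric.table}"
    if dims:
        sql += f" GROUP BY {', '.join(dims)}"
    return sql


# A dashboard and a notebook request the same metric at different grains.
dashboard_sql = compile_query("revenue", group_by=["region"])
notebook_sql = compile_query("revenue")
print(dashboard_sql)
print(notebook_sql)
```

If the definition of revenue changes (say, to exclude refunds), only the registry entry is edited, and every consumer picks up the new SQL on its next query.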

In practice, the dbt Semantic Layer's integration with BI tools was partial in 2023: Tableau support was limited, and most teams were still copy-pasting metric definitions across dashboards. But the architecture was sound and the trajectory was clear.

2023 was genuinely disorienting. The data engineering skills that were valuable in 2021 — Airflow DAGs, dbt models, Snowflake optimization — remained valuable, but a whole new layer of AI-adjacent skills became necessary simultaneously. The data engineers who thrived were those who could build vector pipelines alongside batch ETL, who understood embedding generation alongside SQL optimization, and who could explain RAG architectures to executives who had just watched a ChatGPT demo and wanted one immediately.