State of Data Engineering 2025: Agents Take the Wheel (Mostly)

2025 was the year we stopped arguing about whether AI agents would change data engineering and started dealing with the reality that they already had. Not in the "robot replaces engineer" way that the hot takes predicted — data engineering remains a deeply technical discipline that requires real architectural judgment. But in the quieter, more pervasive way that changes how you spend your day: agents writing first drafts of dbt models, agents running data quality checks and filing tickets automatically, agents answering "why did this pipeline fail" at 3am instead of paging the on-call engineer.

Simultaneously, the technical infrastructure story of 2025 was about standardization catching up to the innovation of 2022–2024. The Iceberg REST Catalog spec became a genuine interoperability standard. Spark 4.0 shipped with a simplified Python-first API. dbt Mesh gave large organizations a coherent multi-project architecture. The streaming lakehouse pattern, which was "advanced architecture" in 2023, became a default pattern at scale in 2025.

Agent-Driven Data Engineering: What Actually Works

The data engineering agent use cases that worked in production in 2025 weren't the dramatic ones. They were the tedious ones — the kind of work that's important but time-consuming and doesn't require the deep architectural judgment that defines senior data engineering work.

dbt Model Generation

Given a source schema and a business requirement ("create a daily active users model by country from the events table"), agents could reliably generate a dbt model with correct SQL, appropriate tests (not_null, unique, accepted_values), and documentation. The first draft was right roughly 70% of the time; the other 30% needed a human to review business-logic nuances. Net result: models that used to take 2–3 hours to write, test, and document took 30–45 minutes.
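To make the "tests and documentation" part concrete, here is a minimal sketch of the kind of properties file an agent might emit alongside the SQL. Everything in it is an assumption for illustration: the `Column` class, the `render_schema_yml` helper, and the model and column names are hypothetical, and the output uses the long-standing `tests:` key (newer dbt versions also accept `data_tests:`).

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    """Hypothetical description of one column in a generated model."""
    name: str
    description: str
    not_null: bool = False
    unique: bool = False
    accepted_values: list[str] = field(default_factory=list)


def render_schema_yml(model: str, description: str, columns: list[Column]) -> str:
    """Render a dbt properties file declaring generic tests per column."""
    lines = [
        "version: 2",
        "models:",
        f"  - name: {model}",
        f'    description: "{description}"',
        "    columns:",
    ]
    for col in columns:
        lines.append(f"      - name: {col.name}")
        lines.append(f'        description: "{col.description}"')
        tests = []
        if col.not_null:
            tests.append("          - not_null")
        if col.unique:
            tests.append("          - unique")
        if col.accepted_values:
            values = ", ".join(f"'{v}'" for v in col.accepted_values)
            tests.append("          - accepted_values:")
            tests.append(f"              values: [{values}]")
        if tests:
            lines.append("        tests:")
            lines.extend(tests)
    return "\n".join(lines) + "\n"


# Illustrative input: a daily-active-users model keyed by date and country.
example = render_schema_yml(
    "dau_by_country",
    "Daily active users by country",
    [
        Column("dau_key", "Surrogate key of date and country", not_null=True, unique=True),
        Column("activity_date", "UTC calendar date", not_null=True),
        Column("platform", "Client platform", accepted_values=["web", "ios", "android"]),
    ],
)
```

The human review step is still where the value is: the generator can declare that `dau_key` is unique, but only someone who knows the business can say whether "active" means a login or any event.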

Pipeline Debugging

The dbt Cloud, Databricks, and Airflow observability integrations with AI assistants matured enough that "why did this job fail?" had a useful agent-generated answer most of the time. The agent would look at the error, the DAG structure, recent schema changes, and upstream data quality — and produce a diagnosis that was correct in the common cases (schema mismatch, null values in a not-null column, upstream table missing) and a useful starting point in the complex cases.
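The "correct in the common cases, a starting point otherwise" behavior can be sketched as a simple triage step. This is not how any of the vendor integrations are implemented; the rule names, regular expressions, and the `diagnose` function below are hypothetical and only illustrate the shape of the logic.

```python
import re

# Hypothetical rules for the three common failure classes named above.
# Order matters: the first matching rule wins.
RULES = [
    ("schema_mismatch",
     re.compile(r"column .* (does not exist|not found)|invalid identifier", re.I)),
    ("null_in_not_null_column",
     re.compile(r"null value in column|not[- ]null constraint", re.I)),
    ("upstream_table_missing",
     re.compile(r"(table|relation|object) .* (does not exist|not found)", re.I)),
]


def diagnose(error_message: str, recent_schema_changes: list[str]) -> dict:
    """Classify an error message and attach schema changes it mentions."""
    lowered = error_message.lower()
    for label, pattern in RULES:
        if pattern.search(error_message):
            related = [t for t in recent_schema_changes if t.lower() in lowered]
            return {
                "diagnosis": label,
                "related_schema_changes": related,
                "confident": True,
            }
    # Complex case: no rule matched, so hand the engineer all the context.
    return {
        "diagnosis": "unknown",
        "related_schema_changes": list(recent_schema_changes),
        "confident": False,
    }
```

A real agent would also weigh the DAG structure and upstream data quality, but the split is the same: a confident label when the pattern is familiar, and raw context when it is not.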

Data Quality Anomaly Investigation

Monte Carlo, Acceldata, and the dbt-native data quality tools all shipped agent features that could investigate anomalies automatically: run the relevant dbt tests, check upstream pipeline run history, query the lineage graph, and produce a root-cause hypothesis. This removed 40–60% of the manual investigation work from on-call data engineers.
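The lineage part of that investigation is easy to picture in code. The sketch below assumes a hypothetical in-memory lineage map (model name to direct upstream parents) and a run-status map; real tools query their own metadata stores, and the function name is invented for illustration.

```python
from collections import deque


def root_cause_candidates(
    lineage: dict[str, list[str]],
    run_status: dict[str, str],
    anomalous: str,
) -> list[str]:
    """Walk upstream from an anomalous model, breadth-first.

    A node is a candidate when its last run failed and none of its direct
    parents failed, meaning the failure most likely originates there.
    """
    seen = {anomalous}
    queue = deque([anomalous])
    candidates = []
    while queue:
        node = queue.popleft()
        parents = lineage.get(node, [])
        failed = run_status.get(node) == "failed"
        if failed and not any(run_status.get(p) == "failed" for p in parents):
            candidates.append(node)
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return candidates
```

With `revenue_metrics` flagged as anomalous, `orders_daily` failed, and `raw_orders` failed upstream of it, the function returns only `raw_orders`: the downstream failure is treated as a symptom, not a cause.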

What agents still couldn't do reliably in 2025: architecture design for new data domains, performance optimization that requires understanding query execution plans, complex CDC (change data capture) pipeline design, and anything that requires understanding undocumented business rules that live in someone's head. The ceiling on agent autonomy is the quality of your documentation and data contracts: agents work with what's written down.

Spark 4.0: The Python-First Era

Apache Spark 4.0, released in May 2025, doubled down on PySpark as the interface most new work starts in, even though the Scala and Java APIs remain fully supported. DataFrame API improvements, structured streaming usability enhancements, and the Spark Connect architecture (a thin client talking to a remote Spark server) changed how Spark was deployed and used:

  • Spark Connect GA: A language-agnostic protocol for sending Spark plans to a remote server. You can use the PySpark DataFrame API or the pandas API on Spark from a thin client, with Spark doing the heavy lifting on the server side. Local development on a laptop connects to a remote cluster without a full local Spark installation.
  • Python UDFs with Pandas/Arrow: Vectorized UDFs using PyArrow (pandas UDFs, which date back to Spark 2.3, plus the newer Arrow-optimized Python UDFs) are first-class in Spark 4.0, with significantly better performance than the row-at-a-time Python UDFs that were the original PySpark UDF story.
  • Improved SQL compatibility: Spark 4.0 turns ANSI SQL mode on by default, so invalid casts and arithmetic overflow raise errors instead of silently returning NULL or wrapping around. That reduces the "works in Snowflake, breaks in Spark" frustration for teams that run SQL across multiple engines.
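One concrete place the ANSI difference shows up is invalid casts: under ANSI rules a bad cast is an error, while Spark's legacy behavior returned NULL. The snippet below is a toy model in plain Python, not Spark code; `cast_to_int` is a hypothetical stand-in for `CAST(value AS INT)` meant only to show the two behaviors side by side.

```python
from typing import Optional


def cast_to_int(value: str, ansi: bool = True) -> Optional[int]:
    """Toy model of CAST(value AS INT) under ANSI vs. legacy semantics."""
    try:
        return int(value.strip())
    except ValueError:
        if ansi:
            # ANSI mode: surface bad data as an error at query time.
            raise ValueError(f"invalid input for INT cast: {value!r}") from None
        # Legacy mode: swallow the problem and return NULL (None here).
        return None
```

The legacy behavior is why a pipeline could "succeed" while quietly filling a column with NULLs; the ANSI behavior fails loudly, which is what most other SQL engines do.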

The Iceberg REST Catalog Standard

The Apache Iceberg REST Catalog specification matured into something that actually enabled engine interoperability in 2025. Snowflake, Databricks, AWS Glue, Polaris (the Snowflake-donated open-source catalog), and several other vendors all implemented the REST Catalog spec. The result: a table registered in one catalog could be read by any engine that supported the spec, without vendor-specific configuration.

This was the closest data engineering had come to the "write once, read anywhere" vision of open table formats being fully realized. It wasn't frictionless — different implementations had different feature coverage, authorization models varied, and the performance of REST Catalog calls at scale was a real concern. But the direction was clear: catalogs were commoditizing, and the value was moving up the stack to the query engine and governance layers.
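Part of why the spec works as an interoperability layer is that every engine asks for a table the same way. As a rough sketch, the load-table route has the shape `/v1/{prefix}/namespaces/{namespace}/tables/{table}`, with multi-level namespaces joined by the unit separator character. The helper name and the catalog host below are hypothetical, and authentication and the actual HTTP call are left out.

```python
from urllib.parse import quote

# The REST Catalog spec joins multi-part namespaces with the 0x1F byte.
UNIT_SEPARATOR = "\x1f"


def load_table_url(
    base_url: str,
    namespace: tuple[str, ...],
    table: str,
    prefix: str = "",
) -> str:
    """Build the URL an engine would GET to load a table's metadata."""
    encoded_namespace = quote(UNIT_SEPARATOR.join(namespace), safe="")
    parts = [base_url.rstrip("/"), "v1"]
    if prefix:
        # The prefix is optional and normally comes from the catalog's config.
        parts.append(quote(prefix, safe=""))
    parts += ["namespaces", encoded_namespace, "tables", quote(table, safe="")]
    return "/".join(parts)


# Hypothetical catalog host, for illustration only.
url = load_table_url(
    "https://catalog.example.com", ("analytics", "commerce"), "orders_daily"
)
```

The friction mentioned above lives outside this URL: which optional endpoints a given catalog implements, and how it decides whether the caller is allowed to see the table at all.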

dbt Mesh: Multi-Project Architecture for Large Organizations

dbt Mesh, which dbt Labs had been building since 2023, became production-ready and widely adopted at larger organizations in 2025. The problem it solves: a single dbt project with 500+ models is ungovernable. Everything depends on everything; schema changes break downstream models; different teams have different deployment cadences but are blocked by a shared CI/CD pipeline.

dbt Mesh introduces cross-project references: Team A's project can depend on a public model in Team B's project via a formal contract interface. Team B can iterate on the internals of their models without breaking Team A, as long as the public interface (schema, column names, semantics) is preserved. This is essentially API versioning for data models.

graph LR
    subgraph Platform["Platform Team"]
        Raw["raw_* staging models\n(public interface v1.2)"]
    end

    subgraph Commerce["Commerce Team"]
        Orders["orders_daily\n(depends on raw_orders)"]
        Revenue["revenue_metrics\n(public interface v2.0)"]
    end

    subgraph Finance["Finance Team"]
        Reporting["monthly_reporting\n(depends on revenue_metrics v2.0)"]
    end

    Raw --> Orders
    Orders --> Revenue
    Revenue --> Reporting

    style Platform fill:#1e3a5f,stroke:#2d5a8e
    style Commerce fill:#0d1f35,stroke:#2d5a8e
    style Finance fill:#0a1628,stroke:#2d5a8e
          

dbt Mesh: separate dbt projects with formal cross-project dependencies via public model interfaces. Platform team owns staging models; Commerce team owns their domain models; Finance team consumes the public revenue interface. Each team deploys independently.
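The "API versioning for data models" idea boils down to a compatibility check between the published interface and a proposed change. dbt enforces model contracts itself at build time; the sketch below is only an illustration of the rule, with a hypothetical `breaking_changes` function and made-up column names and types.

```python
def breaking_changes(published: dict[str, str], proposed: dict[str, str]) -> list[str]:
    """Compare a published column contract with a proposed schema.

    Removing or retyping a published column breaks consumers.
    Adding a new column does not.
    """
    problems = []
    for column, dtype in published.items():
        if column not in proposed:
            problems.append(f"removed column: {column}")
        elif proposed[column] != dtype:
            problems.append(
                f"type change on {column}: {dtype} -> {proposed[column]}"
            )
    return problems


# Illustrative contract for the public revenue interface.
published_v2 = {"order_date": "date", "revenue": "numeric"}
proposed = {"order_date": "date", "revenue": "float", "currency": "text"}
issues = breaking_changes(published_v2, proposed)
```

Here the new `currency` column is fine, but the `revenue` type change is flagged, which is the point at which the Commerce team would publish a new version instead of changing v2.0 under the Finance team's feet.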

The Streaming Lakehouse Becomes Default at Scale

In 2023, "streaming lakehouse" was an architectural pattern that advanced teams were experimenting with. By 2025, it was the default architecture for data platforms ingesting more than a terabyte of data per day. The pattern:

  1. Events stream into Kafka or Kinesis
  2. Flink processes events and writes to Iceberg tables in S3/GCS/ADLS using upsert semantics (the keyed write that Spark SQL expresses as MERGE INTO)
  3. dbt models transform the Iceberg landing tables via batch jobs (hourly or daily)
  4. BI tools query the transformed tables; high-urgency use cases query the Iceberg landing tables directly
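The keyed write in step 2 (an upsert or merge, depending on the engine) is what keeps the landing table correct when events arrive late or out of order. Here is a minimal sketch of that rule in plain Python, assuming hypothetical `id` and `updated_at` fields; a real sink applies the same idea through the table format's own delete and insert files.

```python
def apply_upserts(
    table: dict[str, dict],
    events: list[dict],
    key: str = "id",
    ts: str = "updated_at",
) -> dict[str, dict]:
    """Apply keyed events to a table snapshot; the newest timestamp wins."""
    result = dict(table)  # shallow copy so the input snapshot is untouched
    for event in events:
        current = result.get(event[key])
        if current is None or event[ts] >= current[ts]:
            result[event[key]] = event
    return result
```

An event that shows up after a newer one for the same key is simply ignored, so replaying a Kafka partition does not roll a row back to an older state.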

What made this practical at scale in 2025 was better small-file compaction in Iceberg, more reliable checkpointing in Flink, and managed services that got good enough that you didn't need a Flink expert to run this in production (Amazon Managed Service for Apache Flink, Confluent Cloud for Apache Flink).
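Small-file compaction matters because a streaming writer commits many tiny files, and every query pays to open each one. The planning step can be sketched as grouping small files into rewrite batches near a target size. This loosely mirrors the bin-packing idea behind Iceberg's data file rewrites but is not its implementation; the function name, thresholds, and sizes are assumptions for illustration.

```python
def plan_compaction(
    file_sizes_mb: list[int],
    target_mb: int = 512,
    small_mb: int = 64,
) -> list[list[int]]:
    """Group files smaller than small_mb into batches of at most target_mb."""
    small = sorted((s for s in file_sizes_mb if s < small_mb), reverse=True)
    groups: list[list[int]] = []
    current: list[int] = []
    current_size = 0
    for size in small:
        if current and current_size + size > target_mb:
            groups.append(current)
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    # Rewriting a single file on its own gains nothing, so drop those groups.
    return [group for group in groups if len(group) > 1]
```

For example, `plan_compaction([600, 10, 20, 30, 5], target_mb=50)` leaves the 600 MB file alone and plans two rewrites, `[30, 20]` and `[10, 5]`.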

What Data Engineering Looks Like in 2026

Standing at the end of 2025, a few patterns seem structurally stable:

  • The data engineer's job is increasingly about architecture and governance, less about pipeline construction. Agents handle first-draft pipeline code; engineers design the system, set the standards, and review.
  • The lakehouse is the default pattern. Pure cloud data warehouses still exist, but most new architectures land on open table formats in cloud storage with a compute engine (Spark, Trino, Snowflake on Iceberg tables, DuckDB) on top.
  • Interoperability is winning. The vendor lock-in game of 2018–2022 has largely been defeated by open formats and open catalogs. The new lock-in is at the governance/AI layer.
  • Data contracts are real governance. Not everywhere, but the teams running data contracts in production have noticeably fewer incident bridges and data quality escalations.

Five years ago, data engineering was building ETL pipelines in Spark and managing Hadoop clusters. Today it's designing governance models, implementing agent-assisted quality systems, and making architectural choices about which of 20 competing open standards to bet on. The tools got dramatically better; the judgment calls got harder. Wouldn't have it any other way.