Two things happened in 2022 that defined the next few years of data engineering. First: data mesh went from concept to controversy. Zhamak Dehghani published her book "Data Mesh" in March 2022, and every data organization with more than 50 engineers suddenly had a mandate to "do data mesh" — often without understanding what that meant or what it would cost. Second: on November 30, 2022, OpenAI launched ChatGPT. It didn't matter to data engineers yet. But it would, very soon.
In between those two events, the mundane realities of scaled modern data stacks caught up with the hype. Data quality became a real crisis at organizations that had spent two years loading everything into Snowflake and BigQuery. Streaming became more accessible but didn't replace batch for most teams. The format wars between Iceberg, Delta, and Hudi started heating up. And data contracts — the idea that data producers should make explicit commitments to downstream consumers — emerged as a practical response to the data mesh governance problem.
Data Mesh: The Book, the Hype, and the Hard Reality
Zhamak Dehghani's "Data Mesh" (O'Reilly, March 2022) codified four principles into a coherent framework: domain ownership, data as a product, self-serve infrastructure, and federated computational governance. The book was excellent and the framework was well-reasoned. What followed in practice was messier.
Organizations that actually attempted data mesh implementations in 2022 discovered that the hard part wasn't the technology — it was the organizational change. "Domain ownership" meant convincing the marketing team to own and maintain their data pipelines. The marketing team, reasonably, did not want to own and maintain data pipelines. They wanted their data to be correct, available, and someone else's problem.
The self-serve infrastructure platform piece was equally challenging. Building a platform that makes it genuinely easy for non-data-engineers to publish, document, and maintain data products requires significant investment. Most companies that tried to DIY it in 2022 underestimated the effort by 2–5x. The companies that succeeded were mostly large tech companies with dedicated platform teams of 10–20 engineers. Everyone else was improvising.
The data mesh trap: Data mesh is an organizational operating model, not a technology. You can't buy data mesh from a vendor. You can't implement it in a sprint. Organizations that treated it as a technology project in 2022 spent a lot of money on tooling and training, achieved some of the structural changes (domain catalogs, ownership models), and then stalled when the cultural change required actual authority redistribution. Data mesh requires executives to accept that the central data team will lose control of data — and that's a harder sell than any technical architecture.
Data Contracts: The Practical Response
If data mesh is the organizational philosophy, data contracts are the operational mechanism. The idea: upstream teams (data producers) make formal, versioned commitments about the schema, semantics, and SLAs of the data they publish. Downstream teams (consumers) can rely on these contracts. Breaking changes require contract negotiation, not surprise schema drift.
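One way to make "breaking change" concrete is to diff two versions of a contract's schema and map the result to a version bump. The sketch below is illustrative only: the function names, the dict-based schema shape, and the major/minor/patch policy are assumptions for this example, not part of any specific contract tool.

```python
from typing import Dict, List

Schema = Dict[str, str]  # column name -> declared type (hypothetical shape)


def breaking_changes(old: Schema, new: Schema) -> List[str]:
    """Return reasons why `new` would break consumers relying on `old`."""
    problems = []
    for column, old_type in old.items():
        if column not in new:
            problems.append(f"column removed: {column}")
        elif new[column] != old_type:
            problems.append(f"type changed: {column} {old_type} -> {new[column]}")
    return problems


def required_bump(old: Schema, new: Schema) -> str:
    """Map a schema diff to a semantic-version bump (assumed policy)."""
    if breaking_changes(old, new):
        return "major"  # removed or retyped columns need negotiation
    if set(new) - set(old):
        return "minor"  # additive change: new columns only
    return "patch"


v2 = {"order_id": "STRING", "order_date": "DATE", "total_amount": "DECIMAL(18,2)"}
v3 = {**v2, "order_date": "TIMESTAMP", "currency": "STRING"}

print(breaking_changes(v2, v3))
print(required_bump(v2, v3))
```

A CI job could refuse to merge a contract change whose version bump is smaller than the one this kind of diff calls for, which turns "surprise schema drift" into a reviewable pull request.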
Chad Sanderson's writing at Convoy popularized the idea in 2022. Teams enforced contracts with existing data quality tools such as Soda and Great Expectations, and custom YAML contract schemas started appearing. The most practical implementation pattern was simple: a contract is a YAML file in a git repo that specifies column names, types, not-null constraints, and expected freshness. CI/CD validates new data against the contract before promoting it to production.
# Example data contract (2022 style)
apiVersion: v1
kind: DataContract
metadata:
  name: orders_daily
  owner: team-commerce
  version: 2.1.0
schema:
  - name: order_id
    type: STRING
    required: true
    unique: true
  - name: order_date
    type: DATE
    required: true
  - name: customer_id
    type: STRING
    required: true
  - name: total_amount
    type: DECIMAL(18,2)
    required: true
sla:
  freshness: 4 hours
  completeness: 99.9%
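The CI check described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the contract is mirrored as a dict so no YAML parser is needed, the `PY_TYPES` mapping and the `validate` function are hypothetical names, rows are plain dicts, and the completeness SLA is left out for brevity.

```python
from datetime import date, datetime, timedelta, timezone
from decimal import Decimal

# The contract, mirrored as a dict (assumption: a real setup would load the YAML file).
CONTRACT = {
    "name": "orders_daily",
    "schema": [
        {"name": "order_id", "type": "STRING", "required": True, "unique": True},
        {"name": "order_date", "type": "DATE", "required": True},
        {"name": "customer_id", "type": "STRING", "required": True},
        {"name": "total_amount", "type": "DECIMAL(18,2)", "required": True},
    ],
    "sla": {"freshness": timedelta(hours=4)},
}

# Hypothetical mapping from contract types to Python types.
PY_TYPES = {"STRING": str, "DATE": date, "DECIMAL(18,2)": Decimal}


def validate(rows, loaded_at, now, contract=CONTRACT):
    """Return a list of violations; an empty list means the batch passes."""
    violations = []
    for col in contract["schema"]:
        name, expected = col["name"], PY_TYPES[col["type"]]
        values = [row.get(name) for row in rows]
        if col.get("required") and any(v is None for v in values):
            violations.append(f"{name}: null or missing value")
        if any(v is not None and not isinstance(v, expected) for v in values):
            violations.append(f"{name}: expected {col['type']}")
        present = [v for v in values if v is not None]
        if col.get("unique") and len(present) != len(set(present)):
            violations.append(f"{name}: duplicate values")
    if now - loaded_at > contract["sla"]["freshness"]:
        violations.append("sla: data is older than freshness window")
    return violations


now = datetime(2022, 6, 1, 12, 0, tzinfo=timezone.utc)
rows = [
    {"order_id": "A1", "order_date": date(2022, 6, 1),
     "customer_id": "C9", "total_amount": Decimal("19.99")},
    {"order_id": "A1", "order_date": date(2022, 6, 1),
     "customer_id": None, "total_amount": Decimal("5.00")},
]
violations = validate(rows, loaded_at=now - timedelta(hours=6), now=now)
for v in violations:
    print(v)
```

In a pipeline, a non-empty violation list would fail the job and block promotion, and the `owner` field in the contract tells you which team to notify.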
Data contracts didn't solve data quality — they shifted accountability. With a contract, you knew who to call when data broke. Without one, you were debugging a mystery that could have been caused by anyone in the pipeline.
Streaming Grows Up (A Little)
Kafka was everywhere in 2022, but managing Kafka was still painful enough that many teams chose managed alternatives. Confluent Cloud matured significantly, MSK Serverless (AWS) moved from preview to general availability, and Redpanda emerged as a Kafka-compatible alternative built in C++, promising lower latency and simpler operations by eliminating ZooKeeper and the JVM.
The more interesting streaming story of 2022 was the attempt to make streaming accessible through SQL instead of hand-written stream-processing code. Apache Flink's SQL surface improved; Materialize built a Postgres-compatible streaming SQL database; ksqlDB made Kafka Streams accessible via SQL. The vision: analysts who knew SQL should be able to write streaming queries without learning Scala or Java stream processing frameworks.
Reality check: most analytics use cases in 2022 did not need true streaming. Near-real-time (5-minute microbatches via Spark Structured Streaming or dbt + Airflow) handled the actual business requirements, and teams that invested in full streaming architectures often found themselves maintaining complex infrastructure for requirements that batch-plus-incremental could have satisfied.
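The batch-plus-incremental pattern mentioned above usually comes down to a high-water mark: each run processes only rows changed since the last run and upserts them into the target. The sketch below is a toy version with in-memory lists and dicts standing in for the source table, the target table, and the state store; the function name and the `updated_at` and `order_id` columns are assumptions for illustration.

```python
from datetime import datetime, timedelta


def run_microbatch(source_rows, state, target):
    """Process only rows newer than the stored high-water mark."""
    watermark = state.get("watermark", datetime.min)
    new_rows = [r for r in source_rows if r["updated_at"] > watermark]
    for row in new_rows:
        # Upsert keyed on order_id, so re-processing a row is idempotent.
        target[row["order_id"]] = row
    if new_rows:
        state["watermark"] = max(r["updated_at"] for r in new_rows)
    return len(new_rows)


t0 = datetime(2022, 6, 1, 12, 0)
source = [
    {"order_id": "A1", "updated_at": t0, "status": "placed"},
    {"order_id": "A2", "updated_at": t0 + timedelta(minutes=2), "status": "placed"},
]
state, target = {}, {}

first = run_microbatch(source, state, target)   # initial run sees both rows

# Five minutes later, A1 has been updated upstream.
source.append({"order_id": "A1", "updated_at": t0 + timedelta(minutes=6), "status": "shipped"})
second = run_microbatch(source, state, target)  # only the changed row is processed

print(first, second, target["A1"]["status"])
```

A scheduler would call something like this every five minutes. One known limitation of the strict "newer than watermark" filter is that late-arriving rows with older timestamps are skipped, which is why real pipelines often reprocess a small lookback window.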
The Great Tech Layoffs and Their Effect on Data
2022 was also the year of mass tech layoffs — Meta, Twitter, Amazon, Stripe, and dozens of others announced significant cuts starting in mid-2022. Data teams were hit proportionally. This had a paradoxical effect: it forced the "do more with less" mandate on data organizations, which actually accelerated adoption of the modern data stack's efficiency benefits. Teams that lost 30% of their headcount leaned harder into dbt, Fivetran, and automation to maintain output.
It also created a buyer's market for data engineering talent for the first time in years, and a wave of consultants and boutique agencies formed from laid-off data engineers. Some of the best dbt core contributors came from this cohort.
ChatGPT Drops: End of November, End of an Era
November 30, 2022. ChatGPT launches. One million users in five days. Data engineers collectively said "huh, that's impressive" and went back to debugging their Airflow DAGs. It felt like a product demo, not a paradigm shift. We were wrong.
The immediate effect on data engineering was subtle. GitHub Copilot, which had launched as a technical preview in June 2021, became noticeably better at generating SQL and dbt YAML. A few data teams started experimenting with LLM-assisted documentation generation. But the full impact of LLMs on data engineering (vector databases, RAG, AI-assisted SQL everywhere) would take another 12 months to materialize.
What we didn't see coming at the end of 2022: by the end of 2023, "write me a SQL query to..." would be a legitimate first step before opening your editor, vector databases would become a VC-funded gold rush, and the question of whether LLMs would replace data analysts would be earnestly debated at every data conference.
The Tools That Defined 2022
- dbt v1.0 → v1.3: Metrics layer introduced. Python models arrived in dbt-core v1.3, allowing non-SQL transformations within the dbt DAG.
- Databricks and 8080 Labs (bamboolib): The acquisition was announced in October 2021 and set the tone for 2022: low-code data transformation in notebooks, a sign of the "analytics democratization" push.
- Monte Carlo, Acceldata, Anomalo: Data observability became a funded category. Data quality monitoring was no longer a DIY Great Expectations project — it was a SaaS vertical.
- Iceberg vs Delta: Apple and Netflix championed Iceberg; Databricks championed Delta. Both gained traction; the format wars would not be resolved in 2022.
- Dagster 1.0: Software-defined assets became the mental model. Thinking in data assets rather than tasks or jobs was a genuine shift in how engineers reasoned about pipelines.
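To illustrate the asset-oriented mental model from the last bullet, here is a toy sketch in plain Python. It is not Dagster's API: the `asset` decorator, the `ASSETS` registry, and the `materialize` function are hypothetical stand-ins written for this example. The point is only that you declare what each dataset is and what it depends on, and the execution order falls out of the dependency graph.

```python
import inspect

ASSETS = {}  # hypothetical registry: asset name -> function that computes it


def asset(fn):
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


@asset
def raw_orders():
    return [{"order_id": "A1", "amount": 20}, {"order_id": "A2", "amount": 5}]


@asset
def large_orders(raw_orders):
    return [o for o in raw_orders if o["amount"] >= 10]


@asset
def large_order_count(large_orders):
    return len(large_orders)


def materialize(name, cache=None):
    """Compute an asset after recursively computing everything it depends on."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn = ASSETS[name]
        upstream = {dep: materialize(dep, cache)
                    for dep in inspect.signature(fn).parameters}
        cache[name] = fn(**upstream)
    return cache[name]


print(materialize("large_order_count"))
```

Nothing here says "run task A, then task B". Asking for `large_order_count` is enough, because the graph is implied by the asset definitions themselves.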
2022 was the year data engineering grew up and got complicated. The simple "load it into Snowflake and query it" story of 2021 collided with the real-world messiness of scale: quality problems, organizational ownership conflicts, format fragmentation, and the first hints of an AI disruption that nobody quite believed yet.