โ† Back to Blog

CoddSpeed: Inside Microsoft Fabric's GPU-Accelerated Query Engine

๐Ÿ† SIGMOD 2026 Best Industry Paper โ€” Interlandi, Bruno, Haynes, Curino, Sen et al., Microsoft Research & Fabric Engineering

At Microsoft Build 2026, Microsoft announced GPU-accelerated query processing in Fabric Data Warehouse โ€” no query rewrites, no schema changes, no new data pipelines. Just faster SQL. The engineering underneath that demo is called CoddSpeed, and it is simultaneously a shipping product and a SIGMOD 2026 Best Industry Paper. The paper is unusually candid about what it actually took to go from a research prototype running on PyTorch to a production engine maintained by 60+ engineers that delivers 30ร— speedups at multi-node scale.

This article walks the full technical stack: the Tensor Query Processor research lineage that started it, the two abstraction layers that made it production-viable, how the GPU execution pipeline works, how the optimizer routes fragments without rewriting queries, and what the benchmark numbers actually mean. There is a lot to unpack โ€” this is one of the more substantive systems papers to come out of a cloud vendor in years.

The Problem: CPUs Are the Wrong Tool for Analytical Scans

The core contention of CoddSpeed is not new, but the paper states it precisely. Modern analytical queries are dominated by a small set of operations โ€” table scans with predicate evaluation, hash joins, group-by aggregations, and arithmetic projections โ€” that share a common shape: they process large amounts of data with low computational complexity per byte. The work is memory-bandwidth-bound, not compute-bound.

A server-class GPU like an H100 has roughly 3.35 TB/s of HBM bandwidth. A dual-socket CPU server peaks around 300โ€“400 GB/s of DDR5 bandwidth โ€” an order of magnitude lower. For operations that scan and filter hundreds of gigabytes, the GPU's memory subsystem is structurally the better fit, if you can keep it fed and avoid drowning the gains in PCIe transfer overhead. CoddSpeed's architecture is fundamentally an answer to both of those conditions.

The Research Lineage: TQP

CoddSpeed does not emerge from nowhere. Its GPU execution engine is derived from TQP (Tensor Query Processor), a Microsoft Research project that started with a provocative question: what if you compiled SQL queries into tensor operations and ran them on deep learning hardware? The original 2022 VLDB paper showed that PyTorch's tensor runtime โ€” designed for neural network training โ€” could execute relational operators with competitive performance, achieving up to 20ร— speedups over CPU-only systems on standard benchmarks.

TQP's key insight was that relational algebra and tensor algebra share structural similarities: both involve element-wise transformations, reductions across axes, and positional indexing. A filter predicate is an element-wise boolean mask. A group-by aggregation is a scatter-reduce. A hash join is a lookup table construction followed by a gather. PyTorch's primitives cover all of these, and because PyTorch targets GPUs, the query processor inherits GPU acceleration essentially for free.

The academic prototype worked. The production problem was everything around it: how do you embed a PyTorch-based executor inside a distributed SQL engine? How do you manage GPU memory when queries arrive concurrently? How do you handle the 30โ€“40% of real-world SQL constructs that don't map cleanly to tensor ops? And how do you do this without rewriting the optimizer, the frontend, or the storage layer? Those are the questions CoddSpeed answers.

Architecture: Two Abstraction Layers

CoddSpeed introduces two complementary abstractions that let the GPU executor slot into Fabric's existing distributed engine without replacing it. The division is clean: one layer handles compute, the other handles data movement.

graph TD
    subgraph FW["Microsoft Fabric Data Warehouse"]
        FE["SQL Frontend\n(parse ยท bind ยท auth)"]
        OPT["Distributed Optimizer\ncost-based, fragment routing\naware of CAL capabilities"]
        subgraph DIST["Distributed Execution Engine"]
            HOST["CPU Host Execution\n(fallback, incompatible ops)"]
            CAL["CAL โ€” Coprocessor Abstraction Layer\nhardware-agnostic sub-plan API\nSerialises โ†’ Substrate plan\nZero-copy Parquet / columnar feed"]
            TQP["GPU Coprocessor\n(TQP / CoddSpeed engine)\nCustom CUDA kernels\nLibTorch memory mgmt"]
        end
        DAL["DAL โ€” Data Abstraction Layer\nunified shuffle & caching\nNVLink ยท PCIe ยท InfiniBand ยท Ethernet\nsingle key/value interface"]
    end

    FE --> OPT
    OPT --> HOST
    OPT --> CAL
    CAL --> TQP
    HOST <--> DAL
    TQP <--> DAL
          

CoddSpeed's two-layer architecture inside Fabric Data Warehouse. CAL (Coprocessor Abstraction Layer) handles compute offload โ€” exposing GPU capabilities to the optimizer and serialising sub-plans for GPU execution. DAL (Data Abstraction Layer) handles data movement, presenting NVLink, InfiniBand, PCIe and Ethernet as a single key/value transport. The CPU host path remains for operations the GPU cannot handle.

CAL: Coprocessor Abstraction Layer

CAL is the interface between Fabric's existing distributed optimizer and the GPU executor. Its design reflects a deliberate principle from the paper: integrate the accelerator alongside the existing engine, don't replace it. CAL exposes the GPU coprocessor as a "capability surface" โ€” a set of operators and operator configurations that can run on GPU. The optimizer queries this surface at plan time to understand what it can offload.

When the optimizer decides to route a fragment to the GPU, CAL serializes that fragment as a Substrate plan โ€” a hardware-neutral intermediate representation โ€” and feeds data to the GPU coprocessor in Parquet or SQL Server columnar format. Where the data is already in columnar layout in memory, the handoff is zero-copy. CAL also handles per-partition fallback: if a specific partition of data arrives in a format or schema that the GPU executor cannot handle, that partition runs on CPU while the rest run on GPU. This partition-level granularity is critical for production reliability โ€” it prevents a single malformed partition from forcing the entire query to CPU.

DAL: Data Abstraction Layer

Data movement is the other half of the challenge. In a distributed warehouse running multi-GPU execution, data shuffles between nodes. CoddSpeed may run across machines connected by NVLink, InfiniBand, PCIe, or plain Ethernet โ€” sometimes all four in the same query execution. Writing shuffle logic for each combination is unsustainable.

DAL presents a single key/value interface over all of these transports. Above DAL, neither the query executor nor the optimizer sees NVLink vs PCIe โ€” they see a cache and shuffle service with consistent semantics. Below DAL, the implementation selects the fastest available transport for each data movement: NVLink for GPU-to-GPU within a node, InfiniBand (or NVIDIA Infinity Fabric) for cross-node GPU transfers, PCIe for CPU-GPU on the same machine, Ethernet as the fallback. This is what makes the multi-GPU scaling numbers achievable without hand-coded network topology awareness in the query executor.

The GPU Execution Pipeline

Inside the GPU coprocessor, CoddSpeed's execution model evolved meaningfully from the original TQP prototype. The production system retains the tensor-operation framing โ€” relational operators compiled to GPU-executable tensor computations โ€” but replaces PyTorch's generic kernels with custom CUDA implementations for the hot path.

What gets accelerated

The GPU execution engine handles the operators that dominate analytical query runtimes:

  • Hash joins โ€” build and probe phases both on GPU. The build side constructs a hash table in GPU HBM; the probe side streams through it. GPU parallelism means millions of probe lookups execute simultaneously across thousands of CUDA cores.
  • Group-by aggregations โ€” implemented using GPU reduction primitives and parallel sort/hash strategies. For high-cardinality aggregations, the parallel hash approach dominates; for low-cardinality, sort-based reduction is more cache-friendly.
  • Table scans with predicate pushdown โ€” columnar data lands in GPU memory and predicate evaluation runs as an element-wise vectorized operation across the entire column simultaneously. The effective throughput matches the HBM bandwidth ceiling.
  • Arithmetic projections โ€” column-level arithmetic, string operations, and type casts, all implemented as fused element-wise CUDA kernels to minimize intermediate materializations.

Operators that don't fit โ€” certain recursive CTEs, window functions with complex framing, user-defined functions โ€” fall back to the CPU path transparently. The optimizer knows at plan time which operators are GPU-eligible via the CAL capability surface, so fallback is deterministic, not a runtime surprise.

From PyTorch to custom CUDA

This transition is one of the paper's more candid engineering sections. The TQP prototype used PyTorch's tensor API throughout โ€” which meant depending on PyTorch's generic GPU kernels. Generic kernels are correct but not optimal: they handle arbitrary data types and shapes, so they carry overhead that a purpose-built SQL kernel for, say, 64-bit integer hash joins on a known schema, doesn't need.

In production, CoddSpeed replaced the hot-path operators with custom CUDA kernels tuned to the specific shapes and type profiles of SQL analytics. LibTorch (the C++ PyTorch runtime) is retained for GPU memory management โ€” allocation, deallocation, the memory pool, and synchronization primitives โ€” but the compute kernels for joins, aggregations, and scans are custom-written. The memory management layer handles the tricky parts: tracking which tensors live on CPU vs GPU, managing the limited GPU HBM budget across concurrent queries, and orchestrating explicit PCIe transfers when data needs to move.

Fragment-level, not operator-level offload

This is the design decision the paper emphasizes most strongly: push large query fragments to the GPU, not individual operators. The naive approach to GPU query acceleration โ€” offload each operator independently, round-trip data through CPU memory between operators โ€” destroys the performance advantage. Each PCIe round trip costs hundreds of milliseconds on a full-scan workload, and you need many operators to make a query.

CoddSpeed instead evaluates the whole sub-plan fragment โ€” potentially a multi-operator pipeline spanning a scan, multiple filters, a join, and an aggregation โ€” on the GPU without touching CPU memory in between. Intermediate results stay in GPU HBM across operator boundaries. Data moves to the CPU only when the fragment result is complete and needs to be returned to the distributed query coordinator. This is the key to hitting 10โ€“30ร— speedups rather than 2โ€“3ร—.

sequenceDiagram
    participant OPT as Optimizer
    participant CAL as CAL Layer
    participant GPU as GPU Coprocessor (HBM)
    participant DAL as DAL (shuffle)
    participant CPU as CPU Host

    OPT->>CAL: Route eligible fragment (scan+filter+join+agg)
    CAL->>GPU: Serialize Substrate plan + columnar data (zero-copy if in-mem)
    Note over GPU: Scan โ†’ predicate eval โ†’ hash build
โ†’ hash probe โ†’ group-by agg
All intermediate data stays in HBM GPU->>DAL: Fragment result (HBM โ†’ shuffle buffer) DAL->>CPU: Result delivery via PCIe / NVLink CPU-->>OPT: Merge with other fragments Note over CPU: Ineligible ops (CTEs, UDFs)
run on CPU concurrently

Fragment-level offload: the multi-operator pipeline (scan โ†’ filter โ†’ join โ†’ aggregate) executes entirely within GPU HBM without any CPU round-trip between operators. Intermediate results stay in GPU memory across operator boundaries; only the final fragment result crosses the PCIe/NVLink bus. This is what makes the 10โ€“30ร— numbers achievable โ€” the alternative, operator-level round-tripping, would reduce GPU gains to near zero.

The Optimizer's Role: Routing Without Rewriting

A key claim in both the paper and the product announcement is that CoddSpeed requires no query rewrites. The routing decision is entirely inside Fabric's distributed optimizer. Here is what the optimizer actually does:

  • Capability lookup via CAL: before optimization, the optimizer queries CAL for the current GPU capability surface โ€” which operators, type combinations, and join sizes can run on GPU.
  • Cost-based fragment routing: the optimizer treats GPU execution as an alternative execution strategy for eligible sub-plans, with an estimated cost model that accounts for data movement overhead, GPU memory pressure, and expected speedup.
  • Fragment boundary selection: the optimizer selects fragment boundaries to maximize the amount of compute that stays on GPU before a result must cross the bus. A join feeding directly into an aggregation should stay in one GPU fragment, not two.
  • Plan caching: GPU-routed plans are cached. Repeated queries with the same plan shape don't re-run the routing logic.

From the user's perspective this is invisible. The same SELECT statement that ran on CPU runs on GPU, with results identical to floating-point precision constraints. No hints, no annotations, no schema changes.

Benchmark Results

The paper reports a well-structured benchmark suite. The numbers are worth reading carefully because they answer different questions depending on hardware configuration.

Configuration Workload Speedup (warm) Speedup (cold)
Single A100TPC-H SF=100 (100 GB)7.9ร—4.7ร—
Single H100 vs. A100Same workload+50% over A100โ€”
Single GPU (internal)Customer workload, 300 GB14.9ร—9.7ร—
8ร— H100 + NVLink (DAL)TPC-H large scale27.1ร—โ€”
16ร— H100, 2-nodeTPC-H large scale30.4ร—โ€”
vs. 3 cloud DWs, 64 users100 GB, 22-query setup to 7ร—โ€”

A few things worth unpacking here:

Warm vs cold matters a lot. The "warm" numbers assume data is already in GPU HBM or at least in host memory, so the PCIe transfer cost is amortized. The cold numbers โ€” where data must be read from storage, moved through CPU memory, and then transferred to GPU โ€” still show substantial speedups (4.7ร— on TPC-H SF=100) but are roughly half the warm numbers. For interactive BI workloads where the same tables are queried repeatedly, the warm numbers are more representative. For one-off batch queries, the cold numbers apply.

NVLink changes the multi-GPU story. The jump from single-GPU (7.9ร—) to 8-GPU with NVLink (27.1ร—) is near-linear scaling, which is remarkable. Without NVLink โ€” GPU-to-GPU data moving through PCIe and CPU memory โ€” multi-GPU scaling degrades badly because shuffle overhead dominates. DAL's NVLink-aware routing is what makes the 27ร— number real rather than theoretical.

The 7ร— vs competitors number is concurrency-gated. At single-user concurrency, the CoddSpeed advantage over CPU-based cloud warehouses is ~3โ€“4ร—. At 64-user concurrency, it reaches 7ร— โ€” because CPU-based systems degrade under parallel workloads (shared memory bandwidth becomes the bottleneck and queries start queuing), while the GPU's massive parallelism means it degrades less. This is the most commercially relevant number, as BI workloads almost always involve concurrent users.

The warm/cold asymmetry is also an architecture hint. Teams that run the same analytical workloads repeatedly โ€” daily BI dashboards, regularly scheduled aggregations โ€” will see closer to the 15ร— numbers because their data is effectively pre-staged. Teams running purely ad-hoc queries over cold data get 5โ€“10ร—. Both are meaningful; they're just different use cases with different expectations to set.

Production Lessons: What the Paper Is Unusually Honest About

SIGMOD industry papers often read as polished success stories. CoddSpeed's paper is atypically candid about what was hard. Four lessons stand out.

1. Don't replace the existing system

Early internal prototypes explored running the entire query engine on GPU โ€” replacing the existing distributed executor. This failed. The existing engine handles too many things the GPU cannot: complex DDL, error handling, management operations, the long tail of SQL constructs. Integrating CoddSpeed alongside the existing engine (via CAL's fallback mechanism) let them ship with incomplete GPU coverage and improve coverage incrementally, without a flag-day cutover.

2. Fragment granularity over operator granularity

Already discussed above, but the paper quantifies the point: operator-level offload recovers a fraction of the potential speedup because transfer overhead amortizes poorly at single-operator granularity. The decision to make fragments the unit of offload โ€” requiring a non-trivial optimizer change โ€” was necessary, not optional.

3. Abstract both compute and network simultaneously

The DAL layer was not in the original design. The team initially assumed they could use the existing Fabric shuffle infrastructure for GPU-to-GPU data movement. That assumption broke at multi-node scale when NVLink-capable nodes appeared in the hardware fleet โ€” the existing shuffle code had no path for GPU-native data transfer and was forced to round-trip through CPU memory. Building DAL before scaling to multi-GPU was retroactively identified as the correct sequencing.

4. PyTorch is the right prototype platform, not the right production platform

PyTorch's generic kernels and Python overhead are acceptable in a research prototype where correctness matters more than throughput. In production, they become the bottleneck. The custom CUDA rewrite of hot-path operators added months of engineering effort but was responsible for a significant fraction of the final speedup numbers. The paper frames this as expected: research prototypes optimize for flexibility; production systems optimize for throughput on the specific workloads they actually run.

What It Means for the Fabric Data Warehouse User

CoddSpeed entered Early Access Preview in July 2026, rolling out across four regions with more to follow. From a practitioner's perspective, the important practical facts are:

  • No changes required. Existing Fabric Data Warehouse tables, queries, pipelines, and semantic models work without modification. The optimizer decides what runs on GPU.
  • Concurrency is where it shines. If you have a single analyst running occasional heavy queries, the win is real but modest. If you have 20+ users hitting dashboards simultaneously, the GPU's parallel execution model changes the performance profile substantially.
  • Joins and aggregations benefit most. Queries dominated by large hash joins or high-cardinality aggregations over wide tables will see the biggest speedups. Simple point lookups and single-table filters benefit less.
  • Cold start matters for latency SLAs. If your workload cares about first-query latency on cold data, the 4โ€“5ร— cold number is the right expectation. If you care about sustained throughput on warm data, plan for 10โ€“15ร—.
  • AMD and Intel accelerators are planned. The CAL abstraction was designed to host multiple accelerator backends. NVIDIA H100/A100 is the first GA target; AMD MI300X and Intel Gaudi support are on the roadmap. This is not a one-GPU bet.

Broader Significance: Research-to-Production as a Systems Discipline

The reason CoddSpeed won Best Industry Paper at SIGMOD 2026 is not the speedup numbers โ€” 10โ€“30ร— GPU speedups have appeared in GPU database research for a decade. The reason is the architecture of the transition: what organizational and technical decisions allowed a research prototype running on PyTorch to become a production feature in a cloud data warehouse used by hundreds of thousands of customers, without rewriting the query frontend, without requiring schema changes, and with a graceful fallback for every SQL construct the GPU can't yet handle.

That transition โ€” TQP research paper (2022) โ†’ internal prototype โ†’ CoddSpeed production (2026) โ€” took four years and grew from a small research team to 60+ engineers. The paper's candid accounting of what had to change along the way (generic kernels โ†’ custom CUDA, operator-level โ†’ fragment-level, no shuffle abstraction โ†’ DAL, replace the engine โ†’ coexist with it) is more useful to the systems community than any individual performance number.

For Fabric users the takeaway is simpler: the GPU acceleration is real, it's architecturally sound, it requires nothing from you, and the benchmark numbers were produced on the same hardware you'll use in production. The 7ร— concurrency advantage over the field is the number to watch.

Source paper: Interlandi, Bruno, Haynes, Curino, Sen et al., "CoddSpeed: Hardware Accelerated Query Processing in Microsoft Fabric." SIGMOD 2026 Industrial Track, Best Paper. arXiv:2506.09226.