← Back to Blog

Azure Data Factory Deep Dive: Pipelines, Data Flows, and When to Stop Using ADF

Azure Data Factory is one of those tools that gets dramatically more useful once you understand its execution model — and dramatically less useful once you try to use it for things it wasn't designed for. ADF is a cloud-native orchestration and ELT service. It's excellent at moving data between heterogeneous systems and applying lightweight transformations at scale. It struggles badly when you try to use it as a general-purpose ETL engine for complex business logic.

This article covers the ADF architecture and execution model, the key differences between pipelines (orchestration) and Data Flows (Spark-backed transformation), the integration runtime options, practical cost patterns, and the honest assessment of when ADF is the right tool — and when you should reach for Databricks, dbt, or a custom Python pipeline instead.

ADF Architecture: Three Layers

ADF has three distinct layers that are easy to conflate but have very different properties:

1. Pipelines and Activities: The orchestration layer. A pipeline is a logical grouping of activities that perform a unit of work. Activities can copy data (Copy Activity), run transformations (Data Flow), execute code (Azure Function, Databricks Notebook, HDInsight), or control flow (If Condition, ForEach, Until). Pipelines are defined in JSON and are the primary unit of ADF deployment.

2. Data Flows: The transformation layer. Data Flows are visually designed, code-free transformation graphs that execute on a managed Spark cluster. They're ADF's answer to "how do I do joins, aggregations, and column derivations without writing code?" Each Data Flow compiles to a Spark job that runs on Azure's managed infrastructure.

3. Integration Runtimes (IR): The compute layer. IRs are the actual execution engines. Azure IR (Microsoft-managed) handles cloud-to-cloud data movement. Self-Hosted IR (installed on your VMs/on-prem) handles hybrid scenarios where the source is behind a corporate firewall. Azure-SSIS IR runs legacy SSIS packages in the cloud.

graph TD
    subgraph ADF["Azure Data Factory"]
        P["Pipeline\n(Orchestration + Control Flow)"]
        DF["Data Flow\n(Spark transformation)"]
        CA["Copy Activity\n(Bulk data movement)"]
    end

    subgraph IR["Integration Runtimes"]
        AIR["Azure IR\n(cloud-to-cloud)"]
        SHIR["Self-Hosted IR\n(on-prem / VNet)"]
        SSISIR["Azure-SSIS IR\n(legacy SSIS)"]
    end

    subgraph Sources["Data Sources"]
        ABS["Azure Blob Storage\nADLS Gen2"]
        SQL["Azure SQL\nSynapse"]
        OnPrem["On-Prem SQL\nOracle, SAP"]
        SaaS["Salesforce\nServiceNow, REST APIs"]
    end

    P --> CA
    P --> DF
    CA --> AIR
    DF --> AIR
    CA --> SHIR
    AIR --> ABS
    AIR --> SQL
    AIR --> SaaS
    SHIR --> OnPrem
          

ADF's three-layer architecture: Pipelines orchestrate, Copy Activity moves data, Data Flows transform — all powered by one of three Integration Runtime types.

Copy Activity: ADF's Core Strength

Copy Activity is what ADF does best: bulk data movement between 100+ supported connectors. The performance architecture is straightforward — Copy Activity uses a parallel read/write model where you specify partition options (physical, dynamic range, or column-based) and ADF parallelizes reads. For loading Azure SQL Database → ADLS Gen2 at high throughput, Copy Activity with parallelCopies: 8-16 and appropriate partitioning will outperform custom Python code by a wide margin because it bypasses the Python runtime entirely.

{
  "name": "CopyFromSQLToADLS",
  "type": "Copy",
  "typeProperties": {
    "source": {
      "type": "AzureSqlSource",
      "partitionOption": "DynamicRange",
      "partitionSettings": {
        "partitionColumnName": "order_date",
        "partitionUpperBound": "2023-12-31",
        "partitionLowerBound": "2023-01-01"
      }
    },
    "sink": {
      "type": "ParquetSink",
      "storeSettings": {
        "type": "AzureBlobFSWriteSettings",
        "copyBehavior": "PreserveHierarchy"
      }
    },
    "parallelCopies": 16,
    "enableStaging": false
  }
}

Data Flows: Spark Without the Cluster Management

Data Flows execute on a managed Spark cluster that ADF provisions on demand. You define transformations visually (or via the JSON definition behind the UI). Each Data Flow compiles to a Spark job. The important things to understand:

  • Cold start latency: The first Data Flow execution in a session incurs a 3–5 minute cluster startup time. Subsequent runs in the same session (Time To Live period) reuse the warm cluster. For pipelines that run Data Flows frequently, setting a 10-minute TTL dramatically reduces per-run overhead.
  • Compute billing: Data Flows are billed per vCore-hour on the execution cluster. A default Small cluster (8 vCores) costs ~$0.26/vCore-hour. For a Data Flow that runs for 10 minutes on 8 vCores: $0.26 × 10/60 × 8 = ~$0.35. That's cheap per-run but can add up for hourly batch jobs.
  • What Data Flows are good for: Joins, aggregations, derived columns, pivots, schema drift handling. Complex transformations that would require multiple SQL CTEs are often cleaner to express in a Data Flow graph.
  • What Data Flows are bad for: Row-by-row business logic (Spark doesn't like that), complex Python UDFs (limited support), anything requiring external API calls per record.

Triggers: Scheduling and Event-Driven Patterns

ADF supports four trigger types: Schedule, Tumbling Window, Storage Event, and Custom Event. The one that causes the most confusion is Tumbling Window:

Unlike Schedule triggers (which fire at a point in time and don't track history), Tumbling Window triggers partition time into fixed, non-overlapping windows and guarantee exactly-once execution per window. If your pipeline fails for a window, the trigger retries that specific window without losing track of what ran. For incremental load pipelines where you need to process each time period exactly once — and backfill is possible — Tumbling Window is the right choice. Schedule triggers are for "fire and forget" jobs where you don't need ordered, complete window processing.

The ADF vs Alternatives Decision

Use CaseBest ToolWhy Not ADF
Move data between 100+ connectors (SaaS, on-prem, cloud)✅ ADF Copy Activity
Light transformation + load to ADLS / Synapse✅ ADF Data Flow
Complex multi-table SQL transformationsdbt (via Databricks)Data Flow UI gets unwieldy; SQL is cleaner
Python-heavy data processing (ML features, NLP)Databricks / Azure FunctionsData Flow Spark UDF support is limited
CDC / streaming ingestionAzure Event Hubs + FlinkADF is batch-only
Orchestrating Databricks notebooks / dbt jobs✅ ADF Pipeline activities— (ADF as orchestrator, not transformer)

The right ADF mental model: Use ADF as the orchestrator, not the transformer, for complex pipelines. ADF pipelines calling Databricks Notebook Activity (for Spark/ML), Azure Function Activity (for Python logic), and dbt Cloud tasks — with Copy Activity for the actual data movement legs — is a very common and effective production pattern. ADF's strength is its breadth of connectors and managed scheduling, not its transformation expressiveness.

Common Production Mistakes

1. Debug mode left running: ADF's Data Flow debug mode spins up a Spark cluster and keeps it running (you control the TTL, up to 4 hours). Developers who forget to turn off debug mode accumulate hundreds of dollars in idle cluster cost per month. Set the default TTL to 30 minutes and turn it off when done.

2. Source-side filtering in Copy Activity bypassed: When copying from Azure SQL, always push down filters to the source query (sqlReaderQuery) rather than reading the whole table and filtering in ADF. Reading 500M rows to filter to 10K is 500x more data movement and cost than needed.

3. No parallelism in ForEach activities: By default, ForEach activities execute sequentially. Set isSequential: false and a reasonable batchCount (4–8) to parallelize processing across multiple items. A ForEach over 50 tables running sequentially at 2 minutes each = 100 minutes; in parallel at batch count 8 = ~15 minutes.

4. Not using Managed Virtual Network: By default, Copy Activity runs in the shared Azure IR environment. For sensitive data or when connecting to private endpoints, enable Managed Virtual Network and managed private endpoints. The tradeoff: it adds provisioning latency but eliminates public internet data paths.