AWS Glue Deep Dive: Crawlers, Job Types, Iceberg Integration, and the Cost Traps

AWS Glue is both simpler than people expect and more complicated than the documentation makes it look. Simpler, because at its core it's a serverless Spark service with a metadata catalog bolted on. More complicated, because the three different job types (Glue ETL, Glue Streaming, Glue for Ray), the DynamicFrame abstraction layered on top of Spark's DataFrame API, and the surprisingly punishing cost model when used carelessly make it easy to do things that look correct but cost 10x what they should.

This article covers the Glue Data Catalog and Crawlers (the underappreciated part), the three job execution environments, when to use DynamicFrames vs DataFrames, Iceberg integration, and the cost traps that bite teams once they scale.

The Glue Data Catalog: More Than a Metastore

The Glue Data Catalog is a managed Apache Hive-compatible metastore that's shared across most AWS analytics services: Athena, EMR, Redshift Spectrum, Lake Formation, and Glue ETL jobs. When you register a table in the Glue Catalog, you can query it from Athena, run Spark jobs against it in Glue or EMR, and enforce Lake Formation access policies — all using the same catalog entry. This cross-service sharing is the Catalog's primary value.

The catalog has a hierarchical structure: Database → Tables → Partitions. Each table entry stores schema, S3 location, file format (Parquet, ORC, CSV, JSON), and partition scheme. For Iceberg tables, the Catalog stores the table's Iceberg metadata location (the path to the metadata JSON), not the underlying data files directly.

Crawlers: Useful But Overused

Glue Crawlers automatically discover schema from S3 files and register/update tables in the Catalog. For one-time schema discovery or when you genuinely don't control the data producer, Crawlers are useful. For managed data pipelines where you control the schema, Crawlers are unnecessary overhead — you know the schema, just define the table directly.
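Defining the table directly can be a single Catalog API call. The sketch below assumes boto3 and a Parquet table partitioned by a hypothetical order_date column; the database, table, column names, and S3 path are illustrative and do not come from a real pipeline.

```python
def build_orders_table_input(location: str) -> dict:
    """Build a TableInput dict for glue.create_table (Parquet, one partition key).

    All names here (orders, order_id, customer_id, total, order_date) are
    hypothetical placeholders for your own schema.
    """
    return {
        "Name": "orders",
        "TableType": "EXTERNAL_TABLE",
        "Parameters": {"classification": "parquet"},
        "PartitionKeys": [{"Name": "order_date", "Type": "string"}],
        "StorageDescriptor": {
            "Location": location,
            "Columns": [
                {"Name": "order_id", "Type": "string"},
                {"Name": "customer_id", "Type": "string"},
                {"Name": "total", "Type": "double"},
            ],
            "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
            "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
            "SerdeInfo": {
                "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
            },
        },
    }


def register_orders_table(database: str, location: str) -> None:
    """Create the table in the Glue Data Catalog (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above has no AWS dependency

    glue = boto3.client("glue")
    glue.create_table(
        DatabaseName=database,
        TableInput=build_orders_table_input(location),
    )
```

Calling register_orders_table("raw", "s3://my-bucket/raw/orders/") once, from a deployment script or infrastructure-as-code step, replaces the recurring Crawler run for this table.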

Crawler trap: Crawlers are billed at $0.44/DPU-hour, per second, with a 10-minute minimum per run. Crawling a large S3 bucket with thousands of files costs money every run. Teams that schedule hourly Crawlers on frequently updated tables can easily spend $50–200/month just on schema discovery, for metadata they could maintain directly. Use Crawlers selectively. For incremental partition adds, prefer explicit partition registration (ALTER TABLE ADD PARTITION in Athena, or the Glue partition APIs); MSCK REPAIR TABLE also works but rescans the whole table location, so it slows down as the table grows.
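The monthly figure is easy to reproduce. The sketch below uses the $0.44 rate and 10-minute minimum quoted above; the number of DPUs a crawl consumes is an assumption (it is a parameter here, not a documented constant), and the function name is illustrative.

```python
CRAWLER_PRICE_PER_DPU_HOUR = 0.44  # rate quoted in the text
MIN_BILLED_SECONDS = 10 * 60       # 10-minute minimum per crawl


def monthly_crawler_cost(runs_per_day: int, seconds_per_run: float,
                         dpus: float = 1.0, days: int = 30) -> float:
    """Estimate monthly Crawler spend in USD.

    `dpus` is an assumed value: check the DPU-hours reported for your own
    crawls before relying on the result.
    """
    billed_seconds = max(seconds_per_run, MIN_BILLED_SECONDS)
    dpu_hours = runs_per_day * days * dpus * billed_seconds / 3600
    return round(dpu_hours * CRAWLER_PRICE_PER_DPU_HOUR, 2)


# Hourly crawl that finishes in 2 minutes but is billed for 10
print(monthly_crawler_cost(runs_per_day=24, seconds_per_run=120))
# Same schedule, assuming the crawl uses 4 DPUs
print(monthly_crawler_cost(runs_per_day=24, seconds_per_run=120, dpus=4))
```

Under these assumptions the two cases come to roughly $53 and $211 a month, which brackets the range above. The 10-minute minimum drives the cost: a 2-minute crawl is billed as if it ran five times longer.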

Glue ETL: The Three Job Types

Glue ETL (Spark)

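The primary job type is Spark-based batch processing. You write Python or Scala code that runs on a managed Spark cluster. Key job settings: GlueVersion (4.0 runs Spark 3.3), NumberOfWorkers, WorkerType (G.1X = 4 vCPU / 16 GB, G.2X = 8 vCPU / 32 GB), and MaxRetries. These are properties of the job definition (the AWS CLI spells them --glue-version, --number-of-workers, --worker-type, and --max-retries), not arguments passed to your script.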
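As an illustration, these settings map onto a boto3 create_job call roughly as follows. This is a minimal sketch: the job name, role ARN, and script location are placeholders, and the timeout and retry values are arbitrary choices rather than recommendations.

```python
def build_job_definition(name: str, role_arn: str, script_location: str) -> dict:
    """Build keyword arguments for glue.create_job for a small Spark ETL job."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",            # Spark ETL job type
            "ScriptLocation": script_location,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",             # Spark 3.3
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,             # smallest allowed cluster for G.1X
        "MaxRetries": 1,
        "Timeout": 60,                    # minutes; caps a runaway run
    }


def create_job(definition: dict) -> dict:
    """Register the job with Glue (needs AWS credentials)."""
    import boto3  # imported lazily so the builder above has no AWS dependency

    return boto3.client("glue").create_job(**definition)
```

Setting WorkerType and NumberOfWorkers explicitly sizes the job deliberately, so it does not fall back to the service default.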

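Billing: $0.44/DPU-hour, billed per second with a 1-minute minimum. A G.1X worker is 1 DPU and a G.2X worker is 2 DPUs; the driver runs on one of the workers you request rather than being billed on top. A 5-worker G.1X job is therefore 5 DPUs × $0.44 × (job duration in hours). Cold start: cluster provisioning adds startup latency before your script runs, up to a few minutes.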
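The billing arithmetic fits in a few lines. The sketch below assumes the DPU mapping AWS publishes for these worker types (G.1X = 1 DPU, G.2X = 2 DPUs), the $0.44 rate quoted above, and a 1-minute minimum; the function name is illustrative.

```python
DPUS_PER_WORKER = {"G.1X": 1, "G.2X": 2}  # assumed from AWS worker-type definitions
PRICE_PER_DPU_HOUR = 0.44                 # rate quoted in the text
MIN_BILLED_SECONDS = 60                   # 1-minute minimum


def job_run_cost(worker_type: str, number_of_workers: int,
                 runtime_seconds: float) -> float:
    """Estimate the cost in USD of one Glue Spark job run."""
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS)
    dpus = DPUS_PER_WORKER[worker_type] * number_of_workers
    return dpus * PRICE_PER_DPU_HOUR * billed_seconds / 3600


# 5 G.1X workers for 10 minutes: 5 DPUs x $0.44 x (600 / 3600) hours
print(round(job_run_cost("G.1X", 5, 600), 4))
# A 20-second run is still billed as a full minute
print(round(job_run_cost("G.1X", 2, 20), 4))
```

The first case costs about $0.37 per run. The second shows the minimum at work: very short jobs pay for 60 seconds regardless of actual runtime.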

Glue for Ray

Generally available since 2023, Glue for Ray runs Python workloads on a managed Ray cluster. It is useful for distributed Python tasks that don't need Spark's JVM overhead: ML inference, Python-based data processing, image/audio transformation. Ray jobs use their own worker type (Z.2X) and are priced separately from Spark jobs.

Glue Streaming

Runs a long-lived Spark Structured Streaming job that reads from Kinesis or Kafka. Billed per second. For production streaming workloads at scale, MSK + Flink or Kinesis + Lambda often have better cost/performance characteristics than Glue Streaming, but Glue Streaming has lower operational overhead.

DynamicFrames vs DataFrames

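Glue introduces its own abstraction on top of Spark's DataFrame: the DynamicFrame. The selling point: DynamicFrames handle schema inconsistencies (missing columns, type conflicts) gracefully. Each record is self-describing, a column that arrives with conflicting types is represented as a choice type that you resolve later with resolveChoice(), and records that fail to parse are tracked as error records rather than failing the job. This is useful for messy data with inconsistent schemas.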
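A minimal sketch of that resolution step, assuming a DynamicFrame in which a hypothetical total column arrived as both string and number. The helper name is illustrative; resolveChoice() and errorsCount() are DynamicFrame methods.

```python
def resolve_total_column(dyf):
    """Cast the ambiguous `total` column to double and report error records.

    `dyf` is expected to be an awsglue DynamicFrame; `total` is a
    hypothetical column name used only for illustration.
    """
    # Force every value of `total` to double; values that cannot be cast become null
    resolved = dyf.resolveChoice(specs=[("total", "cast:double")])
    # Number of records Glue has flagged as errors so far
    error_count = resolved.errorsCount()
    return resolved, error_count
```

Checking the error count after resolution lets a job fail fast, or route bad records elsewhere, instead of silently writing partial data.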

The problem: DynamicFrame operations are generally slower than native DataFrame operations on clean data with a stable schema, and calls such as resolveChoice() and relationalize() add further overhead. For the majority of ETL jobs, where your data has a consistent schema, you should convert to a DataFrame immediately and use it throughout:

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from pyspark.context import SparkContext
from pyspark.sql.functions import col

# Build the GlueContext from the SparkContext of the running job
glue_context = GlueContext(SparkContext.getOrCreate())

# Read as DynamicFrame (needed for Glue Catalog integration)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="raw",
    table_name="orders"
)

# Convert to DataFrame immediately for all transformations
df = dyf.toDF()

# All heavy transformation work in native PySpark
df_clean = (
    df
    .filter(col("order_date") >= "2023-01-01")
    .withColumn("total_with_tax", col("total") * 1.1)
    .groupBy("customer_id")
    .agg(...)  # aggregations elided here; supply your own expressions
)

# Convert back only for writing via Glue sink
output_dyf = DynamicFrame.fromDF(df_clean, glue_context, "output")
glue_context.write_dynamic_frame.from_options(
    frame=output_dyf,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/processed/orders/"},
    format="parquet"
)

Iceberg Integration in Glue 4.0

Glue 4.0 (Spark 3.3) has native Apache Iceberg support. You can read and write Iceberg tables stored in S3 with ACID semantics, schema evolution, and time travel — registered in the Glue Catalog as Iceberg tables:

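from pyspark.sql import SparkSession

# "glue_catalog" is an arbitrary catalog name. SparkCatalog is the Spark-facing
# class; catalog-impl selects the Glue Data Catalog as its backing store.
spark = SparkSession.builder \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO") \
    .config("spark.sql.catalog.glue_catalog.warehouse", "s3://my-bucket/iceberg-warehouse/") \
    .getOrCreate()

# Upsert using MERGE INTO (Iceberg). staging_customers must already exist as a
# table or temp view whose columns match the target for SET * / INSERT * to work.
spark.sql("""
    MERGE INTO glue_catalog.analytics.customers AS target
    USING staging_customers AS source
    ON target.customer_id = source.customer_id
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
""")

# Time travel: as-of-timestamp takes epoch milliseconds, not an ISO string
# (1696118400000 = 2023-10-01T00:00:00 UTC)
df = spark.read \
    .format("iceberg") \
    .option("as-of-timestamp", "1696118400000") \
    .load("glue_catalog.analytics.customers")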
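In a Glue job, the Iceberg libraries are loaded only when the job sets the --datalake-formats parameter, and the Spark settings are usually passed as job arguments rather than set in code. The helper below is a sketch of that pattern: the function name is hypothetical, and chaining several settings inside a single --conf value follows the convention shown in AWS's Glue examples.

```python
def iceberg_job_arguments(catalog_name: str, warehouse: str) -> dict:
    """Build Glue job DefaultArguments that enable Iceberg with the Glue Catalog."""
    confs = [
        "spark.sql.extensions="
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        f"spark.sql.catalog.{catalog_name}=org.apache.iceberg.spark.SparkCatalog",
        f"spark.sql.catalog.{catalog_name}.catalog-impl="
        "org.apache.iceberg.aws.glue.GlueCatalog",
        f"spark.sql.catalog.{catalog_name}.io-impl=org.apache.iceberg.aws.s3.S3FileIO",
        f"spark.sql.catalog.{catalog_name}.warehouse={warehouse}",
    ]
    return {
        # Tells Glue to put its bundled Iceberg jars on the classpath
        "--datalake-formats": "iceberg",
        # Glue accepts one --conf key, so further settings are chained in its value
        "--conf": " --conf ".join(confs),
    }


args = iceberg_job_arguments("glue_catalog", "s3://my-bucket/iceberg-warehouse/")
print(args["--conf"])
```

The resulting dictionary can be supplied as DefaultArguments when creating the job, which keeps the script itself free of catalog wiring.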

The Cost Traps

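1. Cold starts on frequent runs: Each Glue job run waits for a new cluster to provision (2–4 minutes), and Glue offers no warm pool that keeps workers alive between job runs. Job bookmarks do not help here: they track which input has already been processed, which avoids reprocessing data but not startup time. For jobs that run frequently (every 15 minutes), the cold start overhead can dominate execution time, so either batch more work into fewer runs or move the workload to a long-running Glue Streaming job.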

2. Capacity not set explicitly: If you set neither MaxCapacity nor WorkerType and NumberOfWorkers, a Spark job defaults to 10 DPUs. A small job processing 1 GB of data doesn't need 10 DPUs; it runs on 2 just fine and, at a similar runtime, costs about a fifth as much.

3. Reading an entire S3 prefix without pushdown: Glue uses the Catalog's partition metadata to decide which S3 files to list and read. If your table isn't partitioned, or your job doesn't filter on partition columns at read time (for example with push_down_predicate), Glue reads all files. On a large table (10 TB), that's expensive even if you only need 1% of the data.

4. Using Glue for what Step Functions + Lambda does better: Short-duration data processing tasks (under 2 minutes of actual computation) are often cheaper in Lambda orchestrated by Step Functions (billed per millisecond, with cold starts avoidable through provisioned concurrency) or in EMR Serverless (which can start faster when pre-initialized capacity is configured, and bills per vCPU and GB of memory) than in Glue ETL jobs with their 1-minute minimum billing and cluster provisioning overhead.
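For trap 3, partition pruning can be requested when the table is read. The sketch below assumes the table is partitioned by a hypothetical string column order_date in yyyy-MM-dd format; the helper names are illustrative, while push_down_predicate is the Glue option that applies the filter to partition metadata before any files are listed.

```python
from datetime import date, timedelta


def partition_predicate(start: date, end: date) -> str:
    """Build a predicate over the hypothetical `order_date` partition column."""
    return (
        f"order_date >= '{start.isoformat()}' "
        f"AND order_date <= '{end.isoformat()}'"
    )


def read_recent_orders(glue_context, days: int = 7, today: date = None):
    """Read only the last `days` of partitions from raw.orders.

    `glue_context` is expected to be an awsglue GlueContext.
    """
    today = today or date.today()
    predicate = partition_predicate(today - timedelta(days=days), today)
    # Partitions that fail the predicate are never listed or read from S3
    return glue_context.create_dynamic_frame.from_catalog(
        database="raw",
        table_name="orders",
        push_down_predicate=predicate,
    )


print(partition_predicate(date(2023, 10, 1), date(2023, 10, 7)))
```

A filter applied later on the DataFrame still returns correct results, but by then Glue may already have listed and opened every file under the table's prefix.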

Glue is the right choice for AWS-native Spark ETL when you want zero cluster management, want the Glue Catalog as your metadata layer, and can tolerate the cost model. For teams building large-scale, cost-sensitive data platforms on AWS, EMR on EC2 (with Spot) or EMR Serverless often provides better cost control at the expense of more operational involvement.