← Back to Blog

Redshift Physical Schema Design: Distribution Keys, Sort Keys, Skew, and the Lessons That Cost Us Weeks

Redshift is one of those systems where the defaults will let you build something that appears to work — queries run, dashboards load — until the data grows to a real scale and you discover that your schema was quietly working against you the whole time. Amazon has done an admirable job adding automatic table optimization (ATO) in recent years, but understanding the physical design decisions Redshift makes (or that you need to make) is still essential if you want predictable, cost-efficient performance.

This is a post-mortem and best-practices article. We built a production Redshift-backed analytics platform over 18 months, hit several significant performance walls, and spent weeks diagnosing and redesigning tables. What follows is everything we learned the hard way.

Redshift's Physical Data Model: The Mental Model You Need

Redshift is a columnar, MPP database. Data is stored column-by-column (allowing queries to read only needed columns), compressed per-column, and distributed across multiple compute nodes (slices). A 2-node ra3.4xlarge cluster has 8 slices — 4 per node. Each row of every table lives on exactly one slice, determined by the distribution style and key.

The query execution model: for most queries, all slices work in parallel on their local data. Aggregations and joins are performed in parallel per slice, then results are combined on the leader node. The performance goal: minimize the amount of data that must move between slices during a query (called a broadcast or redistribution), and maximize work done locally on each slice.

Distribution Styles: The Most Important Physical Decision

Redshift offers four distribution styles:

KEY Distribution

Rows with the same DISTKEY value are stored on the same slice. If you join two large tables on a column, and both tables have that column as their DISTKEY, the join can happen entirely on each slice without any data movement between nodes — a colocated join. This is the best possible join performance in Redshift.

The golden rule: if Table A and Table B frequently join on customer_id, both should use DISTKEY(customer_id). A query joining 100M-row orders and 50M-row sessions both distributed on customer_id runs entirely on local data — fast and no inter-node traffic.

EVEN Distribution

Rows are distributed round-robin across slices, ensuring even data distribution regardless of data content. Good for tables that don't have a clear join key, or that you never join to other large tables. Queries on EVEN-distributed tables require data redistribution during joins — Redshift broadcasts the smaller table to all slices, which works fine as long as one of the join tables is small.

ALL Distribution

A full copy of the table exists on every slice. Great for small dimension tables (under 1–2M rows) that are frequently joined to large fact tables. If the dimension is on ALL, the join needs no data movement — each slice's portion of the fact table can join directly with its local copy of the dimension. Poor choice for large tables — the storage cost multiplies by the number of slices.

AUTO Distribution

Amazon's automatic table optimization. For small tables, Redshift starts with ALL distribution. As tables grow, it automatically migrates to EVEN distribution. It's a reasonable default but doesn't choose KEY distribution for you — you need to specify that explicitly if you want colocated joins.

Skew: The Silent Performance Killer

Skew happens when one or more slices hold significantly more data than others due to a poorly chosen DISTKEY. A query on a skewed table is only as fast as the slowest (most loaded) slice — the other slices finish early and wait.

The skew trap we fell into: We chose status as a DISTKEY on our orders table because most queries filtered by status. There were 5 statuses: 'COMPLETED' held 94% of rows. Result: one or two slices were 20x more loaded than others. All queries touching the orders table were bottlenecked on those slices.

Diagnosing skew:

-- Check skew by looking at row distribution across slices
SELECT
    trim(name) AS table_name,
    slice,
    COUNT(*) AS row_count,
    AVG(COUNT(*)) OVER (PARTITION BY name) AS avg_rows,
    ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (PARTITION BY name), 2) AS pct
FROM stv_tbl_perm
JOIN pg_class ON pg_class.oid = stv_tbl_perm.id
WHERE name = 'orders'
GROUP BY name, slice
ORDER BY name, slice;

-- Or use SVV_TABLE_INFO for quick skew check
SELECT
    table_name,
    diststyle,
    skew_rows,
    skew_sortkey1
FROM svv_table_info
WHERE table_name = 'orders';

A skew_rows value above 1.05 indicates some skew; above 1.5 is problematic; above 4.0 is a serious performance issue.

The DISTKEY selection test: Before choosing a DISTKEY, run SELECT distkey_column, COUNT(*) FROM table GROUP BY 1 ORDER BY 2 DESC LIMIT 20. If the top 5 values account for >50% of rows, that column will create skew. Good DISTKEY candidates have high cardinality and roughly uniform distribution — user IDs, order IDs, customer IDs. Bad candidates: status columns, boolean flags, small-cardinality categorical columns.

Sort Keys: Zone Maps and the Block-Skip Magic

Sort keys determine the physical sort order of rows on disk within each slice. Redshift maintains per-block statistics called zone maps: for each 1 MB data block, Redshift stores the min and max value of the sort key column. When a query has a WHERE clause on the sort key, Redshift skips entire blocks where the queried value falls outside the block's min-max range. For a table sorted by event_date, a query filtering for last 7 days skips all blocks outside that range — dramatically reducing I/O.

Compound vs Interleaved Sort Keys

Redshift supports two sort key types:

Compound sort key (SORTKEY(col1, col2, col3)): Works like a composite index — zone maps are most effective when filtering on the leading column(s). Filtering on col2 alone (without col1) gets poor zone map benefit. Use compound sort keys when your queries consistently filter on the same leading column (e.g., always filter by date first).

Interleaved sort key (INTERLEAVED SORTKEY(col1, col2, col3)): Provides roughly equal zone map benefit for any of the sort key columns individually. More flexible for diverse query patterns, but has higher VACUUM cost — interleaved keys must be rebuilt during VACUUM REINDEX, which is much more expensive than standard VACUUM.

Practical guidance: In our experience, Compound sort keys are the right choice for 90% of tables. If 80% of queries filter by event_date, sort by event_date and the zone map benefit is excellent. Interleaved sort keys sound appealing for multi-pattern access, but the VACUUM REINDEX cost is high enough that we stopped using them entirely after one table with 2B rows took 6 hours to VACUUM REINDEX.

The VACUUM Problem

Unlike PostgreSQL (which Redshift is based on), Redshift does not auto-vacuum by default (though Amazon added automatic vacuum in 2019 for clusters running in maintenance windows). The problem: Redshift marks deleted and updated rows as ghost rows rather than immediately reclaiming space. Over time, tables accumulate ghost rows that are included in scans — you're scanning rows that contribute nothing to the result.

-- Find tables with high ghost row percentage (unsorted / deleted rows)
SELECT
    trim(pgn.nspname) AS schema,
    trim(a.name) AS table_name,
    a.rows AS live_rows,
    a.rows_pre_filter AS total_rows_scanned,
    ROUND(
      (a.rows_pre_filter - a.rows) * 100.0 / NULLIF(a.rows_pre_filter, 0),
      2
    ) AS ghost_row_pct,
    a.unsorted_pct
FROM svv_table_info a
JOIN pg_namespace pgn ON pgn.oid = a.schema_id
WHERE (a.rows_pre_filter - a.rows) > 0
  AND a.rows > 100000
ORDER BY ghost_row_pct DESC;

Tables with ghost_row_pct above 20% benefit significantly from VACUUM. For INSERT-only workloads (new data appended, nothing deleted), VACUUM is less necessary. For tables with frequent UPDATEs or DELETEs (SCD Type 2 updates, CDC-applied changes), VACUUM regularly.

The Physical Schema We Ended Up With

After all the lessons, here's the schema pattern that works well for our largest fact table:

CREATE TABLE fact_orders (
    order_id      BIGINT NOT NULL,
    customer_id   BIGINT NOT NULL,
    product_id    INT NOT NULL,
    order_date    DATE NOT NULL,
    status        VARCHAR(20) NOT NULL ENCODE zstd,
    amount_cents  BIGINT NOT NULL ENCODE az64,
    region        VARCHAR(50) ENCODE zstd,
    -- ...other columns...
    CONSTRAINT pk_fact_orders PRIMARY KEY (order_id)
)
DISTKEY(customer_id)            -- colocated with customer dimension
COMPOUND SORTKEY(order_date, customer_id)  -- date first (most queries), then customer
ENCODE AUTO;                    -- let Redshift choose compression per column

-- The companion dimension table
CREATE TABLE dim_customer (
    customer_id   BIGINT NOT NULL,
    customer_name VARCHAR(200) ENCODE zstd,
    segment       VARCHAR(50) ENCODE zstd,
    country       CHAR(2) ENCODE bytedict,
    CONSTRAINT pk_dim_customer PRIMARY KEY (customer_id)
)
DISTSTYLE KEY DISTKEY(customer_id)   -- colocated with fact table
SORTKEY(customer_id)
ENCODE AUTO;

Compression: Use ENCODE AUTO (But Understand the Alternatives)

Redshift offers per-column compression: RAW, BYTEDICT, DELTA, LZO, ZSTD, AZ64, RUNLENGTH. The right encoding depends on the column's data distribution:

  • AZ64 — Amazon's proprietary numeric encoding; best for numeric (INT, BIGINT, DECIMAL, DATE) columns. Better compression and faster decompression than LZO for numerics.
  • ZSTD — Good general-purpose for variable-length text with mixed distributions
  • BYTEDICT — Excellent for low-cardinality categoricals (status, country, type) — stores a dictionary instead of repeated strings
  • DELTA/DELTA32K — For monotonically increasing sequences (timestamps, IDs in order) — stores differences rather than values

In practice: specify ENCODE AUTO at table creation and run ANALYZE COMPRESSION table_name after loading a representative data sample. Redshift samples the data and recommends optimal encodings. Apply the recommendations with an ALTER TABLE or CREATE TABLE AS SELECT.

What We'd Do Differently

If we rebuilt the warehouse from scratch, knowing what we know now:

  1. Choose DISTKEY based on join patterns from day one. Map out all major joins before schema design. The most frequent large-to-large table join determines the DISTKEY. Don't change your mind later without a full table rebuild.
  2. Use COMPOUND sort keys with date as the leading column for all fact tables with time-based filtering. Interleaved sort keys are theoretically attractive but the VACUUM REINDEX overhead is real.
  3. Test for skew before launch. Generate a representative data volume, load it, and check skew_rows in SVV_TABLE_INFO. Fix it before you're operating at 50B rows.
  4. Plan the VACUUM schedule explicitly. For tables with regular UPDATEs, schedule VACUUM in maintenance windows. For INSERT-only tables, occasional ANALYZE is sufficient.
  5. Set ENCODE AUTO everywhere. Then run ANALYZE COMPRESSION after initial load and apply recommendations. The default encoding when you forget to specify is RAW, which is uncompressed — an easy way to spend 3x more on storage than necessary.

Redshift is a genuinely excellent analytical database for the right workloads. The physical design decisions matter more than in most modern systems because Redshift doesn't hide them from you — it gives you the knobs and expects you to turn them correctly. That's a double-edged sword: you can optimize aggressively, but you can also build yourself a slow-burning performance problem that doesn't show up until 100 million rows later.