Kafka Internals: Partitions, Replication, and Why Your Consumer Group Is Stuck

Apache Kafka is the nervous system of modern data architectures. Event streaming platforms built on Kafka handle everything from clickstream data at billions of events per day to financial transaction processing to real-time ML feature pipelines. Yet the vast majority of teams using Kafka treat it as a black box — messages go in, messages come out, and debugging happens by restarting things until they work.

Understanding Kafka's internals changes that. Once you know what a partition actually is, why the ISR matters, how consumer group rebalancing works, and what "exactly-once" really means under the hood, you can diagnose production problems in minutes instead of hours. This article covers the core Kafka architecture with enough depth to reason about production behavior.

The Fundamental Abstraction: The Log

Everything in Kafka is built on a single abstraction: the commit log. A Kafka topic partition is an immutable, ordered, append-only sequence of records stored on disk. Each record gets a sequential integer offset. Producers append to the end; consumers read forward from any offset they choose.

This design choice — an append-only log rather than a queue with delete-on-consume — is what makes Kafka different from traditional message queues (RabbitMQ, SQS). Records stay in the log for a configurable retention period (default 7 days or by size limit). Multiple independent consumer groups can each maintain their own offset pointer and replay from any position. A consumer group that falls behind doesn't lose messages — it just reads older records when it catches up. This "replay" capability is what makes Kafka suitable as an event store, not just a transport layer.

Topics, Partitions, and Segments

A topic is a logical category. A partition is the physical unit — an ordered log on a single broker's disk. Topics have one or more partitions; more partitions = more parallelism for both producers and consumers, up to the limit of broker disk I/O and network throughput.

Each partition is stored as a directory of segment files on the broker filesystem. The active segment receives new writes; older segments are read-only and eligible for retention-based deletion. Segment boundaries matter for compaction and cleanup operations — Kafka's log cleaner works at the segment level.

graph TD
    subgraph Topic["orders-topic (3 partitions)"]
        subgraph P0["Partition 0 (Leader: Broker 1)"]
            S0["Segment 0\noffsets 0-999\n(sealed)"]
            S1["Segment 1\noffsets 1000-1999\n(sealed)"]
            S2["Segment 2\noffsets 2000+\n(active)"]
        end
        subgraph P1["Partition 1 (Leader: Broker 2)"]
            S3["Segment 0\noffsets 0-1499"]
            S4["Segment 1\noffsets 1500+\n(active)"]
        end
        subgraph P2["Partition 2 (Leader: Broker 3)"]
            S5["Segment 0\noffsets 0+\n(active)"]
        end
    end

    subgraph Replicas["Replica distribution (RF=2)"]
        B1["Broker 1\nLeader P0, Follower P1"]
        B2["Broker 2\nLeader P1, Follower P2"]
        B3["Broker 3\nLeader P2, Follower P0"]
    end

A topic with 3 partitions and replication factor 2. Each partition has one leader (handles reads and writes) and one follower (replicates from leader). Partition leaders are distributed across brokers to balance load. If Broker 1 goes down, the follower on Broker 3 is elected as the new P0 leader.

Replication: ISR and the Replication Lag Problem

Kafka replicates each partition across multiple brokers. The replication factor (typically 3 for production) determines how many copies exist. For each partition, one broker is the leader (handles all reads and writes) and the rest are followers (passively replicate from the leader).

The In-Sync Replica (ISR) list is the critical concept. A replica is "in sync" if it has replicated all messages within a configurable lag threshold (replica.lag.time.max.ms, default 30s). The ISR list starts with all replicas; a replica drops out if it falls too far behind. The leader only acknowledges a write to the producer once all ISR members have replicated it (when acks=all).

This is why acks=all + min.insync.replicas=2 is the production durability configuration: a write is only acknowledged once at least 2 replicas have it on disk. If the leader crashes immediately after acknowledging, at least one other broker has the data and can be elected as the new leader.

from confluent_kafka import Producer

producer = Producer({
    'bootstrap.servers': 'broker1:9092,broker2:9092',
    'acks':              'all',           # wait for all ISR replicas
    'enable.idempotence': True,           # exactly-once at producer level
    'retries':           10,
    'retry.backoff.ms':  100,
    'compression.type':  'lz4',           # good default for throughput
    'linger.ms':         5,               # batch up to 5ms for throughput
    'batch.size':        65536,           # 64KB batch size
})

def delivery_report(err, msg):
    if err:
        print(f'Delivery failed: {err}')
    else:
        print(f'Delivered to {msg.topic()}[{msg.partition()}] @ offset {msg.offset()}')

producer.produce(
    topic='orders',
    key='customer-123',              # key determines partition assignment
    value='{"order_id": "abc", ...}',
    callback=delivery_report
)
producer.flush()  # wait for all in-flight messages to be delivered

Leader Election

When a partition leader fails, Kafka's controller (now a Raft-based KRaft cluster in Kafka 3.x, previously ZooKeeper) elects a new leader from the ISR. The election is fast (<30 seconds with well-tuned configs) but not instantaneous — during the window between leader failure and new leader election, the partition is unavailable. unclean.leader.election.enable=false (the default) prevents election of an out-of-sync replica, trading availability for consistency.

Consumer Groups: Parallelism and the Rebalance Problem

A consumer group is a set of consumers that cooperatively consume a topic. Kafka assigns partitions to consumers: each partition is consumed by exactly one consumer in the group at a time. With 12 partitions and 4 consumers, each consumer handles 3 partitions. This is the parallelism model — adding consumers increases throughput up to the number of partitions.

The most common production problem: consumer group rebalances. A rebalance is triggered whenever: a consumer joins or leaves the group, a consumer crashes or stops sending heartbeats, or a consumer exceeds max.poll.interval.ms (default 5 minutes — the maximum time between poll() calls). During a rebalance, all consumption stops until partitions are reassigned.

Consumer groups get "stuck" (stop making progress) for two common reasons:

Processing time exceeds max.poll.interval.ms: The consumer fetches a batch, processes records slowly (heavy DB writes, external API calls), and takes longer than 5 minutes. Kafka considers the consumer dead and triggers a rebalance. The consumer is removed, partitions are reassigned, and the offsets reset to the last committed position — processing the same records again, causing another timeout, causing another rebalance. Fix: reduce batch size (max.poll.records), increase max.poll.interval.ms, or move slow processing out of the poll loop.
Offset commit lag: The consumer is processing records but not committing offsets. On restart, it replays from the last committed offset, reprocessing potentially thousands of records. Fix: commit offsets more frequently, or use enable.auto.commit=false with explicit manual commits after processing.

from confluent_kafka import Consumer

consumer = Consumer({
    'bootstrap.servers':   'broker1:9092',
    'group.id':            'order-processor-v2',
    'auto.offset.reset':   'earliest',
    'enable.auto.commit':  False,          # manual commit for control
    'max.poll.interval.ms': 300000,        # 5 minutes
    'session.timeout.ms':   10000,         # 10s heartbeat timeout
    'heartbeat.interval.ms': 3000,
})

consumer.subscribe(['orders'])

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            print(f"Consumer error: {msg.error()}")
            continue

        process_order(msg.value())         # your processing logic

        # Commit only after successful processing
        consumer.commit(asynchronous=False)  # synchronous for safety
finally:
    consumer.close()

Exactly-Once Semantics

Kafka's exactly-once semantics (EOS) is one of the most misunderstood features. There are three delivery guarantees:

At-most-once: Fire and forget. Messages may be lost. Never use for data pipelines.
At-least-once: Retry until acknowledged. Messages may be duplicated. The default for most Kafka applications.
Exactly-once: Each message is processed and produced exactly once, even across producer retries and consumer restarts. Requires idempotent producers + transactions.

Idempotent producers (enabled with enable.idempotence=true) assign a sequence number to each batch. If a batch is retried (network failure, broker restart), the broker recognizes the duplicate sequence number and deduplicates — solving the at-least-once problem for producer retries.

Kafka transactions extend this to consume-process-produce pipelines: read from a source topic, transform, write to a sink topic — atomically. Either the read offset commit and the write both happen, or neither does. This is the EOS pattern used by Kafka Streams and Flink for stateful streaming pipelines.

Log Compaction

Standard Kafka topics retain messages by time or size and delete old segments. Log compaction is an alternative: keep only the latest message for each key. A compacted topic becomes an eventually-consistent key-value store — great for materializing changelogs, maintaining current state (CDC events, configuration tables), and keeping topics small without losing current values.

The log cleaner runs in the background, merging segment files and retaining only the latest record per key. The head of the log (most recent segment) is never compacted — only older segments. During active compaction, you may read both old and new values for the same key.

Kafka on Cloud: MSK vs Confluent vs Azure Event Hubs

Platform	What it is	Kafka-compatible?	Best for	Pricing model
Amazon MSK	Managed Apache Kafka on AWS	100% — native Kafka API	AWS teams wanting full Kafka compatibility	Per broker-hour + storage
MSK Serverless	Serverless Kafka on AWS	100% — native Kafka API	Variable workloads, no capacity planning	Per partition-hour + data throughput
Confluent Cloud	Fully managed Kafka + ecosystem	100% + extras (ksqlDB, connectors, Schema Registry)	Teams wanting managed connectors + schema management	Per CKU + data throughput
Azure Event Hubs	Azure managed event streaming with Kafka endpoint	Partial — Kafka protocol supported, not all features	Azure-native teams, existing Event Hubs investment	Per throughput unit + data volume
GCP Pub/Sub	Google's managed messaging (not Kafka)	No — different API, different semantics	GCP-native, simple pub/sub without Kafka complexity	Per message + data volume

Azure Event Hubs' Kafka compatibility is good enough for producers and simple consumers but has gaps: no support for transactions, limited admin API compatibility, no log compaction. If you need full Kafka semantics on Azure, Confluent Cloud on Azure or self-managed Kafka on AKS is more reliable than Event Hubs.

Partition Count: The Decision You Can't Easily Undo

Increasing partition count on an existing topic requires a rebalance of all consumer groups and doesn't automatically redistribute existing data. Start with more partitions than you think you need. The general rule: number of partitions ≥ max expected consumer parallelism × 2. For high-throughput topics, 12–30 partitions is common. For low-throughput topics with ordering requirements, 1–3 partitions per logical key space.

The ordering trap: Kafka only guarantees order within a single partition. If you need strict ordering for a set of events (all events for the same user_id must be processed in order), they must all go to the same partition — achieved by setting the message key to user_id. If you use random partitioning or change your key strategy, ordering guarantees break immediately.

Kafka's power comes from its simplicity at the core — an append-only log, a replication protocol, and offset management — combined with the flexibility to compose these primitives into exactly the streaming architecture you need. The problems that look mysterious (stuck consumers, rebalance storms, mysterious duplicates) almost always trace back to these fundamentals. Know the log, know the ISR, know the rebalance trigger conditions, and Kafka production operations become predictable.