Skip to content

Kafka Concepts

On this page, you will:

  • Understand Kafka's core components (topics, partitions, producers, consumers)
  • Learn how consumer groups enable parallel processing
  • Master offsets and ordering guarantees
  • Discover Schema Registry for data contracts
  • Compare Kafka to traditional message queues

Overview

Apache Kafka is a distributed event streaming platform. Unlike traditional message queues that delete messages after consumption, Kafka treats events as a durable, replayable log. This fundamental difference enables powerful capabilities: time travel (replay old events), multiple independent consumers, and guaranteed ordering.

Understanding Kafka's architecture is essential before building streaming pipelines.

┌─────────────────────────────────────────────────────────────────────────┐
│                        KAFKA ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Producers              Kafka Cluster           Consumers               │
│  ─────────              ─────────────           ─────────               │
│                                                                         │
│  ┌──────────┐          ┌──────────────┐        ┌──────────┐            │
│  │ Web App  │─────────▶│ Topic: orders│───────▶│ Snowflake│            │
│  │ (Python) │          │              │        │ Connector│            │
│  └──────────┘          │ Partition 0  │        └──────────┘            │
│                        │ [msg1][msg2] │                                │
│  ┌──────────┐          │              │        ┌──────────┐            │
│  │ Mobile   │─────────▶│ Partition 1  │───────▶│ Analytics│            │
│  │ App      │          │ [msg3][msg4] │        │ Consumer │            │
│  └──────────┘          │              │        └──────────┘            │
│                        │ Partition 2  │                                │
│                        │ [msg5][msg6] │        ┌──────────┐            │
│                        └──────────────┘───────▶│ ML Model │            │
│                                                │ Consumer │            │
│                                                └──────────┘            │
│                                                                         │
│  Multiple producers → One topic → Multiple independent consumers        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Topics: The Event Log

What is a Topic?

A topic is a named category of events. Think of it as an append-only log that stores events in order.

Examples: - order-events — All order-related events (created, shipped, delivered) - user-clicks — Clickstream data from website - sensor-readings — IoT temperature/humidity data

Topics are Durable

Unlike traditional queues where messages disappear after consumption, Kafka topics retain events for a configured period (e.g., 7 days, 30 days, or forever).

Benefits: - Replay — Reprocess old events if consumer logic changes - Multiple consumers — Different applications read the same events independently - Debugging — Inspect historical events to troubleshoot issues

Example:

Topic: order-events (retention: 7 days)

[Day 1] order_created → order_shipped → order_delivered
[Day 2] order_created → order_cancelled
[Day 3] order_created → order_shipped

Day 8: Day 1 events automatically deleted

Topic Configuration

# Create a topic with 3 partitions, 7-day retention
kafka-topics --create \
    --topic order-events \
    --partitions 3 \
    --replication-factor 3 \
    --config retention.ms=604800000  # 7 days in milliseconds

Key settings: - partitions — Number of parallel streams (more = higher throughput) - replication-factor — Copies of data for fault tolerance (3 = survive 2 broker failures) - retention.ms — How long to keep events (7 days = 604800000 ms)

Partitions: Parallelism and Ordering

What is a Partition?

A partition is a subdivision of a topic. Each partition is an ordered, immutable sequence of events.

Why partitions? - Parallelism — Multiple consumers read different partitions simultaneously - Scalability — Distribute load across many brokers - Ordering — Events in the same partition are guaranteed to be in order

Partition Structure

Topic: order-events (3 partitions)

Partition 0:
[offset 0] {"order_id": 101, "status": "created"}
[offset 1] {"order_id": 104, "status": "shipped"}
[offset 2] {"order_id": 107, "status": "delivered"}

Partition 1:
[offset 0] {"order_id": 102, "status": "created"}
[offset 1] {"order_id": 105, "status": "cancelled"}

Partition 2:
[offset 0] {"order_id": 103, "status": "created"}
[offset 1] {"order_id": 106, "status": "shipped"}

Offset — Sequential ID for each message in a partition (starts at 0)

Partition Keys

Producers use a partition key to determine which partition receives an event.

Strategy 1: Key-based partitioning

# All events for order_id 101 go to the same partition
producer.send(
    topic='order-events',
    key=str(order_id).encode('utf-8'),  # Partition key
    value=json.dumps(event).encode('utf-8')
)

Result: All events for order_id=101 are in the same partition, guaranteeing order.

Strategy 2: Round-robin (no key)

# Events distributed evenly across partitions
producer.send(
    topic='order-events',
    value=json.dumps(event).encode('utf-8')  # No key
)

Result: Maximum throughput, but no ordering guarantee across partitions.

Ordering Guarantees

Guaranteed: Events in the same partition are ordered ❌ Not guaranteed: Events across different partitions may arrive out of order

Example:

Order 101 events (same partition key = order_id):
Partition 0: created → shipped → delivered ✓ Ordered

Order 102 and 103 (different partitions):
Partition 1: Order 102 created (timestamp 10:00:01)
Partition 2: Order 103 created (timestamp 10:00:00)

Consumer may see Order 103 before Order 102 ✗ Not globally ordered

Best practice: Use partition keys (e.g., user_id, order_id) when event order matters.

Producers: Writing Events

What is a Producer?

A producer is an application that writes events to a Kafka topic.

Example: Python producer

from confluent_kafka import Producer
import json

# Configure producer
conf = {
    'bootstrap.servers': 'pkc-abc123.eu-west-2.aws.confluent.cloud:9092',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'PLAIN',
    'sasl.username': 'API_KEY',
    'sasl.password': 'API_SECRET'
}

producer = Producer(conf)

# Produce an event
event = {
    'order_id': 101,
    'customer_id': 'cust_456',
    'status': 'created',
    'timestamp': '2026-02-20T10:00:00Z'
}

producer.produce(
    topic='order-events',
    key=str(event['order_id']).encode('utf-8'),  # Partition by order_id
    value=json.dumps(event).encode('utf-8'),
    callback=delivery_report  # Confirm delivery
)

producer.flush()  # Wait for all messages to be sent

Producer Guarantees

At-least-once delivery (default): - Events are retried on failure - May result in duplicates if retry succeeds after timeout

Exactly-once delivery (optional, higher overhead): - Idempotent producer + transactional writes - No duplicates, but slower

Configuration:

conf = {
    'acks': 'all',  # Wait for all replicas to acknowledge
    'retries': 10,  # Retry up to 10 times
    'enable.idempotence': True  # Exactly-once semantics
}

Consumers: Reading Events

What is a Consumer?

A consumer is an application that reads events from a Kafka topic.

Example: Python consumer

from confluent_kafka import Consumer

conf = {
    'bootstrap.servers': 'pkc-abc123.eu-west-2.aws.confluent.cloud:9092',
    'group.id': 'snowflake-sink',  # Consumer group
    'auto.offset.reset': 'earliest'  # Start from beginning
}

consumer = Consumer(conf)
consumer.subscribe(['order-events'])

while True:
    msg = consumer.poll(1.0)  # Poll for messages (1 second timeout)

    if msg is None:
        continue
    if msg.error():
        print(f"Error: {msg.error()}")
        continue

    # Process event
    event = json.loads(msg.value().decode('utf-8'))
    print(f"Received order {event['order_id']}: {event['status']}")

    # Commit offset (mark as processed)
    consumer.commit()

Offsets: Tracking Progress

An offset is the position of a consumer in a partition.

Partition 0: [0][1][2][3][4][5][6][7][8][9]
                        ↑
                  Consumer offset: 4
                  (Next message to read: 5)

Offset management:

Auto-commit (default):

conf = {'enable.auto.commit': True}  # Commit every 5 seconds

Manual commit (more control):

conf = {'enable.auto.commit': False}

# Process message
process_event(msg)

# Commit offset only after successful processing
consumer.commit()

Best practice: Use manual commit to avoid losing messages if consumer crashes mid-processing.

Consumer Groups: Load Balancing

What is a Consumer Group?

A consumer group is a set of consumers that share the work of reading a topic.

How it works: - Each partition is consumed by exactly one consumer in the group - If a consumer fails, its partitions are reassigned to others

Example: 3 Consumers, 3 Partitions

Topic: order-events (3 partitions)

Consumer Group: snowflake-sink

┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│ Consumer 1   │       │ Consumer 2   │       │ Consumer 3   │
│ reads        │       │ reads        │       │ reads        │
│ Partition 0  │       │ Partition 1  │       │ Partition 2  │
└──────────────┘       └──────────────┘       └──────────────┘

Benefits: - Parallel processing — 3 consumers = 3× throughput - Fault tolerance — If Consumer 1 crashes, Partition 0 is reassigned to Consumer 2 or 3

Example: More Consumers Than Partitions

Topic: order-events (3 partitions)

Consumer Group: snowflake-sink (5 consumers)

┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│ Consumer 1   │       │ Consumer 2   │       │ Consumer 3   │
│ Partition 0  │       │ Partition 1  │       │ Partition 2  │
└──────────────┘       └──────────────┘       └──────────────┘

┌──────────────┐       ┌──────────────┐
│ Consumer 4   │       │ Consumer 5   │
│ (idle)       │       │ (idle)       │
└──────────────┘       └──────────────┘

Result: Consumers 4 and 5 are idle. You can't have more active consumers than partitions.

Best practice: Number of consumers ≤ number of partitions for optimal utilisation.

Multiple Consumer Groups

Different applications can read the same topic independently:

Topic: order-events

Consumer Group 1: snowflake-sink (writes to Snowflake)
Consumer Group 2: analytics-engine (real-time aggregations)
Consumer Group 3: fraud-detector (ML model scoring)

Each group maintains its own offsets independently.

Schema Registry: Data Contracts

Why Schema Registry?

Problem: Producers and consumers need to agree on event structure.

Without Schema Registry:

# Producer changes event structure
producer.send({'order_id': 101, 'total': 99.99})  # Old format

producer.send({'order_id': 102, 'total_amount': 120.50})  # New format

# Consumer breaks on new format
event = json.loads(msg)
print(event['total'])  # KeyError: 'total' (field renamed!)

With Schema Registry: - Schemas are versioned and stored centrally - Producers must use a valid schema - Consumers use schema to deserialise correctly - Backward/forward compatibility enforced

Schema Formats

Binary format, compact, schema evolution support

{
  "type": "record",
  "name": "Order",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "customer_id", "type": "string"},
    {"name": "total", "type": "double"},
    {"name": "status", "type": "string"},
    {"name": "created_at", "type": "long", "logicalType": "timestamp-millis"}
  ]
}

Pros: - Compact (binary encoding, ~40% smaller than JSON) - Schema evolution (add fields with defaults without breaking consumers) - Fast serialisation/deserialisation

Cons: - Not human-readable (need tools to inspect) - More complex than JSON

Use when: Production workloads with high throughput

Protobuf

Google's binary format, similar to Avro

syntax = "proto3";

message Order {
  int64 order_id = 1;
  string customer_id = 2;
  double total = 3;
  string status = 4;
  int64 created_at = 5;
}

Pros: - Backward/forward compatible - Code generation for multiple languages - Very fast

Cons: - More boilerplate than Avro - Schema Registry support added later (less mature)

Use when: Already using Protobuf elsewhere, need code generation

JSON Schema

JSON with schema validation

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "order_id": {"type": "integer"},
    "customer_id": {"type": "string"},
    "total": {"type": "number"},
    "status": {"type": "string"},
    "created_at": {"type": "string", "format": "date-time"}
  },
  "required": ["order_id", "customer_id", "total"]
}

Pros: - Human-readable - Familiar to developers - Easy debugging

Cons: - Larger message size (~2-3× larger than Avro) - Slower serialisation - Weaker schema evolution

Use when: Development/debugging, low throughput, human inspection required

Schema Evolution

Backward compatibility: New consumers can read old events

// Old schema (v1)
{
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "total", "type": "double"}
  ]
}

// New schema (v2) - backward compatible
{
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "total", "type": "double"},
    {"name": "currency", "type": "string", "default": "GBP"}  // New field with default
  ]
}

Forward compatibility: Old consumers can read new events (new fields ignored)

Schema Registry enforces these rules, preventing breaking changes.

Kafka Connect: Pre-Built Integrations

What is Kafka Connect?

Kafka Connect is a framework for integrating Kafka with external systems using connectors.

Source connectors — Pull data into Kafka: - Database CDC (Debezium) - Cloud storage (S3, GCS) - Message queues (RabbitMQ, SQS)

Sink connectors — Push data from Kafka: - Snowflake (Snowpipe Streaming) - Elasticsearch - S3 (data lake)

Example: Snowflake Sink Connector

{
  "name": "snowflake-sink",
  "config": {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "topics": "order-events",
    "snowflake.url.name": "your-account.snowflakecomputing.com",
    "snowflake.user.name": "SVC_KAFKA_CONNECTOR",
    "snowflake.private.key": "...",
    "snowflake.database.name": "STREAMING",
    "snowflake.schema.name": "PUBLIC",
    "buffer.count.records": "10000",
    "buffer.flush.time": "60",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter"
  }
}

Benefits: - No custom code needed - Automatic retries and offset management - Scaling handled by Kafka Connect

You'll configure this in page 5: Kafka Connect Snowflake.

Kafka vs Traditional Message Queues

Feature Kafka RabbitMQ/SQS
Durability Events stored for days/weeks Messages deleted after consumption
Replay ✅ Reprocess old events ❌ Once consumed, gone forever
Multiple consumers ✅ Many independent consumer groups ❌ One consumer group only
Ordering ✅ Per-partition ordering ⚠️ Limited (FIFO queues in SQS, complex in RabbitMQ)
Throughput ✅ Millions of events/sec ⚠️ Thousands of messages/sec
Latency ⚠️ Milliseconds (batch-oriented) ✅ Sub-millisecond (message-oriented)
Use case Event streaming, data pipelines Task queues, job processing

When to use Kafka: - High-throughput event streams - Multiple consumers need the same data - Need to replay historical events - Building real-time analytics pipelines

When to use RabbitMQ/SQS: - Task queues (job processing) - Low-latency message delivery - Simple pub/sub without persistence - Don't need replay capability

Summary

You've learned Kafka fundamentals:

  • Topics — Append-only logs that retain events for a configured period
  • Partitions — Enable parallelism and guarantee ordering within a partition
  • Producers — Write events to topics with partition keys for ordering
  • Consumers — Read events, track offsets, commit progress
  • Consumer groups — Load balance across multiple consumers
  • Schema Registry — Enforce data contracts with Avro/Protobuf/JSON Schema
  • Kafka Connect — Pre-built connectors for databases, warehouses, cloud storage

Kafka's durability and replayability make it ideal for streaming data pipelines.

What's Next

Compare deployment options: Confluent Cloud, AWS MSK, or self-hosted Redpanda.

Continue to Deployment Options