Kafka Concepts

On this page, you will:

Understand Kafka's core components (topics, partitions, producers, consumers)
Learn how consumer groups enable parallel processing
Master offsets and ordering guarantees
Discover Schema Registry for data contracts
Compare Kafka to traditional message queues

Overview

Apache Kafka is a distributed event streaming platform. Unlike traditional message queues, which typically delete messages after consumption, Kafka treats events as a durable, replayable log. This difference enables replay of old events ("time travel"), multiple independent consumers, and guaranteed ordering within each partition.

Understanding Kafka's architecture is essential before building streaming pipelines.

┌─────────────────────────────────────────────────────────────────────────┐
│                        KAFKA ARCHITECTURE                               │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Producers              Kafka Cluster           Consumers               │
│  ─────────              ─────────────           ─────────               │
│                                                                         │
│  ┌──────────┐          ┌──────────────┐        ┌──────────┐            │
│  │ Web App  │─────────▶│ Topic: orders│───────▶│ Snowflake│            │
│  │ (Python) │          │              │        │ Connector│            │
│  └──────────┘          │ Partition 0  │        └──────────┘            │
│                        │ [msg1][msg2] │                                │
│  ┌──────────┐          │              │        ┌──────────┐            │
│  │ Mobile   │─────────▶│ Partition 1  │───────▶│ Analytics│            │
│  │ App      │          │ [msg3][msg4] │        │ Consumer │            │
│  └──────────┘          │              │        └──────────┘            │
│                        │ Partition 2  │                                │
│                        │ [msg5][msg6] │        ┌──────────┐            │
│                        └──────────────┘───────▶│ ML Model │            │
│                                                │ Consumer │            │
│                                                └──────────┘            │
│                                                                         │
│  Multiple producers → One topic → Multiple independent consumers        │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Topics: The Event Log

What is a Topic?

A topic is a named category of events. Think of it as an append-only log that stores events in order.

Examples:
- order-events: All order-related events (created, shipped, delivered)
- user-clicks: Clickstream data from the website
- sensor-readings: IoT temperature/humidity data

Topics are Durable

Unlike traditional queues where messages disappear after consumption, Kafka topics retain events for a configured period (e.g., 7 days, 30 days, or forever).

Benefits:
- Replay: Reprocess old events if consumer logic changes
- Multiple consumers: Different applications read the same events independently
- Debugging: Inspect historical events to troubleshoot issues

Example:

Topic: order-events (retention: 7 days)

[Day 1] order_created → order_shipped → order_delivered
[Day 2] order_created → order_cancelled
[Day 3] order_created → order_shipped

Day 8: Day 1 events automatically deleted

Topic Configuration

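# Create a topic with 3 partitions, 7-day retention
# (localhost:9092 is a placeholder; use your own broker address)
kafka-topics --create \
    --bootstrap-server localhost:9092 \
    --topic order-events \
    --partitions 3 \
    --replication-factor 3 \
    --config retention.ms=604800000  # 7 days in milliseconds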

Key settings:
- partitions: Number of parallel streams (more partitions allow more consumers to read in parallel)
- replication-factor: Copies of each partition for fault tolerance (with 3 copies, the data survives the loss of 2 brokers)
- retention.ms: How long to keep events (7 days = 604800000 ms)
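The retention arithmetic can be checked with a few lines of Python. This is a minimal sketch: the helper names `retention_ms` and `is_expired` are made up for illustration, and real brokers delete whole log segments rather than individual events, so actual deletion can lag behind the configured period.

```python
from datetime import datetime, timedelta, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000


def retention_ms(days: int) -> int:
    """Convert a retention period in days to a retention.ms value."""
    return days * MS_PER_DAY


def is_expired(event_time: datetime, now: datetime, retention_days: int) -> bool:
    """Return True once an event is older than the retention period."""
    return now - event_time > timedelta(days=retention_days)


day_1 = datetime(2026, 2, 20, tzinfo=timezone.utc)
day_8 = day_1 + timedelta(days=7, seconds=1)

print(retention_ms(7))              # 604800000
print(is_expired(day_1, day_8, 7))  # True
```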

Partitions: Parallelism and Ordering

What is a Partition?

A partition is a subdivision of a topic. Each partition is an ordered, immutable sequence of events.

Why partitions?
- Parallelism: Multiple consumers read different partitions simultaneously
- Scalability: Distribute load across many brokers
- Ordering: Events in the same partition are guaranteed to be in order

Partition Structure

Topic: order-events (3 partitions)

Partition 0:
[offset 0] {"order_id": 101, "status": "created"}
[offset 1] {"order_id": 104, "status": "shipped"}
[offset 2] {"order_id": 107, "status": "delivered"}

Partition 1:
[offset 0] {"order_id": 102, "status": "created"}
[offset 1] {"order_id": 105, "status": "cancelled"}

Partition 2:
[offset 0] {"order_id": 103, "status": "created"}
[offset 1] {"order_id": 106, "status": "shipped"}

Offset — Sequential ID for each message in a partition (starts at 0)
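To make the structure concrete, here is a small in-memory model of a single partition. It is only an illustration of the append-only log and its offsets; `PartitionLog` is a hypothetical class, not part of any Kafka client.

```python
class PartitionLog:
    """In-memory stand-in for one partition: an append-only list of events."""

    def __init__(self) -> None:
        self._events: list[dict] = []

    def append(self, event: dict) -> int:
        """Store an event at the end of the log and return its offset."""
        self._events.append(event)
        return len(self._events) - 1

    def read_from(self, offset: int) -> list[tuple[int, dict]]:
        """Return (offset, event) pairs from the given offset onwards."""
        return list(enumerate(self._events[offset:], start=offset))


partition_0 = PartitionLog()
partition_0.append({"order_id": 101, "status": "created"})    # offset 0
partition_0.append({"order_id": 104, "status": "shipped"})    # offset 1
partition_0.append({"order_id": 107, "status": "delivered"})  # offset 2

# Reading does not remove anything, so the same offsets can be replayed.
print(partition_0.read_from(1))
```

Because reads never modify the log, calling `read_from(1)` twice returns the same events, which is the property that makes replay possible.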

Partition Keys

Producers use a partition key to determine which partition receives an event.

Strategy 1: Key-based partitioning

# All events for order_id 101 go to the same partition
producer.produce(
    topic='order-events',
    key=str(order_id).encode('utf-8'),  # Partition key
    value=json.dumps(event).encode('utf-8')
)

Result: All events for order_id=101 are in the same partition, guaranteeing order.
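Key-based partitioning works because the client hashes the key and takes the result modulo the number of partitions. The sketch below uses CRC32 purely for illustration; real clients use their own hash functions, and `choose_partition` is a hypothetical helper rather than a client API.

```python
import zlib


def choose_partition(key: bytes, num_partitions: int) -> int:
    """Map a key to a partition with a stable hash (CRC32 used for illustration)."""
    return zlib.crc32(key) % num_partitions


for order_id in (101, 102, 101):
    key = str(order_id).encode("utf-8")
    print(order_id, "->", choose_partition(key, num_partitions=3))
```

The same key always maps to the same partition as long as the partition count stays the same. Adding partitions later changes the mapping for existing keys.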

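Strategy 2: No key

# Without a key, the client chooses the partition and spreads events across them
producer.produce(
    topic='order-events',
    value=json.dumps(event).encode('utf-8')  # No key
)

Result: Load is spread across partitions for maximum throughput, but related events may land in different partitions, so their relative order is not guaranteed. Older clients rotate through partitions round-robin; newer ones may fill a batch for one partition before moving to the next.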

Ordering Guarantees

✅ Guaranteed: Events in the same partition are ordered
❌ Not guaranteed: Events across different partitions may arrive out of order

Example:

Order 101 events (same partition key = order_id):
Partition 0: created → shipped → delivered ✓ Ordered

Order 102 and 103 (different partitions):
Partition 1: Order 102 created (timestamp 10:00:01)
Partition 2: Order 103 created (timestamp 10:00:00)

Consumer may see Order 103 before Order 102 ✗ Not globally ordered

Best practice: Use partition keys (e.g., user_id, order_id) when event order matters.
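The following simulation illustrates that guarantee without a broker. It is a sketch under simple assumptions: partitions are plain lists, the key-to-partition rule (`order_id % NUM_PARTITIONS`) is a stand-in for the client's hash, and the consumer happens to drain partitions from highest to lowest.

```python
from collections import defaultdict

events = [
    {"order_id": 101, "status": "created"},
    {"order_id": 102, "status": "created"},
    {"order_id": 101, "status": "shipped"},
    {"order_id": 102, "status": "cancelled"},
    {"order_id": 101, "status": "delivered"},
]

NUM_PARTITIONS = 3
partitions = defaultdict(list)
for event in events:
    # Hypothetical key-to-partition rule; real clients hash the key bytes.
    partitions[event["order_id"] % NUM_PARTITIONS].append(event)

# A consumer may drain partitions in any order, here highest number first.
consumed = [e for p in sorted(partitions, reverse=True) for e in partitions[p]]


def statuses(order_id):
    """Statuses for one order, in the order the consumer saw them."""
    return [e["status"] for e in consumed if e["order_id"] == order_id]


print(statuses(101))       # ['created', 'shipped', 'delivered']
print(consumed == events)  # False: the global order is not preserved
```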

Producers: Writing Events

What is a Producer?

A producer is an application that writes events to a Kafka topic.

Example: Python producer

from confluent_kafka import Producer
import json

# Configure producer
conf = {
    'bootstrap.servers': 'pkc-abc123.eu-west-2.aws.confluent.cloud:9092',
    'security.protocol': 'SASL_SSL',
    'sasl.mechanism': 'PLAIN',
    'sasl.username': 'API_KEY',
    'sasl.password': 'API_SECRET'
}

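def delivery_report(err, msg):
    """Called once per message to report delivery success or failure."""
    if err is not None:
        print(f"Delivery failed: {err}")
    else:
        print(f"Delivered to {msg.topic()} [{msg.partition()}] at offset {msg.offset()}")

producer = Producer(conf)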

# Produce an event
event = {
    'order_id': 101,
    'customer_id': 'cust_456',
    'status': 'created',
    'timestamp': '2026-02-20T10:00:00Z'
}

producer.produce(
    topic='order-events',
    key=str(event['order_id']).encode('utf-8'),  # Partition by order_id
    value=json.dumps(event).encode('utf-8'),
    callback=delivery_report  # Confirm delivery
)

producer.flush()  # Wait for all messages to be sent

Producer Guarantees

At-least-once delivery (default):
- Events are retried on failure
- May produce duplicates if the broker stored the event but the acknowledgement was lost, so the producer sends it again

Exactly-once delivery (optional, higher overhead):
- Idempotent producer + transactional writes
- No duplicates, but slower

Configuration:

conf = {
    'acks': 'all',  # Wait for all in-sync replicas to acknowledge
    'retries': 10,  # Retry up to 10 times
    'enable.idempotence': True  # Retries cannot create duplicates; exactly-once also needs transactions
}
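The idea behind the idempotent producer can be sketched in a few lines: each write carries a producer ID and a sequence number, and the partition ignores a sequence number it has already stored. This is a deliberately simplified model, not the broker's actual implementation; `PartitionWithDedup` is a hypothetical class.

```python
class PartitionWithDedup:
    """Toy partition that drops retried writes it has already stored."""

    def __init__(self) -> None:
        self.log: list[str] = []
        self._last_seq: dict[str, int] = {}  # producer id -> last sequence stored

    def write(self, producer_id: str, seq: int, value: str) -> bool:
        """Append the value unless this (producer, sequence) was already written."""
        if seq <= self._last_seq.get(producer_id, -1):
            return False  # Duplicate from a retry: do not append again
        self.log.append(value)
        self._last_seq[producer_id] = seq
        return True


partition = PartitionWithDedup()
partition.write("producer-1", 0, "order_created")
partition.write("producer-1", 1, "order_shipped")
# The acknowledgement for seq 1 was lost, so the producer sends it again.
partition.write("producer-1", 1, "order_shipped")

print(partition.log)  # ['order_created', 'order_shipped']
```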

Consumers: Reading Events

What is a Consumer?

A consumer is an application that reads events from a Kafka topic.

Example: Python consumer

import json

from confluent_kafka import Consumer

conf = {
    'bootstrap.servers': 'pkc-abc123.eu-west-2.aws.confluent.cloud:9092',
    # Confluent Cloud also needs the SASL settings shown in the producer example
    'group.id': 'snowflake-sink',  # Consumer group
    'auto.offset.reset': 'earliest',  # Used only when the group has no committed offset
    'enable.auto.commit': False  # Offsets are committed manually below
}

consumer = Consumer(conf)
consumer.subscribe(['order-events'])

try:
    while True:
        msg = consumer.poll(1.0)  # Poll for messages (1 second timeout)

        if msg is None:
            continue  # No message arrived within the timeout
        if msg.error():
            print(f"Error: {msg.error()}")
            continue

        # Process event
        event = json.loads(msg.value().decode('utf-8'))
        print(f"Received order {event['order_id']}: {event['status']}")

        # Commit this message's offset (mark as processed)
        consumer.commit(message=msg)
finally:
    consumer.close()  # Leave the group cleanly

Offsets: Tracking Progress

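Offsets also track a consumer's progress: for each partition, the consumer group commits the offset of the next message it should read.

Partition 0: [0][1][2][3][4][5][6][7][8][9]
                          ↑
                  Last processed offset: 4
                  Committed offset (next message to read): 5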

Offset management:

Auto-commit (default):

conf = {'enable.auto.commit': True}  # Commit every 5 seconds

Manual commit (more control):

conf = {'enable.auto.commit': False}

# Process message
process_event(msg)

# Commit offset only after successful processing
consumer.commit()

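Best practice: Use manual commit to avoid losing messages if the consumer crashes mid-processing. The trade-off is that a message may be processed twice after a restart, so processing should tolerate duplicates.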
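The difference between the two commit strategies shows up when a consumer crashes. The simulation below is a sketch, not client code: `run_until_crash` is a hypothetical function, and `commit_first=True` stands in for any situation where an offset is committed (for example by a periodic auto-commit) before processing has finished.

```python
def run_until_crash(messages, committed, crash_at, commit_first):
    """Consume from the committed offset and crash while handling `crash_at`.

    Returns (processed, committed): messages fully processed before the crash
    and the offset a restarted consumer would resume from.
    """
    processed = []
    for offset in range(committed, len(messages)):
        if commit_first:
            committed = offset + 1       # Offset committed before the work is done
        if offset == crash_at:
            return processed, committed  # Crash mid-processing
        processed.append(messages[offset])
        if not commit_first:
            committed = offset + 1       # Offset committed only after success
    return processed, committed


messages = ["m0", "m1", "m2", "m3"]

_, resume = run_until_crash(messages, committed=0, crash_at=2, commit_first=True)
print(resume)  # 3: m2 is skipped after restart (lost)

_, resume = run_until_crash(messages, committed=0, crash_at=2, commit_first=False)
print(resume)  # 2: m2 is read again after restart (not lost)
```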

Consumer Groups: Load Balancing

What is a Consumer Group?

A consumer group is a set of consumers that share the work of reading a topic.

How it works:
- Each partition is consumed by exactly one consumer in the group
- If a consumer fails, its partitions are reassigned to others

Example: 3 Consumers, 3 Partitions

Topic: order-events (3 partitions)

Consumer Group: snowflake-sink

┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│ Consumer 1   │       │ Consumer 2   │       │ Consumer 3   │
│ reads        │       │ reads        │       │ reads        │
│ Partition 0  │       │ Partition 1  │       │ Partition 2  │
└──────────────┘       └──────────────┘       └──────────────┘

Benefits:
- Parallel processing: 3 consumers can deliver up to 3× the throughput of one
- Fault tolerance: If Consumer 1 crashes, Partition 0 is reassigned to Consumer 2 or 3

Example: More Consumers Than Partitions

Topic: order-events (3 partitions)

Consumer Group: snowflake-sink (5 consumers)

┌──────────────┐       ┌──────────────┐       ┌──────────────┐
│ Consumer 1   │       │ Consumer 2   │       │ Consumer 3   │
│ Partition 0  │       │ Partition 1  │       │ Partition 2  │
└──────────────┘       └──────────────┘       └──────────────┘

┌──────────────┐       ┌──────────────┐
│ Consumer 4   │       │ Consumer 5   │
│ (idle)       │       │ (idle)       │
└──────────────┘       └──────────────┘

Result: Consumers 4 and 5 are idle. You can't have more active consumers than partitions.

Best practice: Number of consumers ≤ number of partitions for optimal utilisation.
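Both examples can be reproduced with a simple round-robin assignment. This is only a sketch of the idea; `assign_partitions` is a hypothetical function, and real group coordinators support several assignment strategies.

```python
def assign_partitions(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    """Spread partitions over consumers round-robin; extra consumers get nothing."""
    assignment: dict[str, list[int]] = {c: [] for c in consumers}
    for i, partition in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(partition)
    return assignment


five = assign_partitions([0, 1, 2], ["c1", "c2", "c3", "c4", "c5"])
print(five)  # {'c1': [0], 'c2': [1], 'c3': [2], 'c4': [], 'c5': []}

# In a group of three, c1 crashes: the remaining consumers take over its partition.
after_failure = assign_partitions([0, 1, 2], ["c2", "c3"])
print(after_failure)  # {'c2': [0, 2], 'c3': [1]}
```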

Multiple Consumer Groups

Different applications can read the same topic independently:

Topic: order-events

Consumer Group 1: snowflake-sink (writes to Snowflake)
Consumer Group 2: analytics-engine (real-time aggregations)
Consumer Group 3: fraud-detector (ML model scoring)

Each group maintains its own offsets independently.
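A minimal model of independent offsets, assuming a single partition held in a list: each group has its own committed offset, and reading for one group never moves another group's position. The `poll` function here is a hypothetical helper, not the client method of the same name.

```python
log = ["order_created", "order_shipped", "order_delivered"]

# Committed offset per consumer group for one partition (next offset to read).
group_offsets = {"snowflake-sink": 0, "analytics-engine": 0, "fraud-detector": 0}


def poll(group: str, max_records: int) -> list[str]:
    """Return the next records for a group and advance only that group's offset."""
    start = group_offsets[group]
    records = log[start:start + max_records]
    group_offsets[group] = start + len(records)
    return records


poll("snowflake-sink", 3)    # This group reads everything
poll("analytics-engine", 1)  # This group has only read the first event

print(group_offsets)
# {'snowflake-sink': 3, 'analytics-engine': 1, 'fraud-detector': 0}
```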

Schema Registry: Data Contracts

Why Schema Registry?

Problem: Producers and consumers need to agree on event structure.

Without Schema Registry:

# Producer changes event structure
producer.produce('order-events', value=json.dumps({'order_id': 101, 'total': 99.99}))  # Old format

producer.produce('order-events', value=json.dumps({'order_id': 102, 'total_amount': 120.50}))  # New format

# Consumer breaks on new format
event = json.loads(msg.value())
print(event['total'])  # KeyError: 'total' (field renamed!)

With Schema Registry:
- Schemas are versioned and stored centrally
- Producers must use a valid schema
- Consumers use the schema to deserialise correctly
- Backward/forward compatibility is enforced

Schema Formats

Avro (Recommended)

Binary format, compact, schema evolution support

{
  "type": "record",
  "name": "Order",
  "namespace": "com.company.orders",
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "customer_id", "type": "string"},
    {"name": "total", "type": "double"},
    {"name": "status", "type": "string"},
    {"name": "created_at", "type": {"type": "long", "logicalType": "timestamp-millis"}}
  ]
}

Pros:
- Compact (binary encoding without field names, typically much smaller than the equivalent JSON)
- Schema evolution (add fields with defaults without breaking consumers)
- Fast serialisation/deserialisation

Cons:
- Not human-readable (need tools to inspect)
- More complex than JSON

Use when: Production workloads with high throughput

Protobuf

Google's binary format, similar to Avro

syntax = "proto3";

message Order {
  int64 order_id = 1;
  string customer_id = 2;
  double total = 3;
  string status = 4;
  int64 created_at = 5;
}

Pros:
- Backward/forward compatible
- Code generation for multiple languages
- Very fast

Cons:
- More boilerplate than Avro
- Schema Registry support added later (less mature)

Use when: Already using Protobuf elsewhere, need code generation

JSON Schema

JSON with schema validation

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "properties": {
    "order_id": {"type": "integer"},
    "customer_id": {"type": "string"},
    "total": {"type": "number"},
    "status": {"type": "string"},
    "created_at": {"type": "string", "format": "date-time"}
  },
  "required": ["order_id", "customer_id", "total"]
}

Pros:
- Human-readable
- Familiar to developers
- Easy debugging

Cons:
- Larger message size (field names are repeated in every message)
- Slower serialisation
- Weaker schema evolution

Use when: Development/debugging, low throughput, human inspection required

Schema Evolution

Backward compatibility: Consumers using the new schema can read events written with the old schema

// Old schema (v1)
{
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "total", "type": "double"}
  ]
}

// New schema (v2) - backward compatible
{
  "fields": [
    {"name": "order_id", "type": "long"},
    {"name": "total", "type": "double"},
    {"name": "currency", "type": "string", "default": "GBP"}  // New field with default
  ]
}

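Forward compatibility: Consumers using the old schema can read events written with the new schema (new fields are ignored)

Schema Registry checks each new schema version against the compatibility mode configured for the subject (BACKWARD by default) and rejects breaking changes.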
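Both directions can be illustrated with plain dictionaries. The sketch below mimics how a reader's schema is applied to a decoded record; it is not the Avro library's resolution logic, and `read_with_schema` is a hypothetical helper.

```python
SCHEMA_V2_FIELDS = [
    {"name": "order_id", "type": "long"},
    {"name": "total", "type": "double"},
    {"name": "currency", "type": "string", "default": "GBP"},
]
SCHEMA_V1_FIELDS = SCHEMA_V2_FIELDS[:2]  # v1 has no currency field


def read_with_schema(record: dict, reader_fields: list[dict]) -> dict:
    """Project a decoded record onto the reader's fields, applying defaults."""
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            result[name] = record[name]
        elif "default" in field:
            result[name] = field["default"]  # Field missing in older data
        else:
            raise ValueError(f"No value or default for field '{name}'")
    return result


v1_event = {"order_id": 101, "total": 99.99}
v2_event = {"order_id": 102, "total": 120.50, "currency": "EUR"}

# Backward: a v2 reader fills in the default for old events.
print(read_with_schema(v1_event, SCHEMA_V2_FIELDS))
# {'order_id': 101, 'total': 99.99, 'currency': 'GBP'}

# Forward: a v1 reader ignores the field it does not know.
print(read_with_schema(v2_event, SCHEMA_V1_FIELDS))
# {'order_id': 102, 'total': 120.5}
```

If the new field had no default, the v2 reader would fail on old events, which is the kind of change a backward-compatibility check is meant to reject.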

Kafka Connect: Pre-Built Integrations

What is Kafka Connect?

Kafka Connect is a framework for integrating Kafka with external systems using connectors.

Source connectors pull data into Kafka:
- Database CDC (Debezium)
- Cloud storage (S3, GCS)
- Message queues (RabbitMQ, SQS)

Sink connectors push data from Kafka:
- Snowflake (Snowpipe Streaming)
- Elasticsearch
- S3 (data lake)

Example: Snowflake Sink Connector

{
  "name": "snowflake-sink",
  "config": {
    "connector.class": "com.snowflake.kafka.connector.SnowflakeSinkConnector",
    "topics": "order-events",
    "snowflake.url.name": "your-account.snowflakecomputing.com",
    "snowflake.user.name": "SVC_KAFKA_CONNECTOR",
    "snowflake.private.key": "...",
    "snowflake.database.name": "STREAMING",
    "snowflake.schema.name": "PUBLIC",
    "buffer.count.records": "10000",
    "buffer.flush.time": "60",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "io.confluent.connect.avro.AvroConverter"
  }
}

Benefits:
- No custom code needed
- Automatic retries and offset management
- Scaling handled by Kafka Connect

You'll configure this on page 5: Kafka Connect Snowflake.

Kafka vs Traditional Message Queues

Feature	Kafka	RabbitMQ/SQS
Durability	Events stored for days/weeks	Messages deleted after consumption
Replay	✅ Reprocess old events	❌ Once consumed, gone forever
Multiple consumers	✅ Many independent consumer groups read the same topic	⚠️ Needs fan-out to one queue per consumer (e.g. RabbitMQ exchanges, SNS → SQS)
Ordering	✅ Per-partition ordering	⚠️ Limited (FIFO queues in SQS, complex in RabbitMQ)
Throughput	✅ Millions of events/sec on a suitably sized cluster	⚠️ Typically lower; depends on the broker and queue type
Latency	⚠️ Milliseconds (batch-oriented)	✅ Low per-message latency (message-oriented)
Use case	Event streaming, data pipelines	Task queues, job processing

When to use Kafka:
- High-throughput event streams
- Multiple consumers need the same data
- Need to replay historical events
- Building real-time analytics pipelines

When to use RabbitMQ/SQS:
- Task queues (job processing)
- Low-latency message delivery
- Simple pub/sub without persistence
- Don't need replay capability

Summary

You've learned Kafka fundamentals:

Topics — Append-only logs that retain events for a configured period
Partitions — Enable parallelism and guarantee ordering within a partition
Producers — Write events to topics with partition keys for ordering
Consumers — Read events, track offsets, commit progress
Consumer groups — Load balance across multiple consumers
Schema Registry — Enforce data contracts with Avro/Protobuf/JSON Schema
Kafka Connect — Pre-built connectors for databases, warehouses, cloud storage

Kafka's durability and replayability make it ideal for streaming data pipelines.

What's Next

Compare deployment options: Confluent Cloud, AWS MSK, or self-hosted Redpanda.

Continue to Deployment Options →