Streaming Data Ingestion

On this page, you will:

Understand when to use streaming vs batch ingestion
Learn Kafka/Confluent fundamentals
Survey real-time use cases (clickstreams, IoT, CDC)
Review deployment options (Confluent Cloud vs self-hosted)
Plan your streaming infrastructure

Overview

You've built batch data pipelines that load data on schedules (hourly, daily). But some use cases require real-time data - within seconds of events occurring. This is where streaming data ingestion excels.

Streaming ingestion processes data continuously as events happen, rather than waiting for the next scheduled batch run.

┌─────────────────────────────────────────────────────────────────────────┐
│                    STREAMING DATA INGESTION                             │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Event Sources         Kafka/Confluent       Snowflake                 │
│  ──────────────        ───────────────       ─────────                 │
│                                                                         │
│  ┌──────────────┐     ┌──────────────┐      ┌──────────────┐          │
│  │ Web clicks   │────▶│              │      │              │          │
│  │ Mobile app   │────▶│  Kafka       │─────▶│  Snowpipe    │          │
│  │ IoT sensors  │────▶│  Topics      │      │  Streaming   │          │
│  │ DB changes   │────▶│              │      │              │          │
│  │ (CDC)        │     │              │      │  STREAMING   │          │
│  └──────────────┘     └──────────────┘      │  database    │          │
│                              │               └──────────────┘          │
│                              │                      │                  │
│                              ▼                      ▼                  │
│                       ┌──────────────┐      ┌──────────────┐          │
│                       │ Schema       │      │ dbt          │          │
│                       │ Registry     │      │ incremental  │          │
│                       │ (Avro/Proto) │      │ models       │          │
│                       └──────────────┘      └──────────────┘          │
│                                                                         │
│  Latency: Seconds (not hours)                                          │
│  Use case: Real-time dashboards, fraud detection, operational analytics│
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Streaming vs Batch Ingestion

When to Use Streaming

✅ Use streaming when:

Latency matters — need data within seconds/minutes (not hours)
High event volume — thousands of events per second
Real-time decision-making — fraud detection, dynamic pricing, alerting
Event-driven workflows — trigger actions based on specific events
Audit trails — capture every state change in a database

Examples: - Clickstream analytics (user behaviour tracking) - IoT sensor data (temperature, location, machine telemetry) - Fraud detection (flag suspicious transactions immediately) - Change Data Capture (replicate database changes in real-time) - Real-time dashboards (live KPIs updating every second)

When to Use Batch

✅ Use batch when:

Hourly/daily is sufficient — don't need sub-minute latency
Lower event volume — hundreds/thousands of events per day
Source system doesn't support streaming — APIs with rate limits, databases without CDC
Simpler infrastructure — avoid Kafka cluster management

Examples: - Daily sales reports - Weekly reference data updates (currencies, product catalogues) - SaaS platform data (HubSpot, Salesforce) loaded nightly - Historical data backfills

Comparison

Aspect	Batch (dlt/Airbyte)	Streaming (Kafka)
Latency	Minutes to hours	Seconds
Volume	Low to medium	High
Infrastructure	Simple	Complex (Kafka cluster)
Cost	Lower	Higher
Use cases	Reports, analytics	Real-time dashboards, alerts
Setup complexity	Low	Medium to high
When data arrives	On schedule	Continuously

You need both: Use streaming for real-time operational data, batch for everything else.

Kafka and Confluent Fundamentals

What is Kafka?

Apache Kafka is a distributed event streaming platform. It acts as a durable, high-throughput message queue that: - Receives events from producers (applications generating data) - Stores events in topics (categories of events) - Delivers events to consumers (applications processing data)

Core Concepts

Topics — Categories of events (e.g., web-clicks, orders, sensor-readings)

Partitions — Topics are split into partitions for parallel processing and scalability

Producers — Applications that write events to Kafka topics

Consumers — Applications that read events from Kafka topics

Consumer Groups — Multiple consumers that share the work of processing a topic

Schema Registry — Centralized repository for event schemas (Avro, Protobuf, JSON Schema)

What is Confluent?

Confluent is the company founded by Kafka's creators. They offer: - Confluent Cloud — Fully managed Kafka as a service - Confluent Platform — On-premises Kafka distribution with additional features - Kafka Connect — Pre-built connectors for databases, cloud storage, and data warehouses

Why Confluent Cloud?

Kafka is operationally complex to run (clusters, brokers, Zookeeper, monitoring). Confluent Cloud abstracts this away:

✅ Fully managed — No cluster management, automatic upgrades ✅ Elastic scaling — Scale up/down automatically based on load ✅ Global availability — Multi-region replication built-in ✅ Security — Encryption, authentication, authorisation out of the box ✅ Schema Registry — Managed schema versioning and validation

Trade-off: Higher cost (~$150-200/month minimum) vs self-hosted (~$100/month for Redpanda on ECS).

Real-Time Use Cases

1. Clickstream Analytics

Scenario: Track user behaviour on your website in real-time.

Flow:

Browser → JavaScript tracker → Kafka topic → Snowflake → Real-time dashboard

Business value: - See which pages users visit right now - Identify drop-off points in checkout flow - Personalise content based on recent behaviour

Latency requirement: Seconds

2. IoT Sensor Data

Scenario: Monitor temperature sensors in warehouses.

Flow:

IoT devices → MQTT broker → Kafka topic → Snowflake → Alert if temp > 25°C

Business value: - Prevent spoilage by detecting temperature spikes immediately - Historical trend analysis for optimisation - Predictive maintenance based on sensor patterns

Latency requirement: Seconds to minutes

3. Change Data Capture (CDC)

Scenario: Replicate your PostgreSQL production database to Snowflake for analytics without impacting production.

Flow:

PostgreSQL WAL → Debezium → Kafka topic → Snowflake → Analytics queries

Business value: - Real-time analytics without slowing production database - Audit trail of all database changes - Data replication for disaster recovery

Latency requirement: Seconds

4. Fraud Detection

Scenario: Detect suspicious payment transactions immediately.

Flow:

Payment gateway → Kafka topic → Fraud ML model → Alert/block transaction

Business value: - Stop fraudulent transactions before they complete - Reduce chargebacks and losses - Real-time risk scoring

Latency requirement: Sub-second

5. Real-Time Dashboards

Scenario: Live sales dashboard updating every second.

Flow:

E-commerce checkout → Kafka topic → Snowflake → Lightdash dashboard (auto-refresh)

Business value: - Monitor Black Friday sales in real-time - React quickly to marketing campaign performance - Executive dashboards with live KPIs

Latency requirement: Seconds

Deployment Options

Option 1: Confluent Cloud (Recommended)

What you get: Fully managed Kafka + Schema Registry + Kafka Connect

Pricing: - Basic cluster: ~$150/month (1 CKU - Confluent Kafka Unit) - Standard cluster: ~$300/month (improved SLA, multi-AZ) - Dedicated cluster: $500+/month (isolated resources, private networking) - Additional costs: Data transfer ($0.08/GB), storage ($0.08/GB/month)

Monthly estimate: ~$150-200 for small workload (1 GB/day ingress, 30-day retention)

Pros: - Zero infrastructure management - Automatic scaling - 99.95% SLA (Standard tier) - Global replication built-in - Enterprise security (SOC 2, GDPR)

Cons: - Higher cost than self-hosted - Less control over infrastructure

Best for: Teams that want to focus on data, not Kafka operations

Option 2: AWS MSK (Managed Streaming for Kafka)

What you get: Managed Kafka cluster on AWS

Pricing: - Serverless: Pay per GB ingested (~$0.08/GB) + storage - Provisioned: 3-broker cluster ~$250/month (kafka.m5.large)

Monthly estimate: ~$250-300 for provisioned cluster

Pros: - Integrated with AWS (VPC, IAM, CloudWatch) - Lower cost than Confluent Cloud for large workloads - More control than fully managed

Cons: - You manage Kafka configuration, monitoring, upgrades - No Schema Registry (need Confluent Schema Registry or AWS Glue) - More operational burden

Best for: AWS-centric teams comfortable with Kafka operations

Option 3: Redpanda Self-Hosted (Budget)

What you get: Kafka-compatible platform (simpler, faster)

Infrastructure: - 3-node Redpanda cluster on ECS Fargate - No Zookeeper required (simpler than Kafka) - Built-in Schema Registry

Monthly cost: ~$100 (3 × ECS tasks, ALB, storage)

Pros: - Lowest cost option - Kafka-compatible (drop-in replacement) - Simpler architecture than Kafka - Full control

Cons: - You manage everything (deployment, upgrades, scaling, monitoring) - Smaller community than Kafka - No managed Schema Registry

Best for: Cost-conscious teams with DevOps expertise

Decision Matrix

Factor	Confluent Cloud	AWS MSK	Redpanda Self-Hosted
Cost	$150-200/mo	$250-300/mo	$100/mo
Operational burden	Zero	Medium	High
AWS integration	Via connectors	Native	Via connectors
Schema Registry	Included	Separate (Glue)	Built-in
Scaling	Automatic	Manual	Manual
Best for	Production, focus on data	AWS-heavy shops	Budget, DevOps-savvy

Recommendation: Start with Confluent Cloud Basic for simplicity. Migrate to self-hosted later if cost becomes a concern at scale.

What You'll Build

In this section, you'll create a complete streaming data pipeline:

┌─────────────────────────────────────────────────────────────────────────┐
│                   STREAMING PIPELINE ARCHITECTURE                       │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  1. Confluent Cloud Setup                                               │
│     ┌────────────────────────────────────────────────────┐             │
│     │ Kafka Cluster (Basic tier)                        │             │
│     │ • Topic: order-events                             │             │
│     │ • Schema Registry (Avro schemas)                  │             │
│     │ • Kafka Connect (Snowflake Sink Connector)        │             │
│     └────────────────────────────────────────────────────┘             │
│                              │                                          │
│  2. Event Production         ▼                                          │
│     ┌────────────────────────────────────────────────────┐             │
│     │ Python Producer (Prefect task)                    │             │
│     │ • Generate sample order events                    │             │
│     │ • Validate against Avro schema                    │             │
│     │ • Publish to order-events topic                   │             │
│     └────────────────────────────────────────────────────┘             │
│                              │                                          │
│  3. Snowflake Integration    ▼                                          │
│     ┌────────────────────────────────────────────────────┐             │
│     │ Snowpipe Streaming                                │             │
│     │ • STREAMING.ORDER_EVENTS table                    │             │
│     │ • Sub-second latency                              │             │
│     │ • SVC_KAFKA_CONNECTOR service account             │             │
│     └────────────────────────────────────────────────────┘             │
│                              │                                          │
│  4. dbt Transformation       ▼                                          │
│     ┌────────────────────────────────────────────────────┐             │
│     │ Incremental dbt models                            │             │
│     │ • Process new events every 5 minutes              │             │
│     │ • ANALYTICS.fct_orders_realtime                   │             │
│     └────────────────────────────────────────────────────┘             │
│                              │                                          │
│  5. Monitoring               ▼                                          │
│     ┌────────────────────────────────────────────────────┐             │
│     │ Prefect + Confluent Cloud Metrics                 │             │
│     │ • Monitor consumer lag                            │             │
│     │ • Alert on connector failures                     │             │
│     │ • Track throughput and latency                    │             │
│     └────────────────────────────────────────────────────┘             │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

Core Pipeline (9 Pages)

You'll learn: 1. Kafka concepts (topics, partitions, schemas) 2. Set up Confluent Cloud (cluster, Schema Registry, connectors) 3. Configure Snowflake for streaming (STREAMING database, service account) 4. Deploy Snowflake Kafka Connector 5. Write Python producers 6. Monitor with Prefect 7. Verify end-to-end latency

Optional Advanced Topics (3 Pages)

Self-hosted MSK (alternative to Confluent Cloud)
Stream processing (ksqlDB or Flink for transformations)
Change Data Capture (Debezium for PostgreSQL replication)

Prerequisites

Before starting this section, ensure you have completed:

Orchestration — Prefect for task scheduling
Data Warehouse — Snowflake with role-based access
Batch Data Ingestion — Understanding of data loading patterns
Data Transformation — dbt for processing streamed data

You'll extend these foundations to handle real-time data.

Cost Summary

Confluent Cloud Approach (~$200/month)

Component	Monthly Cost
Confluent Cloud Basic (1 CKU)	$150
Data transfer (1 GB/day ingress)	$2.50
Storage (30-day retention, 30 GB)	$2.50
Schema Registry	Included
Kafka Connect	Included
AWS Secrets Manager (Confluent credentials)	$0.50
Snowflake compute (Snowpipe Streaming)	$20-50
Total	~$175-205/month

Self-Hosted Redpanda Approach (~$100/month)

Component	Monthly Cost
ECS Fargate (3 × Redpanda nodes)	$60
ALB	$20
Storage (EBS volumes)	$15
CloudWatch Logs	$5
Total	~$100/month

Comparison to Batch

Batch ingestion (dlt): $0/month (just compute) Streaming ingestion (Confluent Cloud): $200/month

When streaming is worth it: When latency reduction (hours → seconds) provides business value exceeding $200/month (e.g., fraud prevention, real-time personalisation).

Section Contents

Page	What You'll Do
Kafka Concepts	Understand topics, partitions, producers, consumers, Schema Registry
Deployment Options	Compare Confluent Cloud, MSK, Redpanda - detailed costs and decision criteria
Confluent Cloud Setup	Create cluster, configure Schema Registry, generate API keys
Snowflake Infrastructure	STREAMING database, SVC_KAFKA_CONNECTOR service account
Kafka Connect Snowflake	Deploy Snowflake Sink Connector, configure Snowpipe Streaming
Producing Events	Python producers with Avro schemas, Prefect task integration
Prefect Orchestration	Monitor consumer lag, orchestrate event-driven flows
Finishing Up	Verify end-to-end latency, cost review, next steps
MSK Self-Hosted (Optional)	AWS MSK cluster via Terraform, alternative to Confluent
Stream Processing (Optional)	ksqlDB or Flink for real-time transformations
Change Data Capture (Optional)	Debezium for PostgreSQL CDC, database replication

Why Streaming Matters

Streaming data ingestion unlocks use cases that batch processing cannot address:

Real-time fraud detection — Block suspicious transactions immediately, not hours later

Live operational dashboards — Monitor Black Friday sales as they happen, react to issues instantly

Event-driven automation — Trigger workflows when specific events occur (e.g., send welcome email when user signs up)

IoT monitoring — Alert maintenance teams immediately when sensors detect anomalies

Database replication — Keep analytics warehouse in sync with production database in real-time

Audit trails — Capture every change to critical data for compliance

You're building batch pipelines for most data. Now add streaming for the data that must be real-time.

Get Started

Begin by understanding Kafka fundamentals: topics, partitions, and consumer groups.

Continue to Kafka Concepts →