Build Your First Modern Data Stack

A step-by-step guide to building a production-grade data platform from scratch - the kind of architecture that handles real business data reliably, at scale, and without requiring a large team to maintain it.

By the end of this guide, you will have a fully working data stack: streaming events landing in your warehouse within seconds, batch pipelines running on a schedule, dbt models transforming raw data into clean analytics tables, and dashboards your team can actually use. Every piece of infrastructure is version-controlled, every pipeline is monitored, and every decision is explained.


What You'll Build

The guide uses a sales analytics use case to make the stack concrete. You will build a platform that:

  • Streams purchase events from a web app into Snowflake in real time via Kafka
  • Loads product catalogue data from a PostgreSQL database on a schedule with dlt
  • Fetches exchange rate data from a public API and stages it through S3
  • Syncs customer records from HubSpot via Airbyte
  • Transforms all four sources into clean dbt models - fact_purchases, dim_products, fct_exchange_rates, dim_customers
  • Joins everything into a sales mart with per-customer revenue in GBP and USD
  • Surfaces the results in a Lightdash dashboard with metrics defined in code
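To make the last join concrete, here is a minimal sketch of the per-customer revenue calculation in plain Python. In the guide this logic lives in dbt SQL models, not Python; the sample rows, the field names, the `rates_to_gbp` table, and the `revenue_per_customer` helper are all hypothetical and only illustrate the shape of the computation.

```python
from collections import defaultdict
from decimal import Decimal

# Hypothetical sample rows standing in for the purchases model.
purchases = [
    {"customer_id": "c1", "amount": Decimal("100.00"), "currency": "USD"},
    {"customer_id": "c1", "amount": Decimal("50.00"), "currency": "GBP"},
    {"customer_id": "c2", "amount": Decimal("80.00"), "currency": "USD"},
]

# Assumed rate table: GBP received for one unit of the source currency.
rates_to_gbp = {"GBP": Decimal("1"), "USD": Decimal("0.8")}


def revenue_per_customer(purchases, rates_to_gbp):
    """Return {customer_id: {"gbp": total, "usd": total}} for the given rows."""
    gbp_totals = defaultdict(Decimal)
    for row in purchases:
        # Convert every purchase to GBP first, then sum per customer.
        gbp_totals[row["customer_id"]] += row["amount"] * rates_to_gbp[row["currency"]]

    # Derive the USD figure from the GBP total using the inverse USD rate.
    usd_per_gbp = Decimal("1") / rates_to_gbp["USD"]
    return {
        customer: {"gbp": total, "usd": total * usd_per_gbp}
        for customer, total in gbp_totals.items()
    }


mart = revenue_per_customer(purchases, rates_to_gbp)
for customer, revenue in sorted(mart.items()):
    print(customer, revenue["gbp"], revenue["usd"])
```

Converting to a single base currency before aggregating keeps the totals consistent across currencies. `Decimal` is used so that monetary amounts are not subject to binary floating-point rounding.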

The use case is intentionally simple. The stack is not. You are building the real thing.


The Stack

Layer | Tool | Why
Streaming ingestion | Confluent Cloud (Kafka) | Managed Kafka with Schema Registry and Connect included
Batch ingestion - APIs and databases | dlt | Free, Python-native, handles schema inference automatically
Batch ingestion - SaaS | Airbyte | Hundreds of pre-built connectors, low-code configuration
Data warehouse | Snowflake | Separates compute from storage, excellent Terraform support
Transformation | dbt | SQL-native, version-controlled, tested models
BI and dashboards | Lightdash | Metrics defined in dbt YAML, not duplicated in the tool
Orchestration | Prefect | Modern Python-native replacement for Airflow
Observability | Elementary + OpenMetadata | dbt-native testing plus a full data catalogue
Infrastructure | Terraform | Everything as code - nothing created by hand
CI/CD | GitHub Actions | Terraform plans, dbt tests, and Prefect deployments
Secrets | AWS Secrets Manager | Centralised credential storage for all services

Where a tool has a self-hosted alternative, both options are covered. You can choose based on your budget and how much infrastructure you want to manage.


How This Guide Is Structured

Getting Started  →  Build  →  Maintain

Getting Started covers everything you need before writing any infrastructure code. You will set up your GitHub organisation, configure your local development environment, create accounts with AWS, Snowflake, and Prefect, and get Terraform running with remote state and CI/CD. This section is a prerequisite for everything that follows.

Build walks through each layer of the stack in dependency order - from the data warehouse and object storage up through ingestion, transformation, analytics, observability, streaming, and documentation. Each section explains the concepts, guides you through the Terraform and configuration, and ends with a summary of what you have built.
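"Dependency order" can be illustrated with a small sketch using Python's standard-library `graphlib`. The dependency map below is an assumption made for illustration, covering only some of the layers; the guide's actual section order may resolve ties differently.

```python
from graphlib import TopologicalSorter

# Assumed dependencies: each layer maps to the layers that must exist first.
layer_dependencies = {
    "warehouse": set(),
    "object_storage": set(),
    "ingestion": {"warehouse", "object_storage"},
    "transformation": {"ingestion"},
    "analytics": {"transformation"},
    "observability": {"transformation"},
}

# static_order() yields every layer only after all of its dependencies.
build_order = list(TopologicalSorter(layer_dependencies).static_order())
print(build_order)
```

Several orders can satisfy the same dependencies (for example, the warehouse and object storage can come in either order), which is why the sketch does not claim a single expected output.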

Maintain covers day-to-day operations: adding users, adding data sources, handling backfills, responding to incidents, and keeping the stack up to date.


What Makes This Guide Different

It is production-grade from the start. Infrastructure is managed with Terraform from day one. There are no "and now you would add monitoring in production" caveats - monitoring is built in as you go.

It is opinionated, but explains why. Every tool choice has a rationale. Where there are trade-offs between managed and self-hosted options, both are covered honestly.

It is incremental. Each section builds on the last. You end every section with something working, not a half-built system waiting for later chapters to make sense.

It uses real patterns. The Terraform modules and pipeline patterns in this guide are based on real client deployments, not toy examples. You can take them and adapt them directly.

It includes AI-assisted workflows. The Maintain section includes CLAUDE.md templates and Claude Code skills for your Terraform, dbt, and Prefect repositories - so an AI assistant can help with routine tasks like adding users or creating new data sources.


Before You Start

This guide assumes you are comfortable with:

  • Python (you will write dlt pipelines and Prefect flows)
  • SQL (you will write dbt models)
  • The command line
  • Git and GitHub

You do not need prior experience with any of the specific tools. Each section introduces concepts before diving into implementation.

You will need accounts with GitHub, AWS, Snowflake, Prefect, and Confluent Cloud. The Getting Started section covers how to set these up. The Costs page breaks down what each service costs at the scale used in this guide.


Get Started

Begin with the Initial Setup →