Batch Data Ingestion with dlt
In this section, you'll build data pipelines using dlt (data load tool) to ingest data from APIs and databases into your Snowflake data warehouse. These pipelines are orchestrated by Prefect, which you set up in the previous section.
What is Batch Data Ingestion?
Batch data ingestion is the process of extracting data from source systems and loading it into your data warehouse on a schedule. Unlike streaming ingestion (which processes data continuously), batch ingestion runs at defined intervals - hourly, daily, or weekly.
Common batch ingestion patterns:
- API extraction - Pull data from REST APIs (e.g. exchange rates, weather, SaaS platforms)
- Database replication - Copy data from operational databases (e.g. PostgreSQL, MySQL)
- File processing - Load data from files in S3 or other storage
Why dlt?
dlt is a Python library that simplifies data pipeline development. It handles the complex parts of data loading - schema management, incremental loading, and destination-specific optimisations - so you can focus on your data.
| Feature | dlt | Custom Python | Fivetran/Airbyte |
|---|---|---|---|
| Setup | uv add dlt |
Write from scratch | SaaS signup or self-host |
| Schema handling | Automatic inference and evolution | Manual DDL | Automatic |
| Incremental loading | Built-in with state management | Manual implementation | Built-in |
| Cost | Free (open source) | Engineering time | $1+ per MAR or hosting costs |
| Customisation | Full Python control | Full control | Limited to connectors |
| Connectors | Growing library + easy custom sources | Build everything | 300+ pre-built |
dlt is ideal when you need custom extraction logic or want to avoid SaaS costs. For standard SaaS sources (Salesforce, HubSpot, Stripe), consider Airbyte which has pre-built connectors.
Role in the Stack
dlt sits in the ingestion layer, loading raw data into dedicated loader databases:
┌─────────────────────────────────────────────────────────────────────────────┐
│ DATA SOURCES │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ REST APIs │ │ Databases │ │ Files (S3) │ │
│ │ (Exchange │ │ (PostgreSQL, │ │ (CSV, JSON, │ │
│ │ Rates, etc.) │ │ MySQL, etc.) │ │ Parquet) │ │
│ └────────┬────────┘ └────────┬────────┘ └────────┬────────┘ │
│ │ │ │ │
│ └────────────────────┼────────────────────┘ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ dlt Pipelines │ │
│ │ (Orchestrated by Prefect - retries, scheduling, alerts) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ┌───────────────┴───────────────┐ │
│ ▼ ▼ │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ DLT Database │ │ SNOWPIPE Database │ │
│ │ (dlt-loaded tables) │ │ (auto-ingest from S3) │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ dbt Transformations │ │
│ │ (covered in Data Transformation section) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ ANALYTICS Database │ │
│ │ (transformed, analysis-ready) │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
What You'll Build
In this section, you'll create four data pipelines demonstrating different ingestion patterns:
┌─────────────────────────────────────────────────────────────────────────────┐
│ BATCH DATA INGESTION │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Pipeline 1: Currencies (API → Snowflake direct) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Open Exchange │ dlt │ Snowflake │ Prefect│ Weekly refresh │ │
│ │ Rates API │ ────▶ │ DLT.OPEN_ │ ◀───── │ replace mode │ │
│ │ /currencies.json│ │ EXCHANGE_RATES │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Pipeline 2: Exchange Rates (API → S3 → Snowpipe → Snowflake) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Open Exchange │ dlt │ S3 Bucket │Snowpipe│ SNOWPIPE. │ │
│ │ Rates API │ ────▶ │ /exchange-rates/ │ ────▶ │ OPEN_EXCHANGE_ │ │
│ │ /historical/ │ │ │ │ RATES │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Pipeline 3: Products (PostgreSQL → Snowflake) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ Clever Cloud │ dlt │ Snowflake │ Prefect│ Daily 06:00 │ │
│ │ PostgreSQL │ ────▶ │ DLT.APPLICATION │ ◀───── │ Type 2 SCD │ │
│ │ (Products) │ │ _DATA │ │ │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
│ Pipeline 4: HubSpot (API → Snowflake) │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐ │
│ │ HubSpot CRM │ dlt │ Snowflake │ Prefect│ Daily 07:00 │ │
│ │ (contacts) │ ────▶ │ DLT.HUBSPOT │ ◀───── │ incremental │ │
│ └─────────────────┘ └─────────────────┘ └─────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Key patterns demonstrated:
- Direct API to Snowflake loading with dlt (simplest pattern)
- S3 staging with Snowpipe auto-ingestion for high-volume appends
- Database extraction with Type 2 SCD handling
- Prefect orchestration with schedules, retries, and alerting
- Incremental loading with merge keys for idempotency
Prerequisites
Before starting this section, ensure you have completed:
- Prefect Setup - Orchestration layer ready (Cloud or self-hosted)
- S3 Data Lake - S3 buckets for staging data
- Data Warehouse - Snowflake with storage integrations
You'll also need:
- Open Exchange Rates account (free tier available)
- Clever Cloud account with PostgreSQL database (free DEV tier)
Section Overview
| Page | What You'll Learn |
|---|---|
| 1. dlt Concepts | Core concepts: sources, resources, pipelines, destinations |
| 2. Project Setup | Repository structure, uv, pyproject.toml, VaultDocProvider |
| 3. Snowflake Infrastructure | DLT and SNOWPIPE databases, dedicated roles, Snowpipe module |
| 4. Credentials Setup | API keys and database credentials in AWS Secrets Manager |
| 5. Currencies Pipeline | Your first dlt pipeline — reference data direct to Snowflake |
| 6. Exchange Rates to S3 | Historical data via S3 staging and Snowpipe auto-ingestion |
| 7. Products Pipeline | Database extraction with incremental loading |
| 8. Prefect Orchestration | Wrap pipelines in Prefect flows with schedules and alerting |
| 9. HubSpot Pipeline | CRM data ingestion with the dlt HubSpot connector |
| 10. Finishing Up | Verification and next steps |
Cost Considerations
This section uses mostly free or low-cost services:
| Component | Cost |
|---|---|
| dlt | Free (open source) |
| Open Exchange Rates | Free tier: 1,000 requests/month |
| Clever Cloud PostgreSQL | Free DEV tier |
| Snowpipe | ~$0.06 per 1,000 files processed |
| S3 storage | ~$0.023/GB/month |
The main costs are your existing Snowflake compute (LOADING warehouse) and any Prefect infrastructure.
What's Next
Start by understanding the core dlt concepts - sources, resources, pipelines, and destinations. This foundation applies to all the pipelines you'll build.
Continue to dlt Concepts →