SaaS Data Ingestion
On this page, you will:
- Understand when to use Airbyte vs dlt for SaaS data
- Learn Airbyte's role in the modern data stack
- Plan the infrastructure needed for SaaS ingestion
Overview
This section covers setting up Airbyte for SaaS data ingestion. Airbyte provides 600+ pre-built connectors, a UI for non-engineers, and reverse ETL capabilities.
If you only need one or two simple SaaS sources, you may not need Airbyte at all. The HubSpot Pipeline page shows how to add a SaaS connector using dlt's verified sources within your existing batch pipeline infrastructure.
This section is for when you outgrow that approach.
┌─────────────────────────────────────────────────────────────────────────────┐
│                          SAAS DATA INGESTION LAYER                          │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│  SaaS Sources                      Airbyte                   Snowflake      │
│  ────────────                      ───────                   ─────────      │
│                                                                             │
│  ┌─────────────────┐     ┌─────────────────────────┐                        │
│  │ HubSpot         │     │                         │     ┌───────────────┐  │
│  │ (CRM data)      │────▶│      Airbyte Cloud      │────▶│ AIRBYTE.      │  │
│  │                 │     │      or Self-Hosted     │     │ HUBSPOT       │  │
│  └─────────────────┘     │                         │     └───────────────┘  │
│                          │  ┌───────────────────┐  │                        │
│  ┌─────────────────┐     │  │ HubSpot Connector │  │     ┌───────────────┐  │
│  │ Salesforce      │────▶│  │ Salesforce Conn.  │  │────▶│ AIRBYTE.      │  │
│  │ (example)       │     │  │ Stripe Connector  │  │     │ SALESFORCE    │  │
│  └─────────────────┘     │  └───────────────────┘  │     └───────────────┘  │
│                          │                         │                        │
│  ┌─────────────────┐     │  Reverse ETL:           │                        │
│  │ Stripe          │────▶│  Snowflake → SaaS       │                        │
│  │ (example)       │     │                         │                        │
│  └─────────────────┘     └─────────────────────────┘                        │
│                                       │                                     │
│                            ┌──────────▼──────────┐                          │
│                            │       Prefect       │                          │
│                            │  • Trigger syncs    │                          │
│                            │  • Monitor status   │                          │
│                            │  • Alerting         │                          │
│                            └─────────────────────┘                          │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
When to Use Airbyte
| Scenario | Recommended Tool |
|---|---|
| Simple REST APIs (Exchange Rates, etc.) | dlt |
| Database extraction (PostgreSQL, MySQL) | dlt |
| 1-2 SaaS sources with dlt verified sources | dlt |
| Complex SaaS (Salesforce, NetSuite, Workday) | Airbyte |
| Non-engineers need to configure connections | Airbyte |
| Reverse ETL (Snowflake → HubSpot/Salesforce) | Airbyte |
| 5+ SaaS sources | Airbyte |
| Cost-sensitive, code-first team | dlt |
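The table above can be read as a small decision rule. The following is a minimal sketch of that rule, not an official selection algorithm: the `IngestionNeeds` class and `recommend_tool` function are hypothetical names introduced for illustration, and the handling of three or four sources (which the table does not cover) is an assumption.

```python
from dataclasses import dataclass


@dataclass
class IngestionNeeds:
    """Hypothetical summary of a team's SaaS ingestion requirements."""

    saas_source_count: int = 0
    complex_saas: bool = False             # e.g. Salesforce, NetSuite, Workday
    non_engineers_configure: bool = False  # connections are set up via a UI
    reverse_etl: bool = False              # Snowflake -> SaaS syncs are needed


def recommend_tool(needs: IngestionNeeds) -> str:
    """Return "Airbyte" or "dlt", following the rows of the table above."""
    # Any one of these requirements points to Airbyte on its own.
    if needs.complex_saas or needs.non_engineers_configure or needs.reverse_etl:
        return "Airbyte"
    # Five or more SaaS sources is the table's threshold for Airbyte.
    if needs.saas_source_count >= 5:
        return "Airbyte"
    # Assumption: the table is silent on 3-4 simple sources; default to dlt.
    return "dlt"


if __name__ == "__main__":
    print(recommend_tool(IngestionNeeds(saas_source_count=2)))
    print(recommend_tool(IngestionNeeds(saas_source_count=1, reverse_etl=True)))
```

Treat the result as a starting point for discussion: cost sensitivity and team preference, which the table also lists, are not modelled here.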
Why not Airbyte for everything?
You might wonder why the batch ingestion section uses dlt rather than Airbyte for all data sources. The reasons are:
| Consideration | dlt | Airbyte |
|---|---|---|
| Simple REST APIs | Native Python, full control | Overkill — connector overhead |
| Custom extraction logic | Easy to customise | Requires forking a connector |
| Database extraction | sql_database source works well | Debezium-based, more complex |
| Cost | Free (open source) | Per-record/credit pricing |
| Debugging | Standard Python debugging | Container logs, UI inspection |
Why not dlt for everything?
dlt works well for SaaS sources when verified sources exist (like HubSpot). But Airbyte has advantages for complex SaaS:
- OAuth flows: Salesforce, Google Ads, and similar require complex OAuth — Airbyte handles this via its UI
- Schema management: Airbyte auto-detects and tracks upstream schema changes in SaaS tools
- Connector breadth: 600+ connectors means coverage for almost any tool
- Reverse ETL: Airbyte supports syncing data from Snowflake back to SaaS tools
- Non-engineer access: The Airbyte UI lets non-engineers add and configure connections
What You Will Build
By the end of this section, your Snowflake account will contain:
Snowflake
├── AIRBYTE database
│   ├── HUBSPOT schema
│   │   └── CONTACTS table (synced from HubSpot CRM)
│   └── (additional schemas per SaaS source)
│
├── SVC_AIRBYTE service account
│   └── SVC_AIRBYTE dedicated role (created by user module)
│       └── Granted AIRBYTE_DB_WRITER
│
└── Role hierarchy
    └── AIRBYTE_DB_READER → ANALYTICS_SOURCES_READER
        └── Analysts can read all Airbyte-loaded data
Prefect orchestration:
Prefect
├── hubspot-airbyte-daily flow
│   └── Triggers Airbyte sync via API
└── Automations
    └── Alerts on sync failure
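To make "Triggers Airbyte sync via API" concrete, here is a minimal sketch of the HTTP request such a flow might send, using only the Python standard library. It rests on assumptions to check against the Airbyte API reference for your deployment: the base URL is the Airbyte Cloud public API, the job is created with a `POST` to `/jobs` carrying a `connectionId` and `jobType`, and authentication uses a bearer token. The function names are hypothetical. In the actual flow this logic would typically sit inside a Prefect task, which is omitted here to keep the example self-contained.

```python
import json
import urllib.request

# Assumption: Airbyte Cloud public API base URL; self-hosted deployments differ.
AIRBYTE_API_URL = "https://api.airbyte.com/v1"


def build_sync_request(connection_id: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a request that asks Airbyte to start a sync job."""
    payload = json.dumps(
        {"connectionId": connection_id, "jobType": "sync"}
    ).encode("utf-8")
    return urllib.request.Request(
        url=f"{AIRBYTE_API_URL}/jobs",
        data=payload,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )


def trigger_sync(connection_id: str, token: str) -> dict:
    """Send the request and return the decoded JSON response.

    urlopen raises urllib.error.HTTPError on 4xx/5xx responses, which an
    orchestrator can surface as a failed run.
    """
    request = build_sync_request(connection_id, token)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Separating request construction from sending keeps the payload easy to unit-test without network access or real credentials.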
Infrastructure Decisions
Airbyte Cloud vs Self-Hosted
| Factor | Airbyte Cloud | Self-Hosted (ECS) |
|---|---|---|
| Setup | Sign up, configure via UI | Deploy infrastructure with Terraform |
| Cost | $99+/month (Starter) | ~$81/month (ECS + RDS) |
| Maintenance | Managed by Airbyte | You manage upgrades, scaling |
| Connectors | All available | All available |
| Best for | Getting started, small teams | Cost optimisation at scale |
This section documents both approaches.
Database Naming
Databases follow the same pattern as in the batch ingestion section: each one is named after the loader tool that writes to it.
| Loader | Database | Example Schema |
|---|---|---|
| dlt | DLT | DLT.OPEN_EXCHANGE_RATES |
| Snowpipe | SNOWPIPE | SNOWPIPE.OPEN_EXCHANGE_RATES |
| Airbyte | AIRBYTE | AIRBYTE.HUBSPOT |
Role Pattern
The SVC_AIRBYTE service account uses the user_create_dedicated_role = true pattern from the user Terraform module. This automatically creates a SVC_AIRBYTE role that is granted to the user. The database module then grants AIRBYTE_DB_WRITER to this role.
For read access, AIRBYTE_DB_READER is granted to ANALYTICS_SOURCES_READER, so all analyst roles can query Airbyte-loaded data through the existing role hierarchy.
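The grant chain described above can be modelled as a small graph to check who ends up with which access. This is an illustrative sketch only: the real grants are created by the Terraform modules, the `ANALYST` role is a hypothetical stand-in for any analyst role that is granted ANALYTICS_SOURCES_READER, and `effective_roles` is a name introduced here.

```python
# Maps a role to the roles granted to it (whose privileges it inherits).
ROLE_GRANTS = {
    "SVC_AIRBYTE": ["AIRBYTE_DB_WRITER"],
    "ANALYTICS_SOURCES_READER": ["AIRBYTE_DB_READER"],
    "ANALYST": ["ANALYTICS_SOURCES_READER"],  # hypothetical analyst role
}


def effective_roles(role: str) -> set[str]:
    """Return the role plus every role it inherits, directly or transitively."""
    seen: set[str] = set()
    stack = [role]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # already visited; also guards against cycles
        seen.add(current)
        stack.extend(ROLE_GRANTS.get(current, []))
    return seen


if __name__ == "__main__":
    print(sorted(effective_roles("ANALYST")))
    print(sorted(effective_roles("SVC_AIRBYTE")))
```

In this model the analyst role reaches AIRBYTE_DB_READER but never AIRBYTE_DB_WRITER, while the service account's dedicated role holds only the writer grant.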
Section Contents
| Page | What You Will Do |
|---|---|
| Airbyte Concepts | Learn sources, destinations, connections, and sync modes |
| Deployment Options | Compare Cloud vs self-hosted, choose your approach |
| Airbyte Cloud Setup | Create workspace, generate API credentials |
| Self-Hosted Setup | Deploy Airbyte on ECS with Terraform (optional) |
| Snowflake Infrastructure | Create AIRBYTE database, SVC_AIRBYTE user, role grants |
| HubSpot Connection | Configure HubSpot source, Snowflake destination, first sync |
| Reverse ETL | Sync enriched data from Snowflake back to SaaS tools |
| Prefect Orchestration | Trigger syncs from Prefect, scheduling, monitoring |
| Finishing Up | Verification, cost summary, next steps |
Prerequisites
Before starting this section, ensure you have completed:
- Snowflake Terraform Setup - Snowflake managed by Terraform
- Data Warehouse - Databases, roles, and users module
- Orchestration - Prefect Cloud or self-hosted
- Batch Data Ingestion - Understanding of dlt patterns (recommended)
Cost Summary
| Approach | Monthly Cost | Notes |
|---|---|---|
| dlt only (Option A from batch section) | $0 | Free, uses existing infrastructure |
| Airbyte Cloud | $99+ | Starter tier, per-record overage |
| Airbyte Self-Hosted | ~$81 | ECS + RDS infrastructure costs |
Detailed cost breakdowns are covered in Deployment Options and the Costs page.
Get Started
Continue to Airbyte Concepts →