Runbook: Backfills
Summary
Reprocess historical data when source corrections, schema changes, or logic updates require rebuilding pipeline output. This covers dlt pipeline backfills, Airbyte resyncs, Snowpipe replays, and dbt full refreshes.
When to Use
- Source data was corrected retroactively and downstream tables are stale
- A pipeline schema change requires reloading historical data
- An incremental dbt model's logic changed and needs rebuilding
- A new data source is added and historical data must be loaded
- Data quality issues were found in previously loaded data
Prerequisites
- Access: Prefect Cloud (to trigger runs), Snowflake (to verify data), relevant repository
- Context: Date range for the backfill, affected pipelines and models
- Timing: Schedule during low-usage hours - backfills consume significant compute
Coordinate Backfills
Large backfills can spike Snowflake credit usage and slow down other workloads. Notify the team before running backfills against production. Consider using a dedicated warehouse sized appropriately for the volume.
Steps
1. Identify the Backfill Scope
Determine which layer needs backfilling:
| Symptom | Backfill Layer | Action |
|---|---|---|
| Source API data was corrected | Ingestion | Re-run the pipeline for affected dates |
| dlt/Airbyte schema changed | Ingestion | Full re-sync of the source |
| dbt incremental logic changed | Transformation | dbt run --full-refresh on affected models |
| New source added, need history | Ingestion | Run historical backfill flow |
| Downstream aggregations are wrong | Transformation | Rebuild affected models and downstream |
2. Backfill Ingestion Layer
Use the dedicated backfill flows created during the Prefect Orchestration setup.
Trigger via Prefect CLI:
prefect deployment run exchange-rates-backfill/production \
--param start_date=2026-01-01 \
--param end_date=2026-02-28
Or trigger via Prefect UI:
- Navigate to Deployments → find the backfill deployment
- Click Run → Custom
- Set the start_date and end_date parameters
- Click Submit
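For long date ranges, it can be safer to trigger the backfill in smaller windows so that a failure only requires re-running one window. The following is a minimal sketch that builds one `prefect deployment run` command per calendar month. The helper names `month_windows` and `backfill_commands` are hypothetical, and the sketch assumes the backfill flow treats start_date and end_date as an inclusive range and can safely be run once per window.

```python
from datetime import date, timedelta


def month_windows(start: date, end: date) -> list[tuple[date, date]]:
    """Split an inclusive [start, end] range into calendar-month windows."""
    windows = []
    cursor = start
    while cursor <= end:
        # Jump past the end of the current month, then snap to day 1.
        next_month = (cursor.replace(day=1) + timedelta(days=32)).replace(day=1)
        window_end = min(next_month - timedelta(days=1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


def backfill_commands(deployment: str, start: date, end: date) -> list[list[str]]:
    """Build one `prefect deployment run` argv list per monthly window."""
    return [
        [
            "prefect", "deployment", "run", deployment,
            "--param", f"start_date={s.isoformat()}",
            "--param", f"end_date={e.isoformat()}",
        ]
        for s, e in month_windows(start, end)
    ]


if __name__ == "__main__":
    cmds = backfill_commands(
        "exchange-rates-backfill/production", date(2026, 1, 1), date(2026, 2, 28)
    )
    for argv in cmds:
        # Print only; pass argv to subprocess.run(argv, check=True) to execute.
        print(" ".join(argv))
```

Printing the commands first gives a chance to review the windows before anything runs against production.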
For pipelines without a dedicated backfill flow, create a one-off flow or modify the pipeline's write_disposition:
# Temporarily use "replace" to reload all data
pipeline.run(source(), write_disposition="replace")
Replace vs Merge
Using write_disposition="replace" drops and recreates the table. For incremental sources, prefer "merge" with a primary key to upsert records without losing data that isn't in the backfill range.
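The difference between the two dispositions can be illustrated with a plain-Python model of a destination table keyed by primary key. This is only a sketch of the semantics described above, not how dlt implements them; the function names and the sample rates are made up for illustration.

```python
def apply_replace(existing: dict, incoming: dict) -> dict:
    """Model of "replace": the table ends up holding only the incoming rows."""
    return dict(incoming)


def apply_merge(existing: dict, incoming: dict) -> dict:
    """Model of "merge": incoming rows upsert by key, all other rows are kept."""
    merged = dict(existing)
    merged.update(incoming)
    return merged


# Destination rows keyed by rate_date (invented sample values).
existing = {"2026-01-01": 1.08, "2026-01-02": 1.09, "2026-03-01": 1.07}
# The backfill only covers one corrected day.
backfill = {"2026-01-02": 1.10}

print(apply_replace(existing, backfill))  # {'2026-01-02': 1.1}
print(apply_merge(existing, backfill))
# {'2026-01-01': 1.08, '2026-01-02': 1.1, '2026-03-01': 1.07}
```

With "replace", the two rows outside the backfill range are gone; with "merge", only the corrected row changes.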
For Airbyte connections, reset and resync the affected streams:
- Navigate to Airbyte → Connections → select the connection
- Click the Settings tab
- Under Sync mode, click Reset data for the affected streams
- Click Save and then Sync now
Airbyte will clear the destination table and reload from the source. For large sources, this can take significant time.
Partial Resyncs
Airbyte does not support date-range backfills natively. A reset reloads all data for the stream. If you need a partial backfill, consider using dlt for that specific load.
Snowpipe does not replay automatically. To backfill:
- Re-upload the source files to S3 with new filenames (Snowpipe tracks processed files and skips names it has already loaded)
- Or use ALTER PIPE ... REFRESH to force reprocessing. REFRESH only picks up files staged within the last 7 days:
  USE ROLE SYSADMIN;
  ALTER PIPE SNOWPIPE.OPEN_EXCHANGE_RATES.EXCHANGE_RATES_PIPE REFRESH PREFIX = 'exchange-rates/2026/01/';
- Verify the pipe status:
  SELECT SYSTEM$PIPE_STATUS('SNOWPIPE.OPEN_EXCHANGE_RATES.EXCHANGE_RATES_PIPE');
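SYSTEM$PIPE_STATUS returns a JSON string, so a script that waits for the replay to finish can parse it instead of relying on a visual check. The sketch below assumes the payload contains the executionState and pendingFileCount fields; confirm this against the output in your account. The function name `pipe_is_drained` and the sample payloads are hypothetical.

```python
import json


def pipe_is_drained(status_json: str) -> bool:
    """Return True when the pipe is running and reports no pending files."""
    status = json.loads(status_json)
    return (
        status.get("executionState") == "RUNNING"
        and status.get("pendingFileCount", 0) == 0
    )


# Illustrative payloads, not captured from a real account.
still_loading = '{"executionState": "RUNNING", "pendingFileCount": 12}'
finished = '{"executionState": "RUNNING", "pendingFileCount": 0}'

print(pipe_is_drained(still_loading))  # False
print(pipe_is_drained(finished))       # True
```

A pending count of zero is only a rough signal that queued files were processed; the row-count checks in the Verification section remain the real confirmation.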
3. Backfill Transformation Layer (dbt)
After ingestion backfills complete, rebuild affected dbt models.
Full refresh a single model:
dbt run --select fct_exchange_rates --full-refresh
Full refresh a model and everything downstream:
dbt run --select fct_exchange_rates+ --full-refresh
Full refresh all models from a source:
dbt run --select source:dlt_open_exchange_rates+ --full-refresh
Run tests after the backfill:
dbt test --select fct_exchange_rates+
Incremental Models Only
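With dbt run, --full-refresh only changes behavior for incremental models. Views are recreated on every run, ephemeral models are inlined into the models that reference them, and table models without incremental config are always fully rebuilt, with or without the flag.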
4. Handle Large Backfills
For backfills that process significant data volumes:
- Use a larger warehouse temporarily:
  USE ROLE SYSADMIN;
  ALTER WAREHOUSE TRANSFORMING SET WAREHOUSE_SIZE = 'LARGE';
- Run the backfill:
  dbt run --select fct_exchange_rates+ --full-refresh
- Scale back down immediately after:
  ALTER WAREHOUSE TRANSFORMING SET WAREHOUSE_SIZE = 'X-SMALL';
- Monitor credit consumption in Snowflake → Admin → Cost Management during the backfill.
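The main risk in the sequence above is forgetting to scale back down when the backfill fails partway. One way to guard against that, sketched here, is to wrap the resize in a context manager so the scale-down runs even if the backfill raises. The `execute` callable is an assumption: it stands in for whatever function your tooling uses to run a SQL statement against Snowflake (for example, a cursor's execute method).

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def resized_warehouse(
    execute: Callable[[str], object],
    warehouse: str,
    backfill_size: str,
    normal_size: str,
) -> Iterator[None]:
    """Scale a warehouse up for the duration of a block, then back down."""
    execute(f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = '{backfill_size}'")
    try:
        yield
    finally:
        # Runs on success and on failure, so the warehouse never stays large.
        execute(f"ALTER WAREHOUSE {warehouse} SET WAREHOUSE_SIZE = '{normal_size}'")


if __name__ == "__main__":
    issued: list[str] = []  # stand-in executor that only records statements
    with resized_warehouse(issued.append, "TRANSFORMING", "LARGE", "X-SMALL"):
        pass  # run the dbt backfill here, e.g. via subprocess
    print("\n".join(issued))
```

The recording executor in the usage block makes it possible to dry-run the sequence before pointing it at a real connection.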
Verification
- Pipeline run completed successfully in Prefect UI
- Row counts match expectations:
  SELECT COUNT(*), MIN(rate_date), MAX(rate_date)
  FROM <DATABASE>.<SCHEMA>.<TABLE>;
- No duplicate records (for merge-based pipelines):
  SELECT primary_key, COUNT(*)
  FROM <TABLE>
  GROUP BY primary_key
  HAVING COUNT(*) > 1;
- dbt tests pass on affected models:
  dbt test --select <model>+
- Source freshness check passes:
  dbt source freshness
- Downstream dashboards show correct data
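A row count with correct MIN and MAX dates can still hide missing days in the middle of the range. As a complementary check, the sketch below compares the distinct dates actually loaded against every calendar day in the backfill window. It assumes the table should contain at least one row per calendar day; sources that skip weekends or holidays would need a different expected calendar. The function name `missing_dates` is hypothetical, and the loaded set would come from a SELECT DISTINCT rate_date query.

```python
from datetime import date, timedelta


def missing_dates(loaded: set[date], start: date, end: date) -> list[date]:
    """Return every day in the inclusive [start, end] range absent from loaded."""
    expected = (start + timedelta(days=i) for i in range((end - start).days + 1))
    return [day for day in expected if day not in loaded]


# Example: January 3rd never arrived.
loaded = {date(2026, 1, 1), date(2026, 1, 2), date(2026, 1, 4)}
print(missing_dates(loaded, date(2026, 1, 1), date(2026, 1, 4)))
# [datetime.date(2026, 1, 3)]
```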
Rollback
If the backfill introduces data issues:
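- Snowflake Time Travel - restore tables to their pre-backfill state:
  -- Restore to state before backfill (within retention period, default 1 day).
  -- OFFSET is in seconds: -3600 means one hour ago. Adjust it to a point before the backfill started.
  CREATE OR REPLACE TABLE <DATABASE>.<SCHEMA>.<TABLE>
  CLONE <DATABASE>.<SCHEMA>.<TABLE> AT (OFFSET => -3600);
- dbt incremental models - Time Travel works here too, provided the target point in time is not earlier than the creation of the current table (AT cannot reach back past a point where the table was dropped and recreated):
  CREATE OR REPLACE TABLE ANALYTICS.MARTS.FCT_EXCHANGE_RATES
  CLONE ANALYTICS.MARTS.FCT_EXCHANGE_RATES AT (OFFSET => -3600);
- Re-run the normal incremental pipeline to resume from the restored state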
Time Travel Retention
Time Travel retention defaults to 1 day. Standard edition is capped at 1 day; Enterprise edition and higher can be configured for up to 90 days on permanent tables. Check your account's DATA_RETENTION_TIME_IN_DAYS setting. For critical backfills, consider taking an explicit clone as a backup before starting.
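Two small helpers can make the rollback less error-prone: one that generates the explicit backup clone statement before the backfill, and one that computes a Time Travel OFFSET reaching back to just before the backfill started. This is a sketch under assumptions: the `_BACKFILL_BACKUP_` naming convention, both function names, and the five-minute safety margin are all hypothetical choices, not an established standard.

```python
from datetime import datetime, timezone


def backup_clone_sql(table: str, taken_at: datetime) -> str:
    """Build a CREATE TABLE ... CLONE statement with a timestamped backup name."""
    suffix = taken_at.strftime("%Y%m%d_%H%M%S")
    return f"CREATE TABLE {table}_BACKFILL_BACKUP_{suffix} CLONE {table};"


def restore_offset_seconds(
    backfill_started: datetime, now: datetime, margin_seconds: int = 300
) -> int:
    """Negative OFFSET (in seconds) pointing slightly before the backfill start."""
    elapsed = int((now - backfill_started).total_seconds())
    return -(elapsed + margin_seconds)


started = datetime(2026, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
checked = datetime(2026, 3, 1, 13, 0, 0, tzinfo=timezone.utc)

print(backup_clone_sql("ANALYTICS.MARTS.FCT_EXCHANGE_RATES", started))
print(restore_offset_seconds(started, checked))  # -3900
```

Recording the backfill start time when the run is triggered avoids guessing the offset later, and the computed value should still be checked against the retention period before use.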
Escalation
- First contact: Data Engineering team in #data-eng Slack
- Escalation: Pipeline team lead (for ingestion), analytics engineering lead (for dbt)
See Also
- Prefect Orchestration - Backfill flow patterns
- dbt Testing and Documentation - Test strategies
- Snowflake Monitoring - Credit usage monitoring