Runbook: Disaster Recovery
Summary
Recover from data loss, infrastructure failures, and service outages across the data stack. This covers Snowflake Time Travel, Terraform state recovery, pipeline failover, and backup strategies.
When to Use
- Data was accidentally deleted or corrupted in Snowflake
- A Terraform apply destroyed resources unexpectedly
- Terraform state file is corrupted or lost
- A critical pipeline has been failing for an extended period
- An AWS service outage affects the data platform
- Snowflake account access is compromised
Prerequisites
- Access: Snowflake with ACCOUNTADMIN role
- Access: AWS with infrastructure-admin profile
- Access: Terraform state bucket (S3) and lock table (DynamoDB)
- Context: What was lost, when it happened, and the impact scope
Steps
1. Assess the Situation
Before taking any action, determine:
| Question | Why It Matters |
|---|---|
| What was affected? | Scopes the recovery effort |
| When did it happen? | Determines Time Travel availability |
| Is the issue ongoing? | May need to stop further damage first |
| Who is impacted? | Determines urgency and communication needs |
| Is data recoverable? | Time Travel, backups, or source re-ingestion |
Stop the Bleeding First
If an automated process is causing ongoing damage (e.g. a misconfigured pipeline overwriting data), disable it before attempting recovery. Pause the Prefect deployment, revert the Terraform change, or suspend the Snowflake warehouse.
2. Recover Snowflake Data
Using Time Travel
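Snowflake retains historical data for the configured Time Travel retention period (1 day on Standard Edition; up to 90 days on Enterprise Edition, set via DATA_RETENTION_TIME_IN_DAYS). Within that window you can restore dropped or changed objects.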
Restore a dropped table:
USE ROLE SYSADMIN;
UNDROP TABLE <DATABASE>.<SCHEMA>.<TABLE>;
Restore a table to a point in time:
-- Restore to 1 hour ago
CREATE OR REPLACE TABLE <DATABASE>.<SCHEMA>.<TABLE>
CLONE <DATABASE>.<SCHEMA>.<TABLE> AT (OFFSET => -3600);
-- Restore to a specific timestamp (state the time zone explicitly if it is not UTC)
CREATE OR REPLACE TABLE <DATABASE>.<SCHEMA>.<TABLE>
CLONE <DATABASE>.<SCHEMA>.<TABLE> AT (TIMESTAMP => '2026-03-01 12:00:00'::TIMESTAMP);
-- Restore to before a specific query
CREATE OR REPLACE TABLE <DATABASE>.<SCHEMA>.<TABLE>
CLONE <DATABASE>.<SCHEMA>.<TABLE> BEFORE (STATEMENT => '<query_id>');
Restore an entire schema:
CREATE OR REPLACE SCHEMA <DATABASE>.<SCHEMA>
CLONE <DATABASE>.<SCHEMA> AT (OFFSET => -3600);
Restore an entire database:
CREATE OR REPLACE DATABASE <DATABASE>
CLONE <DATABASE> AT (OFFSET => -3600);
Time Travel Limits
Time Travel only works within the retention period. After that, data enters Fail-safe (7 days, Snowflake support access only). Beyond Fail-safe, data is unrecoverable from Snowflake.
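The windows above reduce to simple date arithmetic. The sketch below is illustrative only: the function name recovery_path and the returned labels are hypothetical, and retention_days must be the value actually configured on the affected object.

```python
from datetime import datetime, timedelta, timezone

FAIL_SAFE_DAYS = 7  # Fail-safe window as stated in this runbook


def recovery_path(incident_at: datetime, now: datetime, retention_days: int) -> str:
    """Classify which Snowflake recovery option still applies to an incident."""
    age = now - incident_at
    if age < timedelta(0):
        raise ValueError("incident_at is in the future")
    if age <= timedelta(days=retention_days):
        return "time_travel"  # self-service: UNDROP or CLONE ... AT/BEFORE
    if age <= timedelta(days=retention_days + FAIL_SAFE_DAYS):
        return "fail_safe"  # requires a Snowflake Support request
    return "unrecoverable"  # fall back to backups or source re-ingestion


if __name__ == "__main__":
    now = datetime(2026, 3, 10, 12, 0, tzinfo=timezone.utc)
    incident = datetime(2026, 3, 8, 12, 0, tzinfo=timezone.utc)
    print(recovery_path(incident, now, retention_days=1))
```

Running this check during the assessment step tells you early whether to open a Snowflake Support case rather than spend time on Time Travel queries that will fail.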
Using Clones as Backups
For critical operations (large backfills, schema migrations), create a backup clone first:
-- Create a backup before a risky operation
CREATE DATABASE <DATABASE>_BACKUP CLONE <DATABASE>;
-- If something goes wrong, swap back
ALTER DATABASE <DATABASE> RENAME TO <DATABASE>_BROKEN;
ALTER DATABASE <DATABASE>_BACKUP RENAME TO <DATABASE>;
-- Clean up after confirming recovery
DROP DATABASE <DATABASE>_BROKEN;
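Zero-copy clones add no storage cost when created, because the clone shares the original's storage. You pay only for data that changes in either copy after the clone is taken.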
3. Recover Terraform State
Corrupted State File
S3 versioning is enabled on the Terraform state bucket. Restore a previous version:
# List state file versions
aws s3api list-object-versions \
--bucket your-org-terraform-state \
--prefix snowflake/terraform.tfstate \
--profile infrastructure-admin
# Download a specific version
aws s3api get-object \
--bucket your-org-terraform-state \
--key snowflake/terraform.tfstate \
--version-id <VERSION_ID> \
restored-state.tfstate \
--profile infrastructure-admin
# Upload the restored state
aws s3 cp restored-state.tfstate \
s3://your-org-terraform-state/snowflake/terraform.tfstate \
--profile infrastructure-admin
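Choosing which version to restore is usually a matter of finding the newest version written before the incident. The following is a minimal sketch, assuming a response shaped like boto3's list_object_versions output (a "Versions" list whose entries carry "Key", "VersionId", and a timezone-aware "LastModified"). The helper name last_version_before and the sample version IDs are hypothetical.

```python
from datetime import datetime, timezone
from typing import Optional


def last_version_before(response: dict, key: str, incident_at: datetime) -> Optional[str]:
    """Return the VersionId of the newest version of `key` written before incident_at."""
    candidates = [
        v for v in response.get("Versions", [])
        if v["Key"] == key and v["LastModified"] < incident_at
    ]
    if not candidates:
        return None  # no version predates the incident
    return max(candidates, key=lambda v: v["LastModified"])["VersionId"]


if __name__ == "__main__":
    key = "snowflake/terraform.tfstate"
    # Hypothetical sample data; real VersionIds are opaque strings issued by S3.
    sample = {
        "Versions": [
            {"Key": key, "VersionId": "v3", "LastModified": datetime(2026, 3, 1, 14, 0, tzinfo=timezone.utc)},
            {"Key": key, "VersionId": "v2", "LastModified": datetime(2026, 3, 1, 9, 0, tzinfo=timezone.utc)},
            {"Key": key, "VersionId": "v1", "LastModified": datetime(2026, 2, 20, 9, 0, tzinfo=timezone.utc)},
        ]
    }
    incident = datetime(2026, 3, 1, 12, 0, tzinfo=timezone.utc)
    print(last_version_before(sample, key, incident))
```

Pass the returned ID as <VERSION_ID> in the get-object command. If you parse the AWS CLI's JSON output instead, convert the LastModified strings to datetime values first.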
DynamoDB Lock
If a Terraform operation was interrupted, the DynamoDB lock may be stuck. Check and remove it:
aws dynamodb scan \
--table-name your-org-terraform-locks \
--profile infrastructure-admin
If a stale lock exists, release it with terraform force-unlock <LOCK_ID>, or delete the item via the AWS Console or CLI. Only do this if you are certain no other Terraform operation is running.
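To judge whether a lock is stale, it helps to see who took it and when. The sketch below parses a scan result and assumes the item layout typically written by Terraform's S3 backend: a LockID string attribute plus an Info attribute holding JSON with fields such as ID, Who, Operation, and Created. Verify this against your own table before relying on it; the function name describe_locks is hypothetical.

```python
import json


def describe_locks(scan_response: dict) -> list[dict]:
    """Extract lock entries from a DynamoDB scan of the Terraform lock table."""
    locks = []
    for item in scan_response.get("Items", []):
        info_attr = item.get("Info")
        if info_attr is None:
            continue  # rows without Info (e.g. state digests) are not locks
        info = json.loads(info_attr["S"])
        locks.append({
            "lock_id": item["LockID"]["S"],
            "terraform_lock_id": info.get("ID"),
            "who": info.get("Who"),
            "operation": info.get("Operation"),
            "created": info.get("Created"),
        })
    return locks


if __name__ == "__main__":
    # Hypothetical scan output in DynamoDB's typed-attribute JSON format.
    sample = {
        "Items": [
            {"LockID": {"S": "your-org-terraform-state/snowflake/terraform.tfstate-md5"},
             "Digest": {"S": "0123abcd"}},
            {"LockID": {"S": "your-org-terraform-state/snowflake/terraform.tfstate"},
             "Info": {"S": json.dumps({"ID": "example-lock-id", "Who": "ci@runner",
                                       "Operation": "OperationTypeApply",
                                       "Created": "2026-03-01T11:58:00Z"})}},
        ]
    }
    for lock in describe_locks(sample):
        print(lock)
```

Confirm with the person or CI job named in the who field that the operation is no longer running before you remove anything.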
Destroyed Resources
If terraform apply destroyed resources unexpectedly:
- Check the Terraform plan output to understand what was destroyed
- Revert the PR that caused the destruction
- Re-apply to recreate the resources from the reverted configuration
- Re-import any resources that can't be recreated automatically (e.g. terraform import)
- Restore data using Time Travel if databases or schemas were dropped
4. Recover Pipeline State
Prefect Flow Failures
If a critical pipeline has been failing:
- Check the flow run logs in Prefect UI for the root cause
- Fix the underlying issue (credentials, source availability, schema change)
- Determine the data gap - what date range was missed
- Run a backfill to fill the gap (see Backfills)
- Verify downstream models are correct after the backfill
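Determining the data gap can be sketched as a comparison between the dates that loaded successfully and the dates that should have. This is an illustration, not part of the pipeline code: the function name missing_dates is hypothetical, and it assumes one load per calendar day and that you can obtain the set of successful dates (for example from Prefect run history or a load audit table).

```python
from datetime import date, timedelta


def missing_dates(loaded: set[date], start: date, end: date) -> list[tuple[date, date]]:
    """Return contiguous (first, last) ranges of dates in [start, end] with no load."""
    gaps = []
    gap_start = None
    day = start
    while day <= end:
        if day not in loaded:
            if gap_start is None:
                gap_start = day  # a new gap opens here
        elif gap_start is not None:
            gaps.append((gap_start, day - timedelta(days=1)))  # gap closed yesterday
            gap_start = None
        day += timedelta(days=1)
    if gap_start is not None:
        gaps.append((gap_start, end))  # gap runs to the end of the window
    return gaps


if __name__ == "__main__":
    loaded = {date(2026, 3, 1), date(2026, 3, 2), date(2026, 3, 5)}
    for first, last in missing_dates(loaded, date(2026, 3, 1), date(2026, 3, 6)):
        print(f"backfill {first} to {last}")
```

Each returned range maps to one backfill invocation, which keeps backfills small and makes partial failures easier to retry.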
dlt Pipeline State
dlt tracks pipeline state (last loaded record, incremental cursors) in its state store. If this is corrupted:
# Reset pipeline state (forces full reload on next run)
cd ~/projects/data/data-pipelines
python -c "
import dlt
pipeline = dlt.pipeline(pipeline_name='<pipeline_name>')
pipeline.drop()
"
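Then run a full reload or backfill. Note that pipeline.drop() clears the pipeline's local working directory; if the destination also holds pipeline state, dlt may restore it on the next run, so confirm that the incremental cursors were actually reset before relying on a full reload.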
5. Recover from AWS Outages
If an AWS service outage affects the platform:
| Affected Service | Impact | Mitigation |
|---|---|---|
| S3 | Data lake unavailable, Snowpipe paused | Wait for recovery; Snowpipe auto-catches up |
| Secrets Manager | Pipelines can't authenticate | Use cached credentials if available; wait for recovery |
| ECS | Self-hosted services down (Prefect, Airbyte, Lightdash) | Wait for recovery; consider cloud alternatives |
| DynamoDB | Terraform state locking unavailable | Wait; do not run Terraform without locking |
Monitor AWS service health at status.aws.amazon.com.
6. Recover from Account Compromise
If a Snowflake or AWS account may be compromised:
- Rotate all service account credentials immediately:
# Generate new key pairs for each service account
openssl genrsa 2048 | openssl pkcs8 -topk8 -inform PEM -out new_key.pem -nocrypt
openssl rsa -in new_key.pem -pubout -out new_key.pub
- Update Snowflake public keys:
USE ROLE SECURITYADMIN;
ALTER USER SVC_DLT SET RSA_PUBLIC_KEY = '<new public key>';
ALTER USER SVC_DBT SET RSA_PUBLIC_KEY = '<new public key>';
ALTER USER SVC_AIRBYTE SET RSA_PUBLIC_KEY = '<new public key>';
- Update AWS Secrets Manager with the new private keys
- Review Snowflake access history for suspicious activity:
SELECT *
FROM SNOWFLAKE.ACCOUNT_USAGE.LOGIN_HISTORY
WHERE event_timestamp > DATEADD(day, -7, CURRENT_TIMESTAMP())
ORDER BY event_timestamp DESC;
- Enable network policies to restrict access if not already in place
- Report the incident to your security team
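A week of login history is often too long to read row by row. The sketch below shows one way to triage exported LOGIN_HISTORY rows, assuming each row is a dict keyed by the view's CLIENT_IP and IS_SUCCESS columns (with IS_SUCCESS holding 'YES' or 'NO'). The function names, the threshold, and the sample addresses are hypothetical.

```python
import ipaddress
from collections import Counter


def failed_logins_by_ip(rows: list[dict], threshold: int = 5) -> dict[str, int]:
    """Count failed logins per client IP and keep IPs at or above the threshold."""
    counts = Counter(r["CLIENT_IP"] for r in rows if r["IS_SUCCESS"] == "NO")
    return {ip: n for ip, n in counts.items() if n >= threshold}


def successes_outside(rows: list[dict], allowed_cidrs: list[str]) -> list[dict]:
    """Return successful logins whose client IP is outside every allowed network."""
    networks = [ipaddress.ip_network(c) for c in allowed_cidrs]
    flagged = []
    for r in rows:
        if r["IS_SUCCESS"] != "YES":
            continue
        ip = ipaddress.ip_address(r["CLIENT_IP"])
        if not any(ip in net for net in networks):
            flagged.append(r)
    return flagged


if __name__ == "__main__":
    # Hypothetical rows using documentation-reserved IP ranges.
    rows = [
        {"USER_NAME": "SVC_DLT", "CLIENT_IP": "203.0.113.10", "IS_SUCCESS": "YES"},
        {"USER_NAME": "SVC_DBT", "CLIENT_IP": "198.51.100.7", "IS_SUCCESS": "YES"},
        {"USER_NAME": "SVC_DBT", "CLIENT_IP": "198.51.100.7", "IS_SUCCESS": "NO"},
        {"USER_NAME": "SVC_DBT", "CLIENT_IP": "198.51.100.7", "IS_SUCCESS": "NO"},
    ]
    print(failed_logins_by_ip(rows, threshold=2))
    print(successes_outside(rows, ["203.0.113.0/24"]))
```

Successful logins from unexpected networks are the stronger signal of compromise. The same CIDR list can seed the network policy once access is restricted.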
Verification
- Affected data restored to the correct state
- All pipelines running successfully
- dbt tests pass on affected models
- Dashboards showing correct data
- No residual errors in Prefect, dbt, or Snowflake logs
- Credentials rotated (if compromise was involved)
Rollback
Recovery operations can themselves cause further damage, so keep a way back:
- Before restoring - create a clone of the current state as a safety net
- After restoring - verify data integrity before deleting backups
- Keep clones for 24-48 hours after recovery to ensure no secondary issues emerge
Escalation
- First contact: Data Engineering team in #data-eng Slack
- Escalation: Infrastructure team lead
- Security incidents: Security team immediately, then infrastructure team lead
- Snowflake platform issues: Snowflake Support (Enterprise accounts)
See Also
- Backfills - Historical data reprocessing
- Security Hardening - Key rotation and audit logging
- Alerting and Incidents - Incident response workflows