Runbook: Troubleshooting
Summary
Diagnose and resolve common issues across the data stack. This runbook provides systematic debugging approaches for each layer - Snowflake, dbt, Prefect, dlt, Airbyte, and infrastructure.
When to Use
- A pipeline, model, or query is failing and the cause is unclear
- An alert fires and the standard runbook doesn't resolve the issue
- Something worked before and now doesn't
Prerequisites
- Access: Relevant tool dashboards (Prefect, Snowflake, dbt Cloud)
- Access: AWS CloudWatch Logs (for ECS-hosted services)
- Context: Error message, affected component, when it last worked
Steps
1. General Debugging Approach
For any issue, follow this sequence:
- Read the error message - most errors are descriptive
- Check when it last worked - what changed since then?
- Check the logs - Prefect UI, CloudWatch, Snowflake query history
- Reproduce locally (if possible) - run the failing pipeline or model in dev
- Fix, test, deploy - fix in a branch, test locally, create a PR
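As a starting point for the first step, the error strings quoted in this runbook can be matched to the section that covers them. The following is a minimal sketch, not an existing tool: the `ROUTES` table and `route_error` function are hypothetical names, and the patterns are simply the error messages that appear in the sections below.

```python
import re

# Hypothetical routing table: (regex on the lower-cased message, runbook section).
# Order matters: the more specific schema-change pattern is checked before the
# generic "compilation error" pattern.
ROUTES = [
    (r"incorrect username or password|jwt token is invalid|user is disabled|ip not allowed",
     "Snowflake Issues > Authentication Errors"),
    (r"insufficient privileges", "Snowflake Issues > Permission Errors"),
    (r"maximum execution time", "Snowflake Issues > Query Timeouts"),
    (r"warehouse is suspended", "Snowflake Issues > Warehouse Suspended"),
    (r"column .* does not exist", "dbt Issues > Schema Changes"),
    (r"compilation error", "dbt Issues > Model Compilation Errors"),
    (r"failure in test", "dbt Issues > Test Failures"),
    (r"deployment not found", "Prefect Issues > Deployment Not Found"),
    (r"schema has changed", "dlt Issues > Schema Evolution Errors"),
    (r"configfieldmissingexception", "dlt Issues > Credential Resolution Failures"),
    (r"error acquiring the state lock", "Infrastructure Issues > Terraform State Lock"),
]


def route_error(message: str) -> str:
    """Return the runbook section that matches an error message."""
    lowered = message.lower()
    for pattern, section in ROUTES:
        if re.search(pattern, lowered):
            return section
    # Nothing matched: fall back to the generic sequence above.
    return "General Debugging Approach"
```

For example, `route_error("Warehouse is suspended. Resume warehouse to execute queries.")` returns the Warehouse Suspended section, while an unrecognised message falls back to the general approach.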
2. Snowflake Issues
Authentication Errors
| Error | Cause | Fix |
|---|---|---|
| Incorrect username or password | Wrong credentials | Check .dlt/secrets.toml or Secrets Manager |
| JWT token is invalid | Expired or wrong key pair | Rotate the key (see Security Hardening) |
| User is disabled | User account disabled | Re-enable: ALTER USER <name> SET DISABLED = FALSE; |
| IP not allowed | Network policy blocking | Check network policy assignments and allowed IPs |
Permission Errors
Insufficient privileges to operate on <object>
- Check the user's current role:
  SELECT CURRENT_USER(), CURRENT_ROLE();
- Check what roles are granted:
  SHOW GRANTS TO USER <username>;
- Check what the role can access:
  SHOW GRANTS TO ROLE <role_name>;
- Common fix - the user needs a different role. Either switch roles or update Terraform:
  -- Temporary fix
  USE ROLE <correct_role>;
  -- Permanent fix via Terraform
  -- Add the role to user_additional_roles in users.tf or users.auto.tfvars
Query Timeouts
Statement reached its maximum execution time of 300 seconds
- Check the query profile in Snowsight → Activity → Query History
- Identify the bottleneck (spilling, full scan, exploding JOIN)
- Temporary fix - increase timeout:
  ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 600;
- Permanent fix - optimise the query (see Performance Optimisation)
Warehouse Suspended
Warehouse is suspended. Resume warehouse to execute queries.
If AUTO_RESUME = TRUE (which it should be), this error usually indicates a resource monitor has suspended the warehouse:
SHOW RESOURCE MONITORS;
-- Check if any monitor has hit its credit limit
To resume temporarily:
USE ROLE SYSADMIN;
ALTER WAREHOUSE <warehouse> RESUME;
To increase the credit limit:
USE ROLE ACCOUNTADMIN;
ALTER RESOURCE MONITOR <monitor> SET CREDIT_QUOTA = <new_limit>;
3. dbt Issues
Model Compilation Errors
Compilation Error in model <model_name>
- Check the error details - usually a missing ref, source, or macro
- Common causes:
  - Missing source definition in _sources.yml
  - Typo in {{ ref('model_name') }} or {{ source('name', 'table') }}
  - Package not installed: run dbt deps
Test Failures
Failure in test <test_name>
- Check what data failed:
  dbt test --select <test_name> --store-failures
  Failed rows are stored in the dbt_test__audit schema.
- Query the failures:
  SELECT * FROM ANALYTICS_DEV.DBT_TEST__AUDIT.<test_name>;
- Common causes:
  - Source data changed (new NULL values, duplicates)
  - Upstream model bug introduced in a recent PR
  - Test threshold too strict (e.g. freshness, accepted values)
Incremental Model Drift
If an incremental model is producing wrong results:
- Run a full refresh to rebuild from scratch:
  dbt run --select <model> --full-refresh
- Compare row counts before and after
- If the issue recurs, check the is_incremental() filter logic
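The row-count comparison in the second step can be made explicit with a small helper. This is a sketch under stated assumptions: `row_count_drift` and its 1% default `tolerance` are hypothetical, and the counts are expected to come from a `SELECT COUNT(*)` run before and after the full refresh.

```python
def row_count_drift(before: int, after: int, tolerance: float = 0.01) -> dict:
    """Compare row counts taken before and after a full refresh.

    `tolerance` is an assumed threshold (1% by default); tune it per model.
    """
    if before < 0 or after < 0:
        raise ValueError("row counts must be non-negative")
    delta = after - before
    if before == 0:
        # An empty starting table counts as drifted if the rebuild has any rows.
        relative = 0.0 if after == 0 else float("inf")
    else:
        relative = abs(delta) / before
    return {"delta": delta, "relative": relative, "drifted": relative > tolerance}
```

A large relative change after a full refresh suggests the incremental logic had been dropping or duplicating rows. A matching count does not prove the data is correct, so treat it as a first check only.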
Schema Changes
Compilation Error: column "new_column" does not exist
Source schemas evolve. When upstream tools add or rename columns:
- Check what changed in the source table:
  DESCRIBE TABLE <DATABASE>.<SCHEMA>.<TABLE>;
- Update the staging model to handle the new schema
- Update _sources.yml if table structure changed significantly
- Consider on_schema_change='append_new_columns' for incremental models that should absorb new columns automatically
4. Prefect Issues
Flow Run Failed
- Check the flow run logs in Prefect UI → Flow Runs → click the failed run
- Common causes:

  | Error Pattern | Cause | Fix |
  |---|---|---|
  | ModuleNotFoundError | Missing dependency | Add to pyproject.toml with uv add |
  | ConnectionRefusedError | Source API or database down | Wait and retry |
  | PermissionError | Missing credentials or expired token | Check Secrets Manager |
  | TimeoutError | Source API slow or unresponsive | Increase timeout or add retries |

- Retry the run if the failure is transient:
  - In Prefect UI, click Retry on the failed run
  - Or trigger a new run via CLI (the argument takes the form <flow-name>/<deployment-name>):
    prefect deployment run <flow-name>/production
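The causes listed above fall into two groups: transient failures that are worth retrying (ConnectionRefusedError, TimeoutError) and failures that need a fix first (ModuleNotFoundError, PermissionError). The sketch below shows that distinction in plain Python with exponential backoff. `call_with_retries` and its parameters are hypothetical names for illustration, not part of Prefect.

```python
import time

# Exception types treated as transient. ModuleNotFoundError and PermissionError
# are deliberately left out: retrying will not install a package or fix a token.
TRANSIENT = (ConnectionRefusedError, TimeoutError)


def call_with_retries(func, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func(), retrying transient errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except TRANSIENT:
            if attempt == attempts:
                raise  # out of attempts: surface the original error
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside a flow, Prefect's own `retries` and `retry_delay_seconds` arguments on tasks and flows are usually the better place to configure this. The helper only illustrates which errors are reasonable to retry.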
Worker Not Picking Up Runs
- Check worker status in Prefect UI → Work Pools → select pool → Workers tab
- If no workers are online:
  - Prefect Cloud: Check ECS service or the machine running the worker
  - Self-hosted: Check the worker container:
    docker compose logs prefect-worker
- Common cause - the worker lost connection to the Prefect API. Restart it.
Deployment Not Found
Deployment not found
The deployment was deleted or never created. Redeploy:
cd ~/projects/data/data-pipelines
prefect deploy --all
5. dlt Issues
Schema Evolution Errors
Schema has changed
dlt detects schema changes automatically. If a source API returns new fields:
- Check the schema file in .dlt/schemas/:
  cat .dlt/schemas/<pipeline_name>/schema.json
- If the change is expected, let dlt evolve the schema automatically (default behaviour)
- If the change is unexpected, investigate the source API for breaking changes
Credential Resolution Failures
ConfigFieldMissingException: Missing config field
dlt resolves credentials in this order: environment variables → secrets.toml → config.toml → custom providers (e.g. AWS Secrets Manager).
- Check .dlt/secrets.toml for local development
- Check AWS Secrets Manager for production
- Verify the secret path matches what the pipeline expects (section name in @dlt.source(section="..."))
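The resolution order described above is a first-match-wins lookup: a value set in an earlier provider shadows the same key in a later one. The sketch below is a simplified model of that behaviour and not dlt's implementation. `resolve_secret`, the provider names, and the key in the usage note are all hypothetical.

```python
def resolve_secret(key, providers):
    """Return (provider_name, value) from the first provider that defines key.

    `providers` is an ordered list of (name, mapping) pairs, highest priority first.
    """
    for name, values in providers:
        if key in values:
            return name, values[key]
    # Mirror the spirit of a missing-config error: say where we looked.
    searched = ", ".join(name for name, _ in providers)
    raise KeyError(f"{key!r} not found; searched: {searched}")
```

For instance, with providers `[("env", {}), ("secrets.toml", {"sources.my_api.api_key": "abc"})]`, looking up `"sources.my_api.api_key"` returns `("secrets.toml", "abc")`. If the same key were also present in `env`, that value would win. A stale environment variable can therefore override a correct secrets.toml entry.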
6. Infrastructure Issues
Terraform State Lock
Error acquiring the state lock
Another Terraform operation is running, or a previous operation was interrupted.
- Check who holds the lock:
  aws dynamodb scan \
    --table-name your-org-terraform-locks \
    --profile infrastructure-admin
- If the lock is stale (the operation is no longer running):
  terraform force-unlock <LOCK_ID>

Warning: Force Unlock
Only force-unlock if you are absolutely certain no other Terraform operation is running. Forcing an unlock while another process is applying can corrupt state.
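To read the scan output more easily, the lock metadata can be pulled out with a few lines of Python. This sketch assumes the lock table follows the layout used by Terraform's S3 backend, where the lock item carries an `Info` attribute holding a JSON string with `ID`, `Who`, `Operation`, and `Created` fields. Check this against your own scan output before relying on it. `lock_holders` is a hypothetical helper name.

```python
import json


def lock_holders(scan_output: str) -> list:
    """Extract lock metadata from `aws dynamodb scan` JSON output."""
    items = json.loads(scan_output).get("Items", [])
    holders = []
    for item in items:
        info_attr = item.get("Info")
        if not info_attr:
            continue  # rows without Info (e.g. state digests) are not locks
        info = json.loads(info_attr["S"])  # Info is assumed to be a JSON string
        holders.append({
            "lock_id": info.get("ID"),
            "who": info.get("Who"),
            "operation": info.get("Operation"),
            "created": info.get("Created"),
        })
    return holders
```

The `lock_id` value is what `terraform force-unlock` expects. The `who` and `created` fields help confirm with the lock owner that the operation is really finished.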
CI/CD Pipeline Failures
- Check GitHub Actions logs for the failing workflow
- Common causes:

  | Error | Fix |
  |---|---|
  | OIDC authentication failed | Check IAM role trust policy includes the repository |
  | Secret not found | Verify the secret exists in AWS Secrets Manager |
  | Terraform plan shows drift | Someone made manual changes - import or reconcile |
  | dbt build failed | Check test failures, schema changes |
ECS Service Unhealthy
For self-hosted services (Prefect, Airbyte, Lightdash):
- Check ECS service events in AWS Console → ECS → Clusters → Services
- Check CloudWatch Logs for the service's log group
- Common causes:
- Container failing health check → check application logs
- Out of memory → increase task definition memory
- Image pull failure → verify ECR repository or image tag
Verification
After resolving any issue:
- The immediate error is resolved
- Related pipelines and models run successfully
- No cascading failures in downstream systems
- Alerts have cleared (Prefect, PagerDuty, Slack)
- Root cause is understood and documented
Rollback
Most troubleshooting involves fixing forward rather than rolling back. If a fix introduces new issues:
- Revert the PR that introduced the fix
- Restore data using Snowflake Time Travel if needed
- Restart services that may be in a bad state
- Escalate if the issue is beyond your expertise
Escalation
- First contact: Data Engineering team in #data-eng Slack
- Snowflake platform issues: Snowflake Support (Enterprise accounts)
- AWS infrastructure: Infrastructure team lead
- Vendor-specific issues: Raise a support ticket with the vendor (Prefect, Airbyte, Lightdash)
See Also
- Logging and Debugging - Centralised logging setup
- Alerting and Incidents - Alert configuration and incident workflows
- Performance Optimisation - Query and pipeline performance
- Disaster Recovery - Data loss and infrastructure recovery