Runbook: Troubleshooting

Summary

Diagnose and resolve common issues across the data stack. This runbook provides systematic debugging approaches for each layer - Snowflake, dbt, Prefect, dlt, Airbyte, and infrastructure.

When to Use

A pipeline, model, or query is failing and the cause is unclear
An alert fires and the standard runbook doesn't resolve the issue
Something worked before and now doesn't

Prerequisites

Access: Relevant tool dashboards (Prefect, Snowflake, dbt Cloud)
Access: AWS CloudWatch Logs (for ECS-hosted services)
Context: Error message, affected component, when it last worked

Steps

1. General Debugging Approach

For any issue, follow this sequence:

Read the error message - most errors are descriptive
Check when it last worked - what changed since then?
Check the logs - Prefect UI, CloudWatch, Snowflake query history
Reproduce locally (if possible) - run the failing pipeline or model in dev
Fix, test, deploy - fix in a branch, test locally, create a PR

2. Snowflake Issues

Authentication Errors

Error	Cause	Fix
`Incorrect username or password`	Wrong credentials	Check `.dlt/secrets.toml` or Secrets Manager
`JWT token is invalid`	Expired or wrong key pair	Rotate the key (see Security Hardening)
`User is disabled`	User account disabled	Re-enable: `ALTER USER <name> SET DISABLED = FALSE;`
`IP not allowed`	Network policy blocking	Check network policy assignments and allowed IPs

Permission Errors

Insufficient privileges to operate on <object>

Check the user's current role:
```
SELECT CURRENT_USER(), CURRENT_ROLE();
```
Check what roles are granted:
```
SHOW GRANTS TO USER <username>;
```
Check what the role can access:
```
SHOW GRANTS TO ROLE <role_name>;
```

Common fix - the user needs a different role. Either switch roles or update Terraform:

-- Temporary fix
USE ROLE <correct_role>;

-- Permanent fix via Terraform
-- Add the role to user_additional_roles in users.tf or users.auto.tfvars

Query Timeouts

Statement reached its maximum execution time of 300 seconds

Check the query profile in Snowsight → Activity → Query History
Identify the bottleneck (spilling, full scan, exploding JOIN)

Temporary fix - increase timeout:

ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 600;

Permanent fix - optimise the query (see Performance Optimisation)

Warehouse Suspended

Warehouse is suspended. Resume warehouse to execute queries.

If AUTO_RESUME = TRUE (which it should be), this error usually indicates a resource monitor has suspended the warehouse:

SHOW RESOURCE MONITORS;
-- Check if any monitor has hit its credit limit

To resume temporarily:

USE ROLE SYSADMIN;
ALTER WAREHOUSE <warehouse> RESUME;

To increase the credit limit:

USE ROLE ACCOUNTADMIN;
ALTER RESOURCE MONITOR <monitor> SET CREDIT_QUOTA = <new_limit>;

3. dbt Issues

Model Compilation Errors

Compilation Error in model <model_name>

Check the error details - usually a missing ref, source, or macro
Common causes:
- Missing source definition in _sources.yml
- Typo in {{ ref('model_name') }} or {{ source('name', 'table') }}
- Package not installed: run dbt deps

Test Failures

Failure in test <test_name>

Check what data failed:
```
dbt test --select <test_name> --store-failures
```
Failed rows are stored in the dbt_test__audit schema.

Query the failures:

SELECT * FROM ANALYTICS_DEV.DBT_TEST__AUDIT.<test_name>;

Common causes:
- Source data changed (new NULL values, duplicates)
- Upstream model bug introduced in a recent PR
- Test threshold too strict (e.g. freshness, accepted values)

Incremental Model Drift

If an incremental model is producing wrong results:

Run a full refresh to rebuild from scratch:

dbt run --select <model> --full-refresh

Compare row counts before and after
If the issue recurs, check the is_incremental() filter logic

Schema Changes

Compilation Error: column "new_column" does not exist

Source schemas evolve. When upstream tools add or rename columns:

Check what changed in the source table:

DESCRIBE TABLE <DATABASE>.<SCHEMA>.<TABLE>;

Update the staging model to handle the new schema
Update _sources.yml if table structure changed significantly
Consider on_schema_change='append_new_columns' for incremental models that should absorb new columns automatically

4. Prefect Issues

Flow Run Failed

Check the flow run logs in Prefect UI → Flow Runs → click the failed run

Common causes:

Error Pattern	Cause	Fix
`ModuleNotFoundError`	Missing dependency	Add to `pyproject.toml` with `uv add`
`ConnectionRefusedError`	Source API or database down	Wait and retry
`PermissionError`	Missing credentials or expired token	Check Secrets Manager
`TimeoutError`	Source API slow or unresponsive	Increase timeout or add retries

Retry the run if the failure is transient:
- In Prefect UI, click Retry on the failed run
- Or trigger a new run via CLI:
```
prefect deployment run <deployment-name>/production
```

Worker Not Picking Up Runs

Check worker status in Prefect UI → Work Pools → select pool → Workers tab
If no workers are online:
- Prefect Cloud: Check ECS service or the machine running the worker
- Self-hosted: Check the worker container:
```
docker compose logs prefect-worker
```
Common cause - the worker lost connection to the Prefect API. Restart it.

Deployment Not Found

Deployment not found

The deployment was deleted or never created. Redeploy:

cd ~/projects/data/data-pipelines
prefect deploy --all

5. dlt Issues

Schema Evolution Errors

Schema has changed

dlt detects schema changes automatically. If a source API returns new fields:

Check the schema file in .dlt/schemas/:

cat .dlt/schemas/<pipeline_name>/schema.json

If the change is expected, let dlt evolve the schema automatically (default behaviour)
If the change is unexpected, investigate the source API for breaking changes

Credential Resolution Failures

ConfigFieldMissingException: Missing config field

dlt resolves credentials in this order: environment variables → secrets.toml → config.toml → custom providers (e.g. AWS Secrets Manager).

Check .dlt/secrets.toml for local development
Check AWS Secrets Manager for production
Verify the secret path matches what the pipeline expects (section name in @dlt.source(section="..."))

6. Infrastructure Issues

Terraform State Lock

Error acquiring the state lock

Another Terraform operation is running, or a previous operation was interrupted.

Check who holds the lock:

aws dynamodb scan \
    --table-name your-org-terraform-locks \
    --profile infrastructure-admin

If the lock is stale (the operation is no longer running):
```
terraform force-unlock <LOCK_ID>
```
Force Unlock

Only force-unlock if you are absolutely certain no other Terraform operation is running. Forcing an unlock while another process is applying can corrupt state.

CI/CD Pipeline Failures

Check GitHub Actions logs for the failing workflow