Writing Runbooks

On this page, you will:

Understand what a runbook is and why you need one
Learn the standard runbook structure
Write example runbooks for each repository
Establish a process for keeping runbooks current

Overview

A runbook is a step-by-step procedure for performing a specific operational task. Runbooks turn tribal knowledge into documented, repeatable processes. When an alert fires at 2 a.m., the on-call engineer should not need to reverse-engineer the fix from first principles; they should be able to follow the runbook.

┌─────────────────────────────────────────────────────────────────────────┐
│                    RUNBOOK CATEGORIES                                   │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ROUTINE OPERATIONS         INCIDENT RESPONSE         MAINTENANCE       │
│  ──────────────────         ─────────────────         ───────────       │
│                                                                         │
│  Add a new user             Failed Terraform deploy   Rotate creds      │
│  Add a data source          Pipeline backfill needed  Upgrade dbt       │
│  Add a dbt model            Snowflake credit spike    Patch Prefect     │
│  Grant role access          Data quality alert        Review costs      │
│  Configure new pipeline     Stale data detected       Prune old data    │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

When to Write a Runbook

Write a runbook when:

A task involves more than three steps
Someone other than the original author may need to perform it
The task involves production systems where mistakes have consequences
You find yourself explaining the same process to multiple people
An incident occurs and you want to be prepared for next time

You do not need a runbook for tasks that are self-explanatory from the code (e.g. "run terraform plan") or tasks that have only one step.

Runbook Structure

Every runbook follows the same structure. Consistency makes runbooks faster to use under pressure because engineers know exactly where to find each piece of information.

Template

# Runbook: [Task Name]

## Summary

One or two sentences explaining what this runbook covers and when to use it.

## When to Use

- Trigger condition 1 (e.g. "A new team member needs Snowflake access")
- Trigger condition 2 (e.g. "PagerDuty alert: snowflake_user_access_denied")

## Prerequisites

- [ ] Access: [what access is needed, e.g. "Terraform repository write access"]
- [ ] Tools: [what tools are needed, e.g. "AWS CLI configured with infrastructure-admin profile"]
- [ ] Context: [any information needed, e.g. "New user's email and role requirements"]

## Steps

### 1. [First Step]

[Detailed instructions with commands and expected output]

### 2. [Second Step]

[Continue with each step]

### 3. [Third Step]

[And so on]

## Verification

How to confirm the task completed successfully:

- [ ] Check 1
- [ ] Check 2

## Rollback

If something goes wrong, how to undo the changes:

1. [Rollback step 1]
2. [Rollback step 2]

## Escalation

If the runbook does not resolve the issue:

- **First contact**: [team or person, e.g. "Data Engineering team in #data-eng Slack"]
- **Escalation**: [next level, e.g. "Infrastructure team lead"]
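Because every runbook shares the same sections, the structure can be checked mechanically, for example in a pre-commit hook or a docs build. The following is a minimal sketch rather than part of any existing tooling: `REQUIRED_SECTIONS` and `missing_sections` are hypothetical names, and the check assumes each section is a level-2 Markdown heading spelled exactly as in the template.

```python
import re

# Section headings taken from the template, in the order they appear.
REQUIRED_SECTIONS = [
    "Summary",
    "When to Use",
    "Prerequisites",
    "Steps",
    "Verification",
    "Rollback",
    "Escalation",
]


def missing_sections(runbook_markdown: str) -> list[str]:
    """Return the required level-2 headings that the runbook text lacks."""
    # Match lines such as "## Summary"; "### 1. First Step" is ignored
    # because the third "#" is not whitespace.
    found = set(
        re.findall(r"^##[ \t]+(.+?)[ \t]*$", runbook_markdown, flags=re.MULTILINE)
    )
    return [name for name in REQUIRED_SECTIONS if name not in found]


draft = "# Runbook: Rotate Credentials\n\n## Summary\n\n## Steps\n"
print(missing_sections(draft))
# ['When to Use', 'Prerequisites', 'Verification', 'Rollback', 'Escalation']
```

An empty list means the runbook has every section; anything else names what still needs writing.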

Example Runbooks

Terraform: Adding a New Snowflake User

This example demonstrates a complete runbook for one of the most common operational tasks.

# Runbook: Add a New Snowflake User

## Summary

Add a new team member to Snowflake via Terraform, granting them appropriate
roles and warehouse access.

## When to Use

- A new team member joins and needs Snowflake access
- An existing user needs a different access level

## Prerequisites

- [ ] Access: Write access to the terraform repository
- [ ] Tools: Terraform CLI, AWS CLI with `infrastructure-admin` profile
- [ ] Context: User's name, email, desired role (developer or admin)

## Steps

### 1. Determine the User Category

| Category | Default Role | Default Warehouse | Example Username |
|----------|--------------|-------------------|------------------|
| Admin | SYSADMIN | DEVELOPER | JBLOGGS_ADMIN |
| Developer | ANALYTICS_DEVELOPER | DEVELOPER | JBLOGGS |

### 2. Add to users.auto.tfvars

Edit `snowflake/config/users.auto.tfvars` and add the user to the
appropriate list:

` ``hcl
developer_user_list = {
  # ... existing users ...
  JBLOGGS = {
    email        = "jane.bloggs@company.com"
    first_name   = "Jane"
    last_name    = "Bloggs"
    display_name = "Jane Bloggs"
  }
}
` ``

### 3. Run Terraform Plan

` ``sh
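cd snowflake/config
terraform init   # only needed on a fresh checkout or after provider changes
terraform plan
# Verify: the plan should only add resources for the new user;
# investigate any unexpected change or destroy before opening a PR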
` ``

### 4. Create Pull Request

` ``sh
git checkout -b add-user-jbloggs
git add snowflake/config/users.auto.tfvars
git commit -m "Add Jane Bloggs as Snowflake developer"
git push -u origin add-user-jbloggs
# Create PR via GitHub
` ``

### 5. After Merge

CI/CD runs `terraform apply` automatically. The user is created with a
temporary password.

### 6. Provide Credentials

1. Set the user's initial password via Snowflake UI (as USERADMIN)
2. Send credentials via 1Password secure sharing
3. Ask user to change password and enable MFA on first login

## Verification

- [ ] User appears in Snowflake: `SHOW USERS LIKE 'JBLOGGS';`
- [ ] User has correct role: `SHOW GRANTS TO USER JBLOGGS;`
- [ ] User can log in and run a test query

## Rollback

1. Revert the PR (GitHub → PR → Revert)
2. Merge the revert PR; CI/CD runs `terraform apply` to remove the user
3. Emergency only: `DROP USER JBLOGGS;` (as USERADMIN). Still revert the PR
   afterwards, otherwise the next `terraform apply` recreates the user

## Escalation

- **First contact**: Data Engineering team in #data-eng Slack
- **Escalation**: Infrastructure team lead
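Step 2 of the runbook above is mostly mechanical, so the map entry can be generated rather than typed by hand. This is a sketch under stated assumptions: the function names are hypothetical, and the username convention (first initial plus last name, upper-cased, with an `_ADMIN` suffix for admins) is only inferred from the `JBLOGGS` and `JBLOGGS_ADMIN` examples, so confirm it against your own naming rules.

```python
def snowflake_username(first_name: str, last_name: str, admin: bool = False) -> str:
    """Derive a username such as JBLOGGS or JBLOGGS_ADMIN.

    Assumption: first initial + last name, upper-cased. Names containing
    spaces or punctuation would need extra handling.
    """
    base = (first_name[0] + last_name).upper()
    return f"{base}_ADMIN" if admin else base


def user_entry(first_name: str, last_name: str, email: str, admin: bool = False) -> str:
    """Render one map entry in the shape used by users.auto.tfvars."""
    username = snowflake_username(first_name, last_name, admin)
    return (
        f"  {username} = {{\n"
        f'    email        = "{email}"\n'
        f'    first_name   = "{first_name}"\n'
        f'    last_name    = "{last_name}"\n'
        f'    display_name = "{first_name} {last_name}"\n'
        f"  }}"
    )


print(user_entry("Jane", "Bloggs", "jane.bloggs@company.com"))
```

The printed block still has to be pasted into the right list and reviewed through `terraform plan` and a pull request, as the runbook describes.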

Claude Skills and Runbooks

If you have the add-snowflake-user Claude skill configured (covered in the Claude Code Setup page), Claude can perform steps 2-4 automatically. The runbook still documents the full process for cases where Claude is not available or the task needs manual review.

Data Pipelines: Responding to a Failed Pipeline

# Runbook: Respond to a Failed Pipeline

## Summary

Diagnose and resolve a dlt pipeline failure reported by Prefect alerting.

## When to Use

- Prefect sends a Slack/PagerDuty alert for a failed flow run
- Data freshness SLA is at risk

## Prerequisites

- [ ] Access: Prefect Cloud dashboard or self-hosted UI
- [ ] Access: AWS CloudWatch Logs (for ECS-hosted workers)
- [ ] Access: Snowflake (to verify data state)

## Steps

### 1. Check the Flow Run in Prefect

1. Open Prefect UI → Flow Runs
2. Find the failed run and click into it
3. Read the error message and traceback

### 2. Identify the Failure Category

| Error Pattern | Likely Cause | Action |
|---------------|-------------|--------|
| Connection refused / timeout | Source API down | Wait and retry |
| Authentication error | Credentials expired | Rotate credentials |
| Schema change | Source schema evolved | Update dlt schema |
| Rate limit exceeded | Too many API calls | Increase backoff |
| Snowflake error | Warehouse suspended / permissions | Check Snowflake |

### 3. Retry the Flow Run

If the failure is transient (network, rate limit):

1. In Prefect UI, click **Retry** on the failed run
2. Monitor the retry

### 4. Fix and Redeploy (if code change needed)

1. Fix the issue in the pipeline code
2. Test locally: `python -m pipelines.<pipeline_name>`
3. Create PR, merge, deploy via CI/CD
4. Trigger a manual run in Prefect

## Verification

- [ ] Flow run completes successfully in Prefect
- [ ] Data appears in Snowflake with expected row counts
- [ ] No downstream dbt test failures

## Rollback

If the fix introduces new issues:

1. Revert the PR
2. Once CI/CD has redeployed the previous version, trigger a manual run in Prefect

## Escalation

- **First contact**: Data Engineering team in #data-eng Slack
- **Escalation**: On-call data engineer
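The failure-category table in step 2 of the runbook above can also be expressed as a small lookup, which is handy for pre-labelling alerts before a person reads the traceback. This is an illustrative sketch only: `FAILURE_CATEGORIES` and `classify_failure` are hypothetical names, and the regular expressions are guesses at typical error text rather than patterns taken from dlt or Prefect.

```python
import re

# (pattern, likely cause, action) rows mirroring the table in step 2.
# Patterns are checked in order, so put the more specific ones first.
FAILURE_CATEGORIES = [
    (r"connection refused|timed? ?out", "Source API down", "Wait and retry"),
    (r"authentication|unauthorized|401", "Credentials expired", "Rotate credentials"),
    (r"schema", "Source schema evolved", "Update dlt schema"),
    (r"rate limit|429", "Too many API calls", "Increase backoff"),
    (r"snowflake", "Warehouse suspended / permissions", "Check Snowflake"),
]


def classify_failure(error_text: str) -> tuple[str, str]:
    """Return (likely cause, action) for the first matching pattern."""
    for pattern, cause, action in FAILURE_CATEGORIES:
        if re.search(pattern, error_text, flags=re.IGNORECASE):
            return cause, action
    # Nothing matched: a person needs to read the traceback.
    return "Unknown", "Read the full traceback, then escalate if unclear"


print(classify_failure("ConnectionError: Connection refused"))
# ('Source API down', 'Wait and retry')
```

A match is only a starting hypothesis; the engineer still confirms the cause in the Prefect traceback before retrying or changing code.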

dbt: Running a Full Refresh

# Runbook: Run a Full Refresh for a dbt Model

## Summary

Rebuild a dbt incremental model from scratch, replacing all data.

## When to Use

- Incremental logic has changed and historical data needs reprocessing
- Source data was corrected retroactively
- Model has accumulated errors from incremental loads

## Prerequisites

- [ ] Access: dbt CLI configured with Snowflake credentials
- [ ] Context: Model name and reason for full refresh
- [ ] Timing: Schedule during low-usage window (full refresh uses more compute)

## Steps

### 1. Verify the Model is Incremental

` ``sh
dbt ls --select model_name --resource-type model --output json | grep -o '"materialized": *"[a-z_]*"'
# Should show: "materialized": "incremental"
` ``

### 2. Run the Full Refresh

` ``sh
dbt run --select model_name --full-refresh
` ``

### 3. Run Tests

` ``sh
dbt test --select model_name
` ``

### 4. Verify Row Counts

` ``sql
SELECT COUNT(*) FROM ANALYTICS.MARTS.MODEL_NAME;
-- Compare with expected count
` ``

## Verification

- [ ] Model rebuilt successfully (no dbt errors)
- [ ] All tests pass
- [ ] Row count is within expected range
- [ ] Downstream models still function correctly

## Rollback

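A full refresh on Snowflake rebuilds the table with `CREATE OR REPLACE`, which
drops the previous version, so Time Travel on the new table cannot reach the
old data. Within the retention period (at most one day for transient tables,
which dbt creates by default on Snowflake), restore the dropped version with
`UNDROP`:

` ``sql
-- Move the refreshed table aside so its name is free
ALTER TABLE ANALYTICS.MARTS.MODEL_NAME
  RENAME TO ANALYTICS.MARTS.MODEL_NAME_REFRESHED;
-- Restore the version that the full refresh replaced
UNDROP TABLE ANALYTICS.MARTS.MODEL_NAME;
` ``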

## Escalation

- **First contact**: Analytics Engineering team in #analytics Slack
- **Escalation**: dbt project owner
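When the full refresh is run often, steps 2 and 3 of the runbook above can be wrapped so that tests never run against a model that failed to build. This is a minimal sketch, assuming the `dbt` CLI is on the `PATH` and already configured; the function names are hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess


def full_refresh_commands(model_name: str) -> list[list[str]]:
    """Return the dbt commands from steps 2 and 3, in order."""
    return [
        ["dbt", "run", "--select", model_name, "--full-refresh"],
        ["dbt", "test", "--select", model_name],
    ]


def run_full_refresh(model_name: str, dry_run: bool = True) -> None:
    """Print each command and, unless dry_run is set, execute it."""
    for command in full_refresh_commands(model_name):
        print("$", " ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # which stops the loop before the next command runs.
            subprocess.run(command, check=True)


run_full_refresh("model_name")
# $ dbt run --select model_name --full-refresh
# $ dbt test --select model_name
```

Passing `dry_run=False` executes the commands for real; the row-count check in step 4 remains a manual comparison against the expected figure.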

Including Runbooks in the Docs Site

Runbooks live in each repository's docs/runbooks/ directory. The index.md page in that directory serves as a table of contents:

# Runbooks

Operational procedures for the Terraform infrastructure repository.

| Runbook | When to Use |
|---------|------------|
| [Add a Snowflake User](add-snowflake-user.md) | New team member needs access |
| [Rotate Credentials](rotate-credentials.md) | Scheduled rotation or compromise |
| [Failed Deployment](failed-deployment.md) | CI/CD terraform apply fails |
| [Add a Data Source](add-data-source.md) | New ingestion source needed |

Update each repository's docs/mkdocs.yml nav to include runbook pages as they are written.
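The index table is easy to let drift, so one option is to generate it from the runbook files themselves. The sketch below assumes each runbook follows the template (a `# Runbook: ...` title and a `## When to Use` section with bullet points) and uses the first bullet as the table text; `runbook_index_rows` and `render_index` are hypothetical names, not part of MkDocs.

```python
from pathlib import Path


def runbook_index_rows(runbook_dir: Path) -> list[str]:
    """Build one Markdown table row per runbook file in the directory."""
    rows = []
    for path in sorted(runbook_dir.glob("*.md")):
        if path.name == "index.md":
            continue
        title, trigger = path.stem, ""
        in_when_to_use = False
        for line in path.read_text(encoding="utf-8").splitlines():
            if line.startswith("# Runbook:"):
                title = line.removeprefix("# Runbook:").strip()
            elif line.startswith("## "):
                # Track whether we are inside the "When to Use" section.
                in_when_to_use = line.strip() == "## When to Use"
            elif in_when_to_use and line.startswith("- ") and not trigger:
                trigger = line[2:].strip()
        rows.append(f"| [{title}]({path.name}) | {trigger} |")
    return rows


def render_index(runbook_dir: Path) -> str:
    """Return the table portion of index.md for the given directory."""
    header = ["| Runbook | When to Use |", "|---------|-------------|"]
    return "\n".join(header + runbook_index_rows(runbook_dir))
```

Calling `print(render_index(Path("docs/runbooks")))` from a repository root would print a table to paste into `index.md`; the wording of each row may still deserve a manual edit.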

Keeping Runbooks Current

Runbooks go stale quickly if not maintained. Several practices help:

Test Runbooks Periodically

Run through each runbook at least once per quarter, even if there is no active need. This catches:

Commands that no longer work due to tool updates
Steps that assume configuration that has changed
Links to dashboards or tools that have moved
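A quarterly cadence is easier to keep if something reports which runbooks are overdue. The following sketch assumes you record the date each runbook was last tested (for example in front matter or a small tracking file); the function name, the 90-day threshold standing in for "once per quarter", and the sample dates are all illustrative.

```python
from datetime import date, timedelta


def runbooks_due_for_review(
    last_tested: dict[str, date],
    today: date,
    max_age: timedelta = timedelta(days=90),
) -> list[str]:
    """Return runbooks last tested more than max_age ago, oldest first."""
    overdue = [name for name, tested in last_tested.items() if today - tested > max_age]
    return sorted(overdue, key=lambda name: last_tested[name])


# Hypothetical tracking data
last_tested = {
    "add-snowflake-user.md": date(2024, 1, 10),
    "rotate-credentials.md": date(2024, 5, 2),
}
print(runbooks_due_for_review(last_tested, today=date(2024, 6, 1)))
# ['add-snowflake-user.md']
```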

Link Runbooks to Alerts

When creating Prefect automations, PagerDuty alerts, or Slack notifications, include a link to the relevant runbook:

# In Prefect automation
send_slack_notification(
    message=f"Pipeline {flow_name} failed. "
    f"Runbook: https://<your-org>.github.io/technical-documentation/"
    f"data-pipelines/runbooks/failed-pipeline/"
)
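The snippet above hard-codes one runbook URL. When several flows map to different runbooks, the link can be looked up instead. This is a sketch with assumed names: `RUNBOOKS_BY_FLOW`, the flow names, and the `stale-data` slug are hypothetical, and the base URL keeps the `<your-org>` placeholder from the example. It only builds the message text; sending it is left to whatever notification helper you already use.

```python
DOCS_BASE_URL = "https://<your-org>.github.io/technical-documentation/"

# Hypothetical mapping from flow name to (docs section, runbook slug).
RUNBOOKS_BY_FLOW = {
    "orders_pipeline": ("data-pipelines", "failed-pipeline"),
    "freshness_check": ("data-pipelines", "stale-data"),
}
# Fallback for flows without a dedicated runbook.
DEFAULT_RUNBOOK = ("data-pipelines", "failed-pipeline")


def runbook_url(flow_name: str) -> str:
    """Return the docs-site URL of the runbook for this flow."""
    section, slug = RUNBOOKS_BY_FLOW.get(flow_name, DEFAULT_RUNBOOK)
    return f"{DOCS_BASE_URL}{section}/runbooks/{slug}/"


def failure_message(flow_name: str) -> str:
    """Build the alert text, always ending with a runbook link."""
    return f"Pipeline {flow_name} failed. Runbook: {runbook_url(flow_name)}"


print(failure_message("freshness_check"))
```

Keeping the mapping in one place means a renamed or moved runbook needs a single change rather than an edit to every alert.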

Update After Incidents

After every incident, review the runbook used:

Did the steps work as documented?
Were any steps missing?
Did the verification steps catch the resolution?
Does the escalation path need updating?

Update the runbook in the same PR as any code fixes from the incident.

Summary

What You've Accomplished

Understood when and why to write runbooks
Learned the standard runbook structure (summary, when to use, prerequisites, steps, verification, rollback, escalation)
Written example runbooks for Terraform, data pipelines, and dbt
Established how to organise and maintain runbooks

What's Next

With documentation content in place, the next step is to set up CI/CD: validating documentation builds on pull requests and deploying the unified site to GitHub Pages automatically.

Continue to CI/CD Pipeline →