Observability Fundamentals

On this page, you will:

Understand the five dimensions of data quality
Learn SLIs, SLOs, and SLAs for data pipelines
Distinguish between observability, monitoring, and testing
Understand incident severity levels
Know when to alert, log, or ignore issues

Overview

Before building observability infrastructure, you need to understand the fundamentals: what makes data "good quality", how to measure pipeline reliability, and when to take action on issues.

This page covers the core concepts that underpin all observability practices in data engineering.

┌─────────────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY FUNDAMENTALS                           │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  Data Quality Dimensions          Reliability Metrics                  │
│  ───────────────────────          ───────────────────                  │
│                                                                         │
│  • Accuracy                       • SLI (Indicator)                    │
│  • Completeness                     "dbt run success rate"             │
│  • Consistency                    • SLO (Objective)                    │
│  • Timeliness                       "99% of runs succeed"              │
│  • Validity                       • SLA (Agreement)                    │
│                                     "We guarantee 99% uptime"          │
│                                                                         │
│  Incident Severity                Action Thresholds                    │
│  ─────────────────                ──────────────────                   │
│                                                                         │
│  • SEV1: Data outage              Alert → Immediate page-out           │
│  • SEV2: Quality degraded         Alert → Next business day            │
│  • SEV3: Minor issue              Log → Review weekly                  │
│  • SEV4: Cosmetic                 Log → Review monthly                 │
│                                                                         │
└─────────────────────────────────────────────────────────────────────────┘

The Five Dimensions of Data Quality

Data quality is multi-dimensional. A dataset can be accurate but late, or complete but inconsistent. Understanding these dimensions helps you design appropriate tests and monitors.

1. Accuracy

Definition: Data correctly represents the real-world entity or event it describes.

Examples:
- ✅ Good: Customer email is alice@example.com (matches reality)
- ❌ Bad: Customer email is bob@example.com (wrong person)

How to measure:
- Compare data to authoritative source (e.g., manual audit sample)
- Check calculations match expected results (e.g., revenue = quantity * price)
- Validate against business rules (e.g., discounts never exceed 100%)

dbt test example:

# Ensure order_total matches quantity * unit_price
# (model-level test; the model name "orders" is an assumption)
- name: orders
  tests:
    - dbt_utils.expression_is_true:
        expression: "abs(order_total - (quantity * unit_price)) < 0.01"

2. Completeness

Definition: All expected data is present; no missing values or records.

Examples:
- ✅ Good: All 1000 orders from yesterday loaded into warehouse
- ❌ Bad: Only 950 orders loaded (50 missing)

How to measure:
- Row count matches source system
- No unexpected NULL values in required columns
- All time periods have data (no gaps in daily data)

dbt test example:

# Ensure critical columns have no nulls
- name: customer_id
  tests:
    - not_null
    - dbt_expectations.expect_column_values_to_not_be_null:
        row_condition: "order_status != 'cancelled'"
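
The dbt tests above cover NULL values, but not gaps in daily data. A minimal Python sketch of a gap check follows; the function name `missing_days` and the sample dates are hypothetical, and it assumes you can already list the distinct dates that have loaded data.

```python
from datetime import date, timedelta


def missing_days(loaded_dates, start, end):
    """Return every calendar day in [start, end] that has no loaded data."""
    present = set(loaded_dates)
    total_days = (end - start).days + 1
    expected = (start + timedelta(days=i) for i in range(total_days))
    return [day for day in expected if day not in present]


# Hypothetical example: data for 3 January never arrived.
loaded = [date(2024, 1, 1), date(2024, 1, 2), date(2024, 1, 4)]
gaps = missing_days(loaded, date(2024, 1, 1), date(2024, 1, 4))
print(gaps)  # [datetime.date(2024, 1, 3)]
```

An empty result means every day in the window has at least one record; it says nothing about whether each day's row count is complete.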

3. Consistency

Definition: Data is uniform across systems and over time; no contradictions.

Examples:
- ✅ Good: Customer name is "Alice Smith" in both CRM and orders table
- ❌ Bad: Customer name is "Alice Smith" in CRM, "A. Smith" in orders

How to measure:
- Cross-system comparisons (CRM vs warehouse)
- Referential integrity (foreign keys exist)
- Temporal consistency (values don't change illogically)

dbt test example:

# Ensure foreign keys are valid
- name: customer_id
  tests:
    - relationships:
        to: ref('dim_customers')
        field: customer_id

4. Timeliness

Definition: Data is available when needed; minimal lag between event and availability.

Examples:
- ✅ Good: Yesterday's orders available by 6am today
- ❌ Bad: Yesterday's orders available at 2pm today (8 hour delay)

How to measure:
- Data freshness (max age of data)
- Pipeline SLAs (ingestion completed within X hours)
- Time-to-availability metrics

dbt test example:

# dbt source freshness check
sources:
  - name: raw_orders
    # Assumed load-timestamp column; freshness checks typically need one
    loaded_at_field: loaded_at
    freshness:
      warn_after: {count: 6, period: hour}
      error_after: {count: 12, period: hour}

5. Validity

Definition: Data conforms to defined formats, types, and rules.

Examples:
- ✅ Good: Email is user@domain.com (valid format)
- ❌ Bad: Email is not_an_email (invalid format)

How to measure:
- Type checking (dates are dates, integers are integers)
- Format validation (emails, phone numbers, postcodes)
- Range validation (age between 0 and 120)

dbt test example:

# Ensure values are within expected ranges
- name: order_total
  tests:
    - dbt_expectations.expect_column_values_to_be_between:
        min_value: 0
        max_value: 1000000

SLIs, SLOs, and SLAs

These metrics help you measure and communicate pipeline reliability.

SLI (Service Level Indicator)

Definition: A quantifiable measure of service quality.

Data pipeline examples:
- dbt run success rate: successful_runs / total_runs
- Data freshness: current_time - max(loaded_at)
- Pipeline duration: time(end) - time(start)
- Data quality: passed_tests / total_tests

How to calculate:

-- dbt run success rate over last 30 days
SELECT
    COUNT(CASE WHEN status = 'success' THEN 1 END) * 100.0 / COUNT(*) AS success_rate_pct
FROM dbt_run_history
WHERE run_date >= CURRENT_DATE() - INTERVAL '30 days';
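
The same SLIs can be computed outside the warehouse. The sketch below is illustrative only: the function names and sample values are assumptions, and it expects run statuses and timestamps to have been fetched already.

```python
from datetime import datetime, timezone


def success_rate_pct(statuses):
    """Percentage of runs whose status is 'success'; None if there are no runs."""
    if not statuses:
        return None
    successes = sum(1 for status in statuses if status == "success")
    return successes * 100.0 / len(statuses)


def freshness_hours(latest_loaded_at, now):
    """Age of the newest record in hours (now minus the latest load time)."""
    return (now - latest_loaded_at).total_seconds() / 3600


# Hypothetical data: 99 successful runs and 1 failure.
statuses = ["success"] * 99 + ["error"]
print(success_rate_pct(statuses))  # 99.0

now = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
latest = datetime(2024, 1, 2, 3, 0, tzinfo=timezone.utc)
print(freshness_hours(latest, now))  # 3.0
```

Returning None for an empty window avoids a division by zero and makes "no runs at all" distinguishable from "0% success".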

SLO (Service Level Objective)

Definition: Target value or range for an SLI. Your internal goal.

Data pipeline examples:
- dbt run success rate ≥ 99% (99% of runs succeed)
- Data freshness ≤ 4 hours (data available within 4 hours of event)
- Pipeline duration ≤ 30 minutes (95th percentile)
- Data quality ≥ 95% (95% of tests pass)

SLOs are:
- Internal targets (not customer-facing promises)
- Realistic (based on historical performance)
- Measurable (can be calculated from SLIs)
- Time-bound (measured over a period, e.g., 30 days)

Example SLO definition:

# SLOs for the dbt daily run
slos:
  - name: dbt_daily_run_success_rate
    sli: successful_dbt_runs / total_dbt_runs
    target: 0.99  # 99%
    period: 30_days

  - name: dbt_daily_run_duration
    sli: p95(dbt_run_duration_minutes)
    target: 30  # 30 minutes
    period: 30_days
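
To show how definitions like these might be evaluated, here is a small Python sketch. The helper names, the nearest-rank percentile method, and the sample durations are all assumptions made for illustration; they are not part of any particular tool.

```python
import math


def p95(values):
    """95th percentile by the nearest-rank method (expects a non-empty list)."""
    ordered = sorted(values)
    rank = math.ceil(95 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def slo_met(sli_value, target, higher_is_better):
    """Compare an SLI to its target in the direction that counts as 'good'."""
    if higher_is_better:
        return sli_value >= target
    return sli_value <= target


# Hypothetical measurements for the last 30 days.
success_rate = 0.992
durations_minutes = [18, 21, 19, 25, 22, 20, 24, 23, 28, 41]

results = {
    "dbt_daily_run_success_rate": slo_met(success_rate, 0.99, higher_is_better=True),
    "dbt_daily_run_duration": slo_met(p95(durations_minutes), 30, higher_is_better=False),
}
print(results)
```

With these sample values the success-rate SLO is met, while the duration SLO is missed because a single 41-minute run sets the 95th percentile of a ten-run sample.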

SLA (Service Level Agreement)

Definition: A contractual commitment to customers. What you promise.

Data pipeline examples:
- "Marketing dashboard updated by 8am daily" (customer: marketing team)
- "Financial reports available within 24 hours of month-end" (customer: finance team)
- "Customer data deleted within 30 days of request" (customer: GDPR compliance)

SLAs are:
- External commitments (to stakeholders, customers, regulators)
- Contractual (may have penalties if violated)
- Looser than SLOs (e.g., SLO = 99%, SLA = 95%), so that missing the stricter internal target still leaves a buffer before the external promise is broken

Example SLA:

## Marketing Dashboard SLA

**Commitment:** Marketing KPI dashboard refreshed by 08:00 UTC daily.

**Scope:** Includes metrics from Airbyte (HubSpot), dbt (fct_contacts), Lightdash (dashboard)

**Target:** 95% of days (28.5 days per month)

**Penalties:** None (internal SLA)

**Escalation:** If not met, page on-call data engineer.
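
Compliance with an SLA like this one can be checked from a list of refresh completion times. The following is a rough sketch: it assumes one timezone-aware timestamp per day, completed on the day it covers, and the function name and sample data are hypothetical.

```python
from datetime import datetime, time, timezone

DEADLINE_UTC = time(8, 0)  # 08:00 UTC, as in the SLA above
SLA_TARGET = 0.95          # 95% of days


def sla_compliance(refresh_times):
    """Fraction of daily refreshes that finished at or before the deadline."""
    on_time = sum(
        1 for finished in refresh_times
        if finished.astimezone(timezone.utc).time() <= DEADLINE_UTC
    )
    return on_time / len(refresh_times)


# Hypothetical month: 30 refreshes, one of them late (09:15 UTC on day 29).
refreshes = [datetime(2024, 1, d, 7, 30, tzinfo=timezone.utc) for d in range(1, 29)]
refreshes.append(datetime(2024, 1, 29, 9, 15, tzinfo=timezone.utc))
refreshes.append(datetime(2024, 1, 30, 7, 45, tzinfo=timezone.utc))

compliance = sla_compliance(refreshes)
print(f"{compliance:.1%} on time, SLA met: {compliance >= SLA_TARGET}")
```

One late day out of 30 still satisfies a 95% target; a second late day would drop compliance to about 93.3% and breach it.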

Relationship Between SLI, SLO, SLA

SLI (Measurement)
    ↓
SLO (Internal Target)
    ↓
SLA (External Promise)

Example:
SLI: dbt run success rate = 99.2% (last 30 days)
SLO: Target ≥ 99%
SLA: Promise to marketing team: "Dashboard updated 95% of days"

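Error Budget: If your SLO is 99%, you have a 1% error budget. For a time-based SLO measured over 30 days (720 hours), that is 7.2 hours of allowable downtime; for a run-based SLO such as dbt run success rate, it is 1 failed run in every 100.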
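
The error budget arithmetic is simple enough to check directly: 1% of a 30-day window is 0.01 × 720 hours = 7.2 hours. A short sketch, with hypothetical function names and a made-up downtime figure:

```python
def error_budget_hours(slo_target, period_days):
    """Allowable downtime in hours for a time-based SLO over the period."""
    period_hours = period_days * 24
    return (1 - slo_target) * period_hours


def budget_remaining_hours(slo_target, period_days, downtime_hours):
    """Hours of error budget left after the downtime already incurred."""
    return error_budget_hours(slo_target, period_days) - downtime_hours


budget = error_budget_hours(0.99, 30)
print(round(budget, 2))  # 7.2

# Hypothetical: 5 hours of downtime so far this period.
print(round(budget_remaining_hours(0.99, 30, 5.0), 2))  # 2.2
```

A negative remaining budget means the SLO has already been missed for the period.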

Observability vs Monitoring vs Testing

These concepts overlap but have distinct purposes.

Testing

Definition: Validate data against known rules at a specific point in time.

Characteristics:
- Proactive: define rules upfront
- Binary: pass or fail
- Known rules: test what you know to check

Examples:
- dbt test: customer_id must be unique
- dbt test: order_total must be ≥ 0
- Python test: assert len(df) > 0

When to use: You know what "good" looks like and can define explicit rules.

Monitoring

Definition: Track metrics over time and alert when thresholds are crossed.

Characteristics:
- Reactive: respond to threshold violations
- Time-series: track trends, not single points
- Known thresholds: alert when metric exceeds/drops below limit

Examples:
- Alert if dbt run duration > 60 minutes
- Alert if row count drops 50% from 7-day average
- Alert if Snowflake credit usage > $100/day

When to use: You know what metrics matter and can define "normal" ranges.
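
As an illustration of a threshold monitor, the row-count example can be sketched in a few lines of Python. The function name, the 50% default, and the sample counts are assumptions; a real monitor would read the history from the warehouse or an orchestration tool.

```python
from statistics import mean


def row_count_dropped(history, today, max_drop=0.5):
    """True if today's row count is at least max_drop below the mean of history."""
    baseline = mean(history)
    if baseline == 0:
        return False  # no meaningful baseline to compare against
    drop = (baseline - today) / baseline
    return drop >= max_drop


# Hypothetical counts for the previous 7 days.
last_7_days = [1000, 980, 1020, 990, 1010, 1005, 995]

print(row_count_dropped(last_7_days, today=450))  # True: about a 55% drop
print(row_count_dropped(last_7_days, today=900))  # False: about a 10% drop
```

Comparing against a rolling mean rather than a fixed number lets the threshold follow gradual growth in the data.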

Observability

Definition: Understand system state from outputs; explore unknowns with full context.

Characteristics:
- Exploratory: ask new questions after incidents
- Context-rich: combines logs, metrics, lineage, metadata
- Unknown unknowns: investigate issues you didn't anticipate

Examples:
- "Why did revenue drop 20% last Tuesday?" → Use lineage to trace upstream
- "Which dashboard queries are slowest?" → Analyse query logs + BI tool metadata
- "What broke when we changed this dbt model?" → Impact analysis via lineage

When to use: Debugging complex issues, root cause analysis, understanding cascading failures.

Comparison Table

| Aspect  | Testing                       | Monitoring                   | Observability               |
|---------|-------------------------------|------------------------------|-----------------------------|
| Purpose | Validate known rules          | Track known metrics          | Explore unknowns            |
| Timing  | Point-in-time                 | Continuous                   | Post-incident               |
| Output  | Pass/Fail                     | Metrics, alerts              | Insights, root causes       |
| Tools   | dbt tests, Great Expectations | Prefect, Elementary, Datadog | OpenMetadata, logs, lineage |
| Example | "Is customer_id unique?"      | "Is row count stable?"       | "Why did the pipeline fail?" |

You need all three:
- Tests catch known bad data
- Monitoring alerts you to anomalies
- Observability helps you debug and learn

Incident Severity Levels

Not all issues are equal. Classify incidents by severity to prioritize response.

SEV1 (Critical) - Data Outage

Definition: Core data pipelines broken; critical dashboards unavailable.

Examples:
- dbt daily run failed; all dashboards stale
- Snowflake warehouse suspended due to credit limit
- Airbyte sync broken for primary revenue data source

Response:
- Immediate page-out (call on-call engineer)
- Drop everything to resolve
- Target resolution: < 2 hours
- Post-incident review: Always

Alert channel: PagerDuty (page), Slack (urgent), Email

SEV2 (High) - Quality Degraded

Definition: Data available but quality issues; non-critical dashboards affected.

Examples:
- dbt test failures (data loaded but may be incorrect)
- Anomaly detected: row count 30% lower than expected
- Secondary data source (e.g., product catalogue) stale

Response:
- Alert on-call during business hours
- Investigate within 4 hours
- Target resolution: < 1 business day
- Post-incident review: If recurrent

Alert channel: Slack (non-urgent), Email

SEV3 (Medium) - Minor Issue

Definition: Edge case failure; minimal user impact.

Examples:
- Single dbt test failed on a rarely-used model
- Log warnings (not errors)
- Performance degradation (queries 2x slower than normal, but still < 10 seconds)

Response:
- Log for review
- Investigate weekly
- Target resolution: < 1 week
- Post-incident review: No

Alert channel: Log only (no Slack/email)

SEV4 (Low) - Cosmetic

Definition: Documentation, cosmetic issues; no functional impact.

Examples:
- dbt model missing description
- Dashboard title typo
- Log noise (non-actionable warnings)

Response:
- Backlog ticket
- Fix opportunistically
- Target resolution: No deadline
- Post-incident review: No

Alert channel: None (create Jira/Linear ticket)
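
One way to keep these levels consistent across tools is to encode them as data. The sketch below mirrors the channels and targets listed above; the class and field names are hypothetical, and an empty channel tuple stands for "log or ticket only".

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    SEV1 = "Critical: data outage"
    SEV2 = "High: quality degraded"
    SEV3 = "Medium: minor issue"
    SEV4 = "Low: cosmetic"


@dataclass(frozen=True)
class ResponsePolicy:
    channels: tuple          # where notifications are sent
    target_resolution: str   # how quickly the issue should be fixed
    review: str              # when a post-incident review is held


POLICIES = {
    Severity.SEV1: ResponsePolicy(("PagerDuty", "Slack", "Email"), "< 2 hours", "always"),
    Severity.SEV2: ResponsePolicy(("Slack", "Email"), "< 1 business day", "if recurrent"),
    Severity.SEV3: ResponsePolicy((), "< 1 week", "no"),
    Severity.SEV4: ResponsePolicy((), "no deadline", "no"),
}


def should_page(severity):
    """Only severities routed to PagerDuty page the on-call engineer."""
    return "PagerDuty" in POLICIES[severity].channels


print(should_page(Severity.SEV1))  # True
print(should_page(Severity.SEV2))  # False
```

Keeping the policy in one table makes it easy to review and to change without touching alerting code.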

When to Alert, Log, or Ignore

Alert fatigue is real. Too many alerts → ignored alerts → missed critical issues.

Alert (Send to Slack/PagerDuty/Email)

When:
- SEV1 or SEV2 incidents
- Immediate action required
- User-facing impact (dashboards broken, data stale)
- SLA at risk

Examples:
- dbt run failed
- Data freshness > 12 hours (SLA is 6 hours)
- Snowflake Resource Monitor hit 90% of budget
- More than 3 dbt tests failed

Alert criteria:
- Actionable: someone can fix it now
- Urgent: needs attention within hours
- User-facing: affects stakeholders

Log (Write to CloudWatch/File; Review Periodically)

When:
- SEV3 or SEV4 incidents
- Informational (not actionable)
- Debugging context (for future incidents)
- Performance metrics (for trend analysis)

Examples:
- dbt test passed (log for audit trail)
- Query took 5 seconds (within normal range)
- Prefect task needed several retries before succeeding (worth tracking as a trend)
- Row count within expected range but at the low end

Log criteria:
- Not urgent: can review later
- Informational: good to know, not actionable
- Trending: useful for pattern detection

Ignore (Don't Alert or Log)

When:
- Expected behavior
- Transient issues that auto-resolve
- Noise (e.g., connection retries that succeed)

Examples:
- Snowflake warehouse auto-suspended after idle time (expected)
- Prefect task retried once and succeeded (handled automatically)
- Log message "Query compiled successfully" (too noisy, not useful)

Ignore criteria:
- Expected: normal system behavior
- Auto-resolved: transient, self-healing
- Too frequent: would create noise

Decision Tree

Issue detected
    │
    ├─ User-facing impact? ─── YES ──▶ Alert (SEV1/SEV2)
    │                           NO
    │                            ↓
    ├─ Actionable now? ──────── YES ──▶ Alert (SEV2)
    │                           NO
    │                            ↓
    ├─ Useful for debugging? ── YES ──▶ Log (SEV3/SEV4)
    │                           NO
    │                            ↓
    └─ Ignore (noise, expected behavior)
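
The decision tree translates directly into a small function. This is a sketch only: the function name and the three boolean inputs are assumptions, and in practice each answer would come from your own classification rules.

```python
def triage(user_facing, actionable_now, useful_for_debugging):
    """Walk the decision tree top to bottom and return 'alert', 'log', or 'ignore'."""
    if user_facing:
        return "alert"
    if actionable_now:
        return "alert"
    if useful_for_debugging:
        return "log"
    return "ignore"


# A stale dashboard is user-facing, so it alerts regardless of the other answers.
print(triage(user_facing=True, actionable_now=False, useful_for_debugging=False))   # alert
# A slow but acceptable query is only useful as debugging context.
print(triage(user_facing=False, actionable_now=False, useful_for_debugging=True))   # log
# An expected warehouse auto-suspend is noise.
print(triage(user_facing=False, actionable_now=False, useful_for_debugging=False))  # ignore
```

The order of the checks matters: earlier questions take priority, so an issue with user-facing impact is never downgraded to a log entry.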

Summary

You've learned observability fundamentals:

Five data quality dimensions — Accuracy, Completeness, Consistency, Timeliness, Validity
SLIs, SLOs, SLAs — Measure reliability (SLI), set targets (SLO), promise uptime (SLA)
Observability vs Monitoring vs Testing — Explore (observability), track (monitoring), validate (testing)
Incident severity — SEV1 (outage), SEV2 (degraded), SEV3 (minor), SEV4 (cosmetic)
Alert thresholds — Alert (urgent), Log (review later), Ignore (noise)

These concepts underpin all observability tools and practices. You'll apply them throughout this section when configuring dbt tests, Elementary anomaly detection, and Prefect monitoring.

What's Next

Apply these fundamentals to data quality testing with dbt.

Continue to Data Quality with dbt →