Finishing Up

On this page, you will:

Review what you've built in the Observability section
Understand the observability maturity model
Learn when to upgrade from budget to premium tools
Explore next steps beyond observability
Access additional resources and community support

Summary

Congratulations! You've implemented a comprehensive observability stack for your modern data platform.

Observability Components

Observability Stack ✅
├── Data Quality
│   ├── dbt tests (schema.yml files with generic + custom tests) ✅
│   ├── dbt_expectations package (statistical validation) ✅
│   └── Elementary
│       ├── Elementary dbt package (test results tracking) ✅
│       ├── Elementary CLI (anomaly detection) ✅
│       └── Elementary UI (self-hosted or Elementary Cloud) ✅
│
├── Data Catalog
│   └── OpenMetadata (self-hosted on ECS)
│       ├── Snowflake connector (metadata extraction) ✅
│       ├── dbt connector (lineage from manifest.json) ✅
│       ├── Prefect connector (pipeline metadata) ✅
│       └── UI (search, lineage, data quality dashboard) ✅
│
├── Pipeline Monitoring
│   ├── Prefect Cloud UI (flow runs, states, logs) ✅
│   ├── Prefect automations (retry on failure, Slack alerts) ✅
│   └── Snowflake query history (performance profiling) ✅
│
├── Cost Monitoring
│   ├── Snowflake Resource Monitors (budget alerts) ✅
│   ├── AWS Cost Explorer (infrastructure spend) ✅
│   └── Cost allocation tags (Terraform-managed resources) ✅
│
└── Logging & Debugging
    ├── CloudWatch Logs (ECS services: Prefect, Airbyte, Lightdash) ✅
    ├── Snowflake query logs (QUERY_HISTORY view) ✅
    └── dbt logs (run_results.json, compiled SQL) ✅

Key Capabilities

You can now:

Detect issues proactively:
- Elementary catches volume, freshness, and schema anomalies
- dbt tests validate data quality on every run
- Prefect monitors pipeline health in real time

Debug issues quickly:
- OpenMetadata lineage traces data from source to dashboard
- Centralised logs in CloudWatch show exactly what failed
- Snowflake Query Profile identifies slow queries

Control costs:
- Snowflake Resource Monitors prevent budget overruns
- AWS Cost Explorer tracks infrastructure spending
- Cost allocation tags enable team-level chargeback

Respond to incidents:
- Runbooks provide step-by-step resolution guides
- On-call rotations ensure 24/7 coverage for critical alerts
- Post-incident reviews prevent recurrence

Observability Maturity Model

Your observability practice will evolve over time. Here's a maturity model to guide your growth.

Level 1: Reactive (Day 1-30)

Characteristics:
- Manual monitoring (checking dashboards daily)
- Reacting to user reports of issues
- Basic dbt tests (not_null, unique)
- No automated alerting

What you've outgrown: you moved past this stage when you implemented automated monitoring and alerting.

Level 2: Proactive (Day 30-90) — You are here

Characteristics:
- Automated alerting for pipeline failures
- dbt tests on all critical tables
- Elementary anomaly detection running
- Slack/email notifications configured
- Basic runbooks for common incidents

Next steps:
- Expand test coverage to 80%+ of models
- Add SLOs for critical pipelines
- Refine anomaly detection to reduce false positives
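The 80% coverage target can be measured from dbt's manifest.json. The following is a minimal sketch, assuming the manifest has already been loaded into a dict and that "coverage" means the share of models with at least one test attached. The helper name `model_test_coverage` and the sample node names are hypothetical.

```python
def model_test_coverage(manifest: dict) -> float:
    """Return the share of dbt models that have at least one test attached."""
    nodes = manifest.get("nodes", {})
    models = {uid for uid, node in nodes.items() if node.get("resource_type") == "model"}
    tested = set()
    for node in nodes.values():
        if node.get("resource_type") == "test":
            # A test node lists the nodes it checks under depends_on.nodes.
            tested.update(set(node.get("depends_on", {}).get("nodes", [])) & models)
    return len(tested) / len(models) if models else 0.0


# Hypothetical manifest fragment: two models, one of them tested.
sample_manifest = {
    "nodes": {
        "model.shop.orders": {"resource_type": "model"},
        "model.shop.customers": {"resource_type": "model"},
        "test.shop.not_null_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.shop.orders"]},
        },
    }
}

print(f"Model test coverage: {model_test_coverage(sample_manifest):.0%}")
```

In practice you would load `target/manifest.json` with `json.load` and run the same function after each dbt build.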

Level 3: Predictive (Day 90-180)

Characteristics:
- Machine learning-based anomaly detection
- Forecasting data quality issues before they occur
- Comprehensive SLI/SLO tracking
- Automated remediation (auto-retry, auto-rollback)
- Detailed post-incident reviews for all SEV1/SEV2 incidents

Tools to add:
- Monte Carlo (ML-based data observability)
- Datadog (unified monitoring across infrastructure and applications)
- Custom ML models for anomaly prediction

Level 4: Optimised (Day 180+)

Characteristics:
- Data quality SLAs with stakeholders
- Continuous optimisation based on metrics
- Cross-team collaboration on data quality
- Automated data quality scoring
- Self-healing pipelines

Advanced practices:
- Data contracts between teams
- Data product thinking (treat datasets as products with SLAs)
- Automated incident classification and routing
- Predictive capacity planning

How to Progress

From Level 2 to Level 3:
1. Track SLIs (success rate, latency, data freshness) for 3 months
2. Set SLOs based on historical performance (e.g., 99% uptime)
3. Implement automated remediation (Prefect automations for retries)
4. Expand anomaly detection to all fact/dimension tables
5. Conduct post-incident reviews for every SEV1/SEV2 incident
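Steps 1 and 2 above come down to comparing an observed SLI with a candidate SLO. The sketch below is one way to do that, assuming you already have run counts for the tracking window. The function name `error_budget`, the sample counts, and the framing of "failures remaining" as an error budget are illustrative assumptions, not part of the stack built in this section.

```python
def error_budget(total_runs: int, failed_runs: int, slo: float = 0.99) -> dict:
    """Compare an observed success rate (the SLI) with a target SLO."""
    if total_runs <= 0:
        raise ValueError("total_runs must be positive")
    sli = (total_runs - failed_runs) / total_runs
    # Number of failures the SLO tolerates over this many runs.
    allowed_failures = total_runs * (1 - slo)
    return {
        "sli": sli,
        "slo_met": sli >= slo,
        "failures_remaining": allowed_failures - failed_runs,
    }


# Hypothetical three-month window: 2,000 runs, 12 of which failed.
report = error_budget(total_runs=2000, failed_runs=12)
print(f"SLI: {report['sli']:.2%}, SLO met: {report['slo_met']}")
print(f"Failures left in budget: {report['failures_remaining']:.0f}")
```

If the historical SLI sits well below the candidate SLO, pick a lower target first and tighten it as reliability improves.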

From Level 3 to Level 4:
1. Define data SLAs with business stakeholders
2. Implement data contracts (schemas, quality guarantees)
3. Build custom ML models for anomaly prediction
4. Create data product teams responsible for specific datasets
5. Measure and report on data quality KPIs monthly

When to Upgrade Tools

You've built the budget approach (~$80-100/month for observability). Here's when to upgrade to premium tools.

Elementary Cloud ($50+/month)

Upgrade when:
- Self-hosted Elementary UI requires too much maintenance
- Team grows to 10+ people (managed service worth the cost)
- Need longer than 7-day log retention

Benefits:
- Zero infrastructure management
- Automatic updates
- Hosted at https://your-org.elementary-data.com

Budget impact: ~+$50/month

Monte Carlo ($$$, quote-based)

Upgrade when:
- Elementary produces too many false positives (Monte Carlo uses ML to reduce noise)
- Need lineage across 10+ data sources
- Require automated incident classification
- Enterprise compliance needs (SOC 2, GDPR auditing)

Benefits:
- ML-based anomaly detection (fewer false positives)
- Multi-cloud support (Snowflake, BigQuery, Redshift, Databricks)
- Automated root cause analysis
- Field-level lineage

Budget impact: ~$1,000-5,000/month depending on data volume

Datadog ($15+/host/month)

Upgrade when:
- Need unified monitoring for data platform + application infrastructure
- Want APM (Application Performance Monitoring) for ECS services
- Require advanced alerting with ML-based thresholds
- Large engineering team (50+ people) already using Datadog

Benefits:
- Unified dashboards for infrastructure, applications, and logs
- APM for deep performance profiling
- Advanced anomaly detection
- Integrations with 500+ tools

Budget impact: ~$200-500/month for data platform infrastructure

Great Expectations Cloud ($$$, quote-based)

Upgrade when:
- Data science team wants to define expectations in Python
- Need collaborative expectation editing (UI for non-engineers)
- Require centralised expectation management across teams
- Want integration with dbt Cloud

Benefits:
- Managed Great Expectations platform
- UI for creating and editing expectations
- Integration with data catalogs
- Team collaboration features

Budget impact: ~$500-2,000/month

Stay on Budget Build If:

Total team size <10 people
Comfortable managing self-hosted infrastructure
Cost-conscious (startup, early-stage)
Elementary + dbt tests meet 90% of needs

As a rough rule of thumb, the budget build provides about 80% of the value at about 10% of the cost of the premium tools above. Upgrade only when specific pain points justify the expense.

Observability Metrics to Track

Measure observability effectiveness with these KPIs:

1. Pipeline Uptime

Definition: Percentage of pipeline runs that succeed

Target: ≥ 99% (SLO)

Calculation:

SELECT
    COUNT(CASE WHEN status = 'success' THEN 1 END) * 100.0 / NULLIF(COUNT(*), 0) AS uptime_pct  -- NULLIF avoids a division-by-zero error when no runs exist
FROM dbt_run_history
WHERE run_date >= CURRENT_DATE() - INTERVAL '30 days';

Improvement actions:
- If <95%: Focus on fixing flaky tests and unreliable API connections
- If 95-99%: Optimise retry logic and add circuit breakers
- If >99%: Maintain current practices
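The same calculation and thresholds can be expressed outside the warehouse, for example in a scheduled report. This is a minimal sketch, assuming run statuses are available as a list of strings. The function names and the sample data are hypothetical, and a value of exactly 99% is treated as part of the 95-99% band.

```python
def uptime_pct(statuses: list[str]) -> float:
    """Percentage of runs whose status is 'success'."""
    if not statuses:
        raise ValueError("no runs in the selected window")
    return sum(1 for s in statuses if s == "success") * 100.0 / len(statuses)


def uptime_action(pct: float) -> str:
    """Map an uptime percentage to the improvement actions listed above."""
    if pct < 95:
        return "Fix flaky tests and unreliable API connections"
    if pct <= 99:
        return "Optimise retry logic and add circuit breakers"
    return "Maintain current practices"


# Hypothetical 30-day window: 97 successful runs and 3 failures.
runs = ["success"] * 97 + ["error"] * 3
pct = uptime_pct(runs)
print(f"Uptime: {pct:.1f}% -> {uptime_action(pct)}")
```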

2. Mean Time to Detection (MTTD)

Definition: Average time from issue occurring to being detected

Target: < 1 hour for SEV1, < 4 hours for SEV2

Measurement:
- Track time from pipeline failure to alert sent
- Review incident logs monthly

Improvement actions:
- Add Elementary freshness tests to detect stale data faster
- Increase anomaly detection frequency (hourly instead of daily)
- Add real-time monitors for critical tables

3. Mean Time to Resolution (MTTR)

Definition: Average time from detection to resolution

Target: < 2 hours for SEV1, < 1 day for SEV2

Measurement:
- PagerDuty tracks this automatically
- Manually log for non-PagerDuty incidents

Improvement actions:
- Create runbooks for top 10 incident types
- Implement automated remediation where possible
- Improve logging for faster debugging
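For incidents you log manually, MTTD and MTTR can be computed from three timestamps per incident. The sketch below assumes each incident is a dict with `severity`, `occurred`, `detected`, and `resolved` fields; that record shape, the function name, and the sample incidents are assumptions for illustration. The targets are the ones stated for each metric above.

```python
from datetime import datetime, timedelta
from statistics import mean

# (MTTD target, MTTR target) per severity, taken from the targets above.
TARGETS = {
    "SEV1": (timedelta(hours=1), timedelta(hours=2)),
    "SEV2": (timedelta(hours=4), timedelta(days=1)),
}


def incident_metrics(incidents: list[dict]) -> dict:
    """Return mean time to detection and resolution for each severity."""
    result = {}
    for severity, (mttd_target, mttr_target) in TARGETS.items():
        rows = [i for i in incidents if i["severity"] == severity]
        if not rows:
            continue
        mttd = timedelta(seconds=mean(
            (i["detected"] - i["occurred"]).total_seconds() for i in rows))
        mttr = timedelta(seconds=mean(
            (i["resolved"] - i["detected"]).total_seconds() for i in rows))
        result[severity] = {
            "mttd": mttd,
            "mttr": mttr,
            "mttd_ok": mttd < mttd_target,
            "mttr_ok": mttr < mttr_target,
        }
    return result


# Hypothetical incident log with two SEV1 incidents.
log = [
    {"severity": "SEV1", "occurred": datetime(2024, 3, 1, 9, 0),
     "detected": datetime(2024, 3, 1, 9, 30), "resolved": datetime(2024, 3, 1, 10, 30)},
    {"severity": "SEV1", "occurred": datetime(2024, 3, 8, 14, 0),
     "detected": datetime(2024, 3, 8, 14, 50), "resolved": datetime(2024, 3, 8, 18, 50)},
]
for severity, m in incident_metrics(log).items():
    print(severity, "MTTD", m["mttd"], "ok" if m["mttd_ok"] else "over target",
          "| MTTR", m["mttr"], "ok" if m["mttr_ok"] else "over target")
```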

4. Data Quality Score

Definition: Percentage of dbt tests that pass

Target: ≥ 95%

Calculation:

SELECT
    SUM(CASE WHEN status = 'pass' THEN 1 ELSE 0 END) * 100.0 / NULLIF(COUNT(*), 0) AS quality_score_pct
FROM elementary.elementary_test_results  -- Elementary's test results table; adjust the schema name to your deployment
WHERE detected_at >= CURRENT_DATE() - INTERVAL '7 days';

Improvement actions:
- If <90%: Fix failing tests or adjust thresholds
- If 90-95%: Expand test coverage to untested models
- If >95%: Add anomaly detection for proactive monitoring

5. Cost Efficiency

Definition: Snowflake credit cost per row processed

Target: Stable or decreasing over time

Calculation:

WITH monthly_stats AS (
    SELECT
        DATE_TRUNC('month', run_date) AS month,
        SUM(credits_used) AS total_credits,
        SUM(rows_processed) AS total_rows
    FROM dbt_run_history
    GROUP BY month
)

SELECT
    month,
    total_credits,
    total_rows,
    total_credits / NULLIF(total_rows, 0) AS credits_per_row  -- NULL instead of an error for months with no rows
FROM monthly_stats
ORDER BY month DESC;

Improvement actions:
- If increasing: Optimise slow queries, right-size warehouses
- If stable: Monitor for unexpected spikes
- If decreasing: Continue current optimisation efforts
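To decide which of the three cases applies, you need a definition of "stable". The sketch below compares the two most recent months and treats a change within 5% as stable. The 5% tolerance, the function name, and the sample figures are assumptions to adjust for your own platform.

```python
def credits_per_row_trend(months: list[tuple[str, float, int]],
                          tolerance: float = 0.05) -> str:
    """Classify the latest month's credits per row against the previous month.

    `months` is ordered oldest first: (label, total_credits, total_rows).
    """
    # Skip months with no rows processed to avoid dividing by zero.
    usable = [credits / rows for _, credits, rows in months if rows > 0]
    if len(usable) < 2:
        raise ValueError("need at least two months with rows processed")
    previous, latest = usable[-2], usable[-1]
    if previous == 0:
        return "increasing" if latest > 0 else "stable"
    change = (latest - previous) / previous
    if change > tolerance:
        return "increasing"
    if change < -tolerance:
        return "decreasing"
    return "stable"


# Hypothetical history: credits rose 20% while row volume stayed flat.
history = [("2024-01", 100.0, 1_000_000), ("2024-02", 120.0, 1_000_000)]
print(credits_per_row_trend(history))
```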

What's Next

You've completed the Observability section. Here are logical next steps:

Option 1: Build Reverse ETL

What: Send data from Snowflake back to operational systems (CRM, marketing tools)

Why: Close the loop — insights from analytics feed back into operations

Tools: Census, Hightouch, or Airbyte Reverse ETL

Documentation: Build: Reverse ETL (future section)

Option 2: Implement Data Governance

What: Define ownership, access policies, and data classification

Why: Ensure compliance (GDPR, CCPA) and prevent unauthorised data access

Tools: OpenMetadata (governance features), Snowflake tags, AWS IAM policies

Documentation: Build: Data Governance (future section)

Option 3: Add Machine Learning

What: Train ML models on your data warehouse

Why: Predictive analytics, churn prediction, recommendation engines

Tools: Snowflake Snowpark (Python in Snowflake), AWS SageMaker, dbt Python models

Documentation: Build: Machine Learning (future section)

Option 4: Scale Infrastructure

What: Multi-region deployment, disaster recovery, high availability

Why: Production-grade reliability for mission-critical data

Documentation: Maintain: Disaster Recovery (future section)

Option 5: Optimise for Performance

What: Query optimisation, clustering keys, materialised views

Why: Faster dashboards, lower costs, better user experience

Estimated effort: Ongoing

Documentation: Maintain: Performance Optimisation (future section)

Common Challenges and Solutions

Challenge: Alert Fatigue

Symptom: Too many alerts, team ignores them

Solution:
1. Review alerts weekly and disable noisy, low-value alerts
2. Raise the anomaly detection threshold so fewer points are flagged (for example, Elementary's anomaly_sensitivity from 3 to 4)
3. Aggregate alerts: send a daily summary instead of real-time alerts for SEV3
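The aggregation in step 3 can be sketched as a small routing function: SEV1 and SEV2 alerts pass through immediately, while SEV3 alerts are counted and rolled into one daily message. The alert record shape (`name`, `severity`), the function name, and the digest format are hypothetical; sending the result to Slack is left out.

```python
from collections import Counter


def route_alerts(alerts: list[dict]) -> tuple[list[dict], str]:
    """Split alerts into those sent immediately and a daily SEV3 digest."""
    immediate = [a for a in alerts if a["severity"] != "SEV3"]
    low = Counter(a["name"] for a in alerts if a["severity"] == "SEV3")
    # Most frequent alerts first, so the noisiest checks stand out.
    lines = [f"{name}: {count}x" for name, count in low.most_common()]
    digest = "Daily SEV3 summary\n" + "\n".join(lines) if lines else ""
    return immediate, digest


# Hypothetical alerts collected over one day.
todays_alerts = [
    {"name": "orders_freshness", "severity": "SEV1"},
    {"name": "row_count_drift", "severity": "SEV3"},
    {"name": "row_count_drift", "severity": "SEV3"},
    {"name": "null_rate", "severity": "SEV3"},
]
immediate, digest = route_alerts(todays_alerts)
print(f"{len(immediate)} alert(s) sent immediately")
print(digest)
```

Alerts that dominate the digest week after week are good candidates for the weekly review in step 1.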

Challenge: Runbook Drift

Symptom: Runbooks become outdated, don't match current system

Solution:
1. Review runbooks quarterly
2. Update runbooks after every incident (part of post-incident review)
3. Version control runbooks in Git

Challenge: Cost Creep

Symptom: Costs increasing 10-20% month-over-month

Solution:
1. Set up AWS Budgets and Snowflake Resource Monitors (you've done this!)
2. Review top 10 most expensive queries monthly
3. Archive cold data to S3 after 2 years
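The symptom itself is easy to check for once monthly totals are exported. Below is a minimal sketch that flags any month whose spend grew more than 10% over the previous one. It assumes costs are keyed by sortable month labels such as "2024-01"; the function name, threshold default, and figures are illustrative.

```python
def cost_creep_months(monthly_costs: dict[str, float],
                      threshold: float = 0.10) -> list[str]:
    """Return months whose cost grew by more than `threshold` month over month."""
    months = sorted(monthly_costs)  # assumes labels sort chronologically
    flagged = []
    for prev, curr in zip(months, months[1:]):
        prev_cost = monthly_costs[prev]
        # Skip months with no prior spend to avoid dividing by zero.
        if prev_cost > 0 and monthly_costs[curr] / prev_cost - 1 > threshold:
            flagged.append(curr)
    return flagged


# Hypothetical combined AWS and Snowflake spend in USD.
spend = {"2024-01": 100.0, "2024-02": 104.0, "2024-03": 125.0}
print(cost_creep_months(spend))
```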

Challenge: Siloed Observability

Symptom: Each team has their own monitoring tools, no unified view

Solution:
1. Centralise logs in CloudWatch
2. Use OpenMetadata as single source of truth for metadata
3. Create shared Slack channels for alerts (#data-alerts, #data-critical)

Challenge: No Time for Observability

Symptom: "We're too busy building features to invest in observability"

Solution:
1. Start small: add Elementary to one critical table
2. Demonstrate value: show how a single detected anomaly prevented an incident
3. Make observability part of the definition of done (every new model must have tests)

Resources

Community

dbt Slack Community — #advice-dbt-tests channel
Locally Optimistic Slack — Data professionals community
Data Engineering Subreddit
Prefect Slack Community

Courses

DataCamp: Data Quality in Python
Udemy: dbt Fundamentals
Elementary Academy — Free courses on data observability

Summary

You've completed the Observability section:

Data quality testing — dbt tests, dbt_expectations, Elementary
Data cataloging — OpenMetadata with lineage, search, and governance
Pipeline monitoring — Prefect automations, Slack alerts, custom metrics
Cost monitoring — Snowflake Resource Monitors, AWS Budgets, cost attribution
Alerting and incidents — runbooks, on-call rotations, post-incident reviews
Logging and debugging — CloudWatch Logs, query profiling, systematic troubleshooting
Anomaly detection — Elementary automated detection, custom SQL checks, Great Expectations

Total monthly cost: ~$80-100 (Elementary + OpenMetadata self-hosted + CloudWatch Logs)

You've built a production-grade observability stack that provides:
- Proactive issue detection: catch problems before users notice
- Fast debugging: resolve incidents in minutes, not hours
- Cost control: prevent surprise bills
- Continuous improvement: learn from incidents and prevent recurrence

What's Next

Your modern data stack is now observable, reliable, and cost-efficient.

Next recommended section: Build: Data Transformation (Advanced dbt patterns, materialisation strategies, performance optimisation)

Alternative paths:
- Build: Reverse ETL (close the loop)
- Build: Data Governance (compliance and access control)
- Maintain: Operations and Runbooks (day-to-day platform management)

Congratulations on completing the Observability section!

You've transformed a black-box data platform into an observable, debuggable, and reliable system. Your team can now detect issues early, debug problems quickly, and continuously improve data quality.