Finishing Up
On this page, you will:
- Review what you've built in the Observability section
- Understand the observability maturity model
- Learn when to upgrade from budget to premium tools
- Explore next steps beyond observability
- Access additional resources and community support
Summary
Congratulations! You've implemented a comprehensive observability stack for your modern data platform.
Observability Components
Observability Stack ✅
├── Data Quality
│   ├── dbt tests (schema.yml files with generic + custom tests) ✅
│   ├── dbt_expectations package (statistical validation) ✅
│   └── Elementary
│       ├── Elementary dbt package (test results tracking) ✅
│       ├── Elementary CLI (anomaly detection) ✅
│       └── Elementary UI (self-hosted or Elementary Cloud) ✅
│
├── Data Catalog
│   └── OpenMetadata (self-hosted on ECS)
│       ├── Snowflake connector (metadata extraction) ✅
│       ├── dbt connector (lineage from manifest.json) ✅
│       ├── Prefect connector (pipeline metadata) ✅
│       └── UI (search, lineage, data quality dashboard) ✅
│
├── Pipeline Monitoring
│   ├── Prefect Cloud UI (flow runs, states, logs) ✅
│   ├── Prefect automations (retry on failure, Slack alerts) ✅
│   └── Snowflake query history (performance profiling) ✅
│
├── Cost Monitoring
│   ├── Snowflake Resource Monitors (budget alerts) ✅
│   ├── AWS Cost Explorer (infrastructure spend) ✅
│   └── Cost allocation tags (Terraform-managed resources) ✅
│
└── Logging & Debugging
    ├── CloudWatch Logs (ECS services: Prefect, Airbyte, Lightdash) ✅
    ├── Snowflake query logs (QUERY_HISTORY view) ✅
    └── dbt logs (run_results.json, compiled SQL) ✅
Key Capabilities
You can now:
Detect issues proactively:
- Elementary catches volume, freshness, and schema anomalies
- dbt tests validate data quality on every run
- Prefect monitors pipeline health in real-time
Debug issues quickly:
- OpenMetadata lineage traces data from source to dashboard
- Centralised logs in CloudWatch show exactly what failed
- Snowflake Query Profile identifies slow queries
Control costs:
- Snowflake Resource Monitors prevent budget overruns
- AWS Cost Explorer tracks infrastructure spending
- Cost allocation tags enable team-level chargeback
Respond to incidents:
- Runbooks provide step-by-step resolution guides
- On-call rotations ensure 24/7 coverage for critical alerts
- Post-incident reviews prevent recurrence
Observability Maturity Model
Your observability practice will evolve over time. Here's a maturity model to guide your growth.
Level 1: Reactive (Day 1-30)
Characteristics:
- Manual monitoring (checking dashboards daily)
- Reacting to user reports of issues
- Basic dbt tests (not_null, unique)
- No automated alerting
What you've outgrown:
- You've moved past this stage by implementing automated monitoring and alerting
Level 2: Proactive (Day 30-90) — You are here
Characteristics:
- Automated alerting for pipeline failures
- dbt tests on all critical tables
- Elementary anomaly detection running
- Slack/email notifications configured
- Basic runbooks for common incidents
Next steps:
- Expand test coverage to 80%+ of models
- Add SLOs for critical pipelines
- Refine anomaly detection to reduce false positives
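One way to track progress toward the 80% coverage goal is to count the models that have at least one test attached. The sketch below assumes a dictionary shaped like dbt's manifest.json (a `nodes` mapping whose entries carry `resource_type` and `depends_on.nodes`). The function name `model_test_coverage` and the sample manifest are hypothetical and only illustrate the idea.

```python
def model_test_coverage(manifest: dict) -> float:
    """Return the percentage of models referenced by at least one test."""
    nodes = manifest.get("nodes", {})
    models = {uid for uid, node in nodes.items()
              if node.get("resource_type") == "model"}
    tested = set()
    for node in nodes.values():
        if node.get("resource_type") == "test":
            # A test lists the nodes it checks under depends_on.nodes.
            parents = node.get("depends_on", {}).get("nodes", [])
            tested.update(set(parents) & models)
    if not models:
        return 0.0
    return 100.0 * len(tested) / len(models)


# Hypothetical, hand-written manifest fragment for illustration only.
sample_manifest = {
    "nodes": {
        "model.shop.orders": {"resource_type": "model"},
        "model.shop.customers": {"resource_type": "model"},
        "model.shop.payments": {"resource_type": "model"},
        "model.shop.refunds": {"resource_type": "model"},
        "test.shop.not_null_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.shop.orders"]},
        },
        "test.shop.relationships_orders_customers": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.shop.orders", "model.shop.customers"]},
        },
    }
}

coverage = model_test_coverage(sample_manifest)
print(f"{coverage:.0f}% of models have at least one test")
```

In a real project the dictionary would typically come from `json.load` on the `target/manifest.json` file that dbt writes after a run. Two of the four sample models are tested here, so the sketch reports 50%.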
Level 3: Predictive (Day 90-180)
Characteristics:
- Machine learning-based anomaly detection
- Forecasting data quality issues before they occur
- Comprehensive SLI/SLO tracking
- Automated remediation (auto-retry, auto-rollback)
- Detailed post-incident reviews for all SEV1/SEV2 incidents
Tools to add:
- Monte Carlo (ML-based data observability)
- Datadog (unified monitoring across infrastructure and applications)
- Custom ML models for anomaly prediction
Level 4: Optimised (Day 180+)
Characteristics:
- Data quality SLAs with stakeholders
- Continuous optimisation based on metrics
- Cross-team collaboration on data quality
- Automated data quality scoring
- Self-healing pipelines
Advanced practices:
- Data contracts between teams
- Data product thinking (treat datasets as products with SLAs)
- Automated incident classification and routing
- Predictive capacity planning
How to Progress
From Level 2 to Level 3:
1. Track SLIs (success rate, latency, data freshness) for 3 months
2. Set SLOs based on historical performance (e.g., 99% uptime)
3. Implement automated remediation (Prefect automations for retries)
4. Expand anomaly detection to all fact/dimension tables
5. Conduct post-incident reviews for every SEV1/SEV2 incident
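Steps 1 and 2 can be sketched as a small calculation: measure the success-rate SLI over a window, then compare it with a candidate SLO and see how much error budget is left. The class `SliWindow`, the function `error_budget_remaining`, and the run counts are hypothetical; they are not part of Prefect, dbt, or Elementary.

```python
from dataclasses import dataclass


@dataclass
class SliWindow:
    """Run counts observed over one measurement window (e.g. 30 days)."""
    total_runs: int
    failed_runs: int

    @property
    def success_rate(self) -> float:
        return 100.0 * (self.total_runs - self.failed_runs) / self.total_runs


def error_budget_remaining(window: SliWindow, slo_pct: float) -> int:
    """Failed runs still allowed in this window before the SLO is breached.

    A negative result means the SLO has already been missed.
    """
    # round() guards against float noise such as 100 - 99.9 = 0.0999...
    allowed = int(round(window.total_runs * (100.0 - slo_pct) / 100.0, 6))
    return allowed - window.failed_runs


# Made-up numbers: 1,000 runs in the window, 6 of which failed.
window = SliWindow(total_runs=1000, failed_runs=6)
print(f"Success rate: {window.success_rate:.1f}%")
print("Budget left at 99% SLO:", error_budget_remaining(window, 99.0))
print("Budget left at 99.9% SLO:", error_budget_remaining(window, 99.9))
```

With these sample numbers a 99% SLO leaves four failures of headroom, while a 99.9% SLO would already be breached. That is the reason to set the target from observed history rather than from aspiration.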
From Level 3 to Level 4:
1. Define data SLAs with business stakeholders
2. Implement data contracts (schemas, quality guarantees)
3. Build custom ML models for anomaly prediction
4. Create data product teams responsible for specific datasets
5. Measure and report on data quality KPIs monthly
When to Upgrade Tools
You've built the budget approach (~$80-100/month for observability). Here's when to upgrade to premium tools.
Elementary Cloud ($50+/month)
Upgrade when:
- Self-hosted Elementary UI requires too much maintenance
- Team grows to 10+ people (managed service worth the cost)
- Need longer than 7-day log retention
Benefits:
- Zero infrastructure management
- Automatic updates
- Hosted at https://your-org.elementary-data.com
Budget impact: ~+$50/month
Monte Carlo ($$$, quote-based)
Upgrade when:
- False positives from Elementary too high (Monte Carlo uses ML to reduce noise)
- Need lineage across 10+ data sources
- Require automated incident classification
- Enterprise compliance needs (SOC 2, GDPR auditing)
Benefits:
- ML-based anomaly detection (fewer false positives)
- Multi-cloud support (Snowflake, BigQuery, Redshift, Databricks)
- Automated root cause analysis
- Field-level lineage
Budget impact: ~$1,000-5,000/month depending on data volume
Datadog ($15+/host/month)
Upgrade when:
- Need unified monitoring for data platform + application infrastructure
- Want APM (Application Performance Monitoring) for ECS services
- Require advanced alerting with ML-based thresholds
- Large engineering team (50+ people) already using Datadog
Benefits:
- Unified dashboards for infrastructure, applications, and logs
- APM for deep performance profiling
- Advanced anomaly detection
- Integrations with 500+ tools
Budget impact: ~$200-500/month for data platform infrastructure
Great Expectations Cloud ($$$, quote-based)
Upgrade when:
- Data science team wants to define expectations in Python
- Need collaborative expectation editing (UI for non-engineers)
- Require centralised expectation management across teams
- Want integration with dbt Cloud
Benefits:
- Managed Great Expectations platform
- UI for creating and editing expectations
- Integration with data catalogs
- Team collaboration features
Budget impact: ~$500-2,000/month
Stay on Budget Build If:
- Total team size <10 people
- Comfortable managing self-hosted infrastructure
- Cost-conscious (startup, early-stage)
- Elementary + dbt tests meet 90% of needs
As a rule of thumb, the budget build provides roughly 80% of the value at around 10% of the cost of the premium tools above. Upgrade only when specific pain points justify the expense.
Observability Metrics to Track
Measure observability effectiveness with these KPIs:
1. Pipeline Uptime
Definition: Percentage of pipeline runs that succeed
Target: ≥ 99% (SLO)
Calculation:
SELECT
    COUNT(CASE WHEN status = 'success' THEN 1 END) * 100.0 / COUNT(*) AS uptime_pct
FROM dbt_run_history
WHERE run_date >= CURRENT_DATE() - INTERVAL '30 days';
Improvement actions:
- If <95%: Focus on fixing flaky tests and unreliable API connections
- If 95-99%: Optimise retry logic and add circuit breakers
- If ≥99%: Maintain current practices
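The same calculation and the threshold bands above can be expressed in a few lines of Python, for example inside a scheduled reporting job. This is a minimal sketch: the function names and the sample statuses are hypothetical, and it assumes that exactly 99% counts as meeting the SLO.

```python
def uptime_pct(statuses):
    """Percentage of runs whose status is 'success'; None if there are no runs."""
    if not statuses:
        return None
    return 100.0 * sum(s == "success" for s in statuses) / len(statuses)


def recommended_action(uptime):
    """Map an uptime percentage onto the improvement bands for this KPI."""
    if uptime < 95.0:
        return "Fix flaky tests and unreliable API connections"
    if uptime < 99.0:
        return "Optimise retry logic and add circuit breakers"
    return "Maintain current practices"


# Made-up run history: 97 successful runs and 3 errors over the window.
run_statuses = ["success"] * 97 + ["error"] * 3

uptime = uptime_pct(run_statuses)
print(f"Uptime: {uptime:.1f}% -> {recommended_action(uptime)}")
```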
2. Mean Time to Detection (MTTD)
Definition: Average time from issue occurring to being detected
Target: < 1 hour for SEV1, < 4 hours for SEV2
Measurement:
- Track time from pipeline failure to alert sent
- Review incident logs monthly
Improvement actions:
- Add Elementary freshness tests to detect stale data faster
- Increase anomaly detection frequency (hourly instead of daily)
- Add real-time monitors for critical tables
3. Mean Time to Resolution (MTTR)
Definition: Average time from detection to resolution
Target: < 2 hours for SEV1, < 1 day for SEV2
Measurement:
- PagerDuty tracks this automatically
- Manually log for non-PagerDuty incidents
Improvement actions:
- Create runbooks for top 10 incident types
- Implement automated remediation where possible
- Improve logging for faster debugging
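For incidents that are logged manually, MTTD and MTTR can be derived from three timestamps per incident: when the issue occurred, when it was detected, and when it was resolved. The sketch below uses a made-up incident log and checks the averages against the SEV1/SEV2 targets stated for these two KPIs. The tuple layout and the function names are assumptions, not a PagerDuty export format, and only SEV1 and SEV2 are handled.

```python
from datetime import datetime, timedelta
from statistics import mean

# Hypothetical incident log: (severity, occurred, detected, resolved).
incidents = [
    ("SEV1", datetime(2024, 1, 3, 2, 0), datetime(2024, 1, 3, 2, 30),
     datetime(2024, 1, 3, 3, 45)),
    ("SEV1", datetime(2024, 1, 17, 9, 0), datetime(2024, 1, 17, 9, 50),
     datetime(2024, 1, 17, 11, 20)),
    ("SEV2", datetime(2024, 1, 20, 14, 0), datetime(2024, 1, 20, 17, 0),
     datetime(2024, 1, 21, 10, 0)),
]

# Targets for the two KPIs, keyed by severity.
MTTD_TARGET = {"SEV1": timedelta(hours=1), "SEV2": timedelta(hours=4)}
MTTR_TARGET = {"SEV1": timedelta(hours=2), "SEV2": timedelta(days=1)}


def mean_delta(deltas):
    """Average a list of timedeltas."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def summarise(log):
    """Return MTTD/MTTR per severity and whether each meets its target."""
    report = {}
    for sev in sorted({row[0] for row in log}):
        rows = [row for row in log if row[0] == sev]
        mttd = mean_delta([det - occ for _, occ, det, _ in rows])
        mttr = mean_delta([res - det for _, _, det, res in rows])
        report[sev] = {
            "mttd": mttd,
            "mttr": mttr,
            "mttd_ok": mttd < MTTD_TARGET[sev],
            "mttr_ok": mttr < MTTR_TARGET[sev],
        }
    return report


report = summarise(incidents)
for sev, stats in report.items():
    print(sev, "MTTD", stats["mttd"], "MTTR", stats["mttr"])
```

Note that MTTR is measured here from detection to resolution, matching the definition above, so time spent undetected counts toward MTTD only.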
4. Data Quality Score
Definition: Percentage of dbt tests that pass
Target: ≥ 95%
Calculation:
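-- Per-run test outcomes live in Elementary's elementary_test_results model;
-- dbt_tests holds test metadata only. Adjust the schema name to your setup.
SELECT
    SUM(CASE WHEN status = 'pass' THEN 1 ELSE 0 END) * 100.0 / COUNT(*) AS quality_score_pct
FROM elementary.elementary_test_results
WHERE detected_at >= CURRENT_DATE() - INTERVAL '7 days';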
Improvement actions:
- If <90%: Fix failing tests or adjust thresholds
- If 90-95%: Expand test coverage to untested models
- If ≥95%: Add anomaly detection for proactive monitoring
5. Cost Efficiency
Definition: Snowflake credit cost per row processed
Target: Stable or decreasing over time
Calculation:
WITH monthly_stats AS (
    SELECT
        DATE_TRUNC('month', run_date) AS month,
        SUM(credits_used) AS total_credits,
        SUM(rows_processed) AS total_rows
    FROM dbt_run_history
    GROUP BY month
)
SELECT
    month,
    total_credits,
    total_rows,
    -- NULLIF avoids a division-by-zero error for months with no rows processed
    total_credits / NULLIF(total_rows, 0) AS credits_per_row
FROM monthly_stats
ORDER BY month DESC;
Improvement actions:
- If increasing: Optimise slow queries, right-size warehouses
- If stable: Monitor for unexpected spikes
- If decreasing: Continue current optimisation efforts
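Deciding whether the metric is "increasing", "stable", or "decreasing" needs a tolerance, otherwise normal month-to-month noise triggers action. The sketch below classifies the change between two months using a ±5% band. That band, the function names, and the sample figures are assumptions chosen for illustration.

```python
def credits_per_row(total_credits, total_rows):
    """Credits spent per row processed; None when no rows were processed."""
    return total_credits / total_rows if total_rows else None


def trend(current, previous, tolerance=0.05):
    """Classify the relative change between two periods.

    tolerance is a hypothetical band (here 5%) inside which the metric
    is treated as stable.
    """
    if current is None or previous is None or previous == 0:
        return "unknown"
    change = (current - previous) / previous
    if change > tolerance:
        return "increasing"
    if change < -tolerance:
        return "decreasing"
    return "stable"


# Made-up monthly totals: (credits used, rows processed).
january = credits_per_row(120, 60_000_000)
february = credits_per_row(150, 60_000_000)

print("Cost per row is", trend(february, january))
```

In this sample the row volume is flat while credits rise by 25%, so the metric is classified as increasing, which points toward query optimisation or warehouse right-sizing.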
What's Next
You've completed the Observability section. Here are logical next steps:
Option 1: Build Reverse ETL
What: Send data from Snowflake back to operational systems (CRM, marketing tools)
Why: Close the loop — insights from analytics feed back into operations
Tools: Census, Hightouch, or Airbyte Reverse ETL
Documentation: Build: Reverse ETL (future section)
Option 2: Implement Data Governance
What: Define ownership, access policies, and data classification
Why: Ensure compliance (GDPR, CCPA) and prevent unauthorised data access
Tools: OpenMetadata (governance features), Snowflake tags, AWS IAM policies
Documentation: Build: Data Governance (future section)
Option 3: Add Machine Learning
What: Train ML models on your data warehouse
Why: Predictive analytics, churn prediction, recommendation engines
Tools: Snowflake Snowpark (Python in Snowflake), Amazon SageMaker, dbt Python models
Documentation: Build: Machine Learning (future section)
Option 4: Scale Infrastructure
What: Multi-region deployment, disaster recovery, high availability
Why: Production-grade reliability for mission-critical data
Documentation: Maintain: Disaster Recovery (future section)
Option 5: Optimise for Performance
What: Query optimisation, clustering keys, materialised views
Why: Faster dashboards, lower costs, better user experience
Estimated effort: Ongoing
Documentation: Maintain: Performance Optimisation (future section)
Common Challenges and Solutions
Challenge: Alert Fatigue
Symptom: Too many alerts, team ignores them
Solution:
1. Review alerts weekly and disable noisy, low-value alerts
2. Raise the anomaly detection threshold (for example, Elementary's anomaly_sensitivity from 3 to 4, so that only larger deviations are flagged)
3. Aggregate alerts: send a daily summary instead of real-time alerts for SEV3
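To see why moving the sensitivity from 3 to 4 reduces noise, it helps to treat sensitivity as the number of standard deviations a value must sit away from the historical mean before it is flagged. The sketch below is a simplified z-score check written for illustration. It is not Elementary's implementation, and the row counts are made up.

```python
from statistics import mean, stdev


def is_anomaly(history, value, sensitivity):
    """Flag value if it lies more than `sensitivity` standard deviations
    from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        # With no historical variation, any different value is unusual.
        return value != mu
    return abs(value - mu) / sigma > sensitivity


# Hypothetical daily row counts for one table over the past week.
daily_row_counts = [1000, 1020, 980, 1010, 990, 1005, 995]
today = 1045

for sensitivity in (3, 4):
    flagged = is_anomaly(daily_row_counts, today, sensitivity)
    print(f"sensitivity={sensitivity}: anomaly={flagged}")
```

Today's count sits about 3.4 standard deviations above the mean, so it raises an alert at sensitivity 3 but not at 4. The trade-off is that a genuine but moderate shift would also go unreported at the higher setting.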
Challenge: Runbook Drift
Symptom: Runbooks become outdated, don't match current system
Solution:
1. Review runbooks quarterly
2. Update runbooks after every incident (part of post-incident review)
3. Version control runbooks in Git
Challenge: Cost Creep
Symptom: Costs increasing 10-20% month-over-month
Solution:
1. Set up AWS Budgets and Snowflake Resource Monitors (you've done this!)
2. Review top 10 most expensive queries monthly
3. Archive cold data to S3 after 2 years
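Month-over-month growth of 10-20% compounds quickly, which is why it is worth catching early. As a rough worked example, the snippet below projects a $100 monthly bill (the upper end of the budget build) forward by twelve months. The helper `projected_cost` is hypothetical and assumes a constant growth rate.

```python
def projected_cost(monthly_cost, monthly_growth, months):
    """Compound a monthly cost forward at a constant growth rate."""
    return monthly_cost * (1 + monthly_growth) ** months


for growth in (0.10, 0.20):
    after_year = projected_cost(100.0, growth, 12)
    print(f"{growth:.0%} month-over-month: $100 -> ${after_year:,.0f} after 12 months")
```

At 10% the bill roughly triples within a year (about $314), and at 20% it grows almost ninefold (about $892).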
Challenge: Siloed Observability
Symptom: Each team has their own monitoring tools, no unified view
Solution:
1. Centralise logs in CloudWatch
2. Use OpenMetadata as single source of truth for metadata
3. Create shared Slack channels for alerts (#data-alerts, #data-critical)
Challenge: No Time for Observability
Symptom: "We're too busy building features to invest in observability"
Solution:
1. Start small: add Elementary to one critical table
2. Demonstrate value: show how one anomaly detection prevented an incident
3. Make observability part of definition of done (every new model must have tests)
Resources
Documentation
- dbt Testing Documentation
- Elementary Documentation
- OpenMetadata Documentation
- Prefect Documentation
- Snowflake Observability
Community
- dbt Slack Community — #advice-dbt-tests channel
- Locally Optimistic Slack — Data professionals community
- Data Engineering Subreddit
- Prefect Slack Community
Courses
- DataCamp: Data Quality in Python
- Udemy: dbt Fundamentals
- Elementary Academy — Free courses on data observability
Summary
You've completed the Observability section:
- Data quality testing — dbt tests, dbt_expectations, Elementary
- Data cataloging — OpenMetadata with lineage, search, and governance
- Pipeline monitoring — Prefect automations, Slack alerts, custom metrics
- Cost monitoring — Snowflake Resource Monitors, AWS Budgets, cost attribution
- Alerting and incidents — runbooks, on-call rotations, post-incident reviews
- Logging and debugging — CloudWatch Logs, query profiling, systematic troubleshooting
- Anomaly detection — Elementary automated detection, custom SQL checks, Great Expectations
Total monthly cost: ~$80-100 (Elementary + OpenMetadata self-hosted + CloudWatch Logs)
You've built a production-grade observability stack that provides:
- Proactive issue detection: catch problems before users notice
- Fast debugging: resolve incidents in minutes, not hours
- Cost control: prevent surprise bills
- Continuous improvement: learn from incidents and prevent recurrence
What's Next
Your modern data stack is now observable, reliable, and cost-efficient.
Next recommended section: Build: Data Transformation (Advanced dbt patterns, materialisation strategies, performance optimisation)
Alternative paths:
- Build: Reverse ETL (close the loop)
- Build: Data Governance (compliance and access control)
- Maintain: Operations and Runbooks (day-to-day platform management)
Congratulations on completing the Observability section!
You've transformed a black-box data platform into an observable, debuggable, and reliable system. Your team can now detect issues early, debug problems quickly, and continuously improve data quality.