Data Cataloging with OpenMetadata
On this page, you will:
- Understand what a data catalog is and why it matters
- Deploy OpenMetadata to AWS ECS using Terraform
- Connect OpenMetadata to Snowflake for metadata extraction
- Connect OpenMetadata to dbt for lineage and documentation
- Connect OpenMetadata to Prefect for pipeline metadata
- Search and discover data assets across your entire stack
Overview
A data catalog is a centralized metadata repository that helps users discover, understand, and trust data. It answers questions like:
- "What tables exist in our warehouse?"
- "Where does this column come from?" (lineage)
- "Who owns this dataset?"
- "What tests validate this data?"
- "Which dashboards use this table?"
OpenMetadata is an open source data catalog that integrates with dbt, Snowflake, Prefect, and BI tools to provide a unified view of your data platform.
┌─────────────────────────────────────────────────────────────────────────┐
│ OPENMETADATA ARCHITECTURE │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Data Sources OpenMetadata Users │
│ ──────────── ──────────── ───── │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Snowflake │─────────▶│ Metadata DB │────────▶│ Data │ │
│ │ • Tables │ │ (PostgreSQL) │ │ Analysts │ │
│ │ • Views │ │ │ │ │ │
│ │ • Usage │ │ ┌──────────┐ │ │ Search: │ │
│ └──────────────┘ │ │ Lineage │ │ │ "customer" │ │
│ │ │ Docs │ │ │ │ │
│ ┌──────────────┐ │ │ Quality │ │ │ Results: │ │
│ │ dbt │─────────▶│ │ Owners │ │ │ • dim_ │ │
│ │ • Models │ │ └──────────┘ │ │ customers │ │
│ │ • Tests │ │ │ │ • fct_orders │ │
│ │ • Lineage │ └──────────────┘ └──────────────┘ │
│ └──────────────┘ │ │
│ │ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Prefect │─────────▶│ OpenMetadata │ │
│ │ • Flows │ │ UI │ │
│ │ • Tasks │ │ • Search │ │
│ │ • Runs │ │ • Lineage │ │
│ └──────────────┘ │ • Quality │ │
│ └──────────────┘ │
│ │
│ Ingestion runs daily: Extract metadata → Store in OpenMetadata │
│ │
└─────────────────────────────────────────────────────────────────────────┘
What OpenMetadata Provides
1. Data Discovery
Search across all data assets: - Tables, views, dashboards, pipelines - Full-text search on names, descriptions, columns - Filter by tags, owners, tiers, domains
Example: Search "revenue" → find fct_revenue, rpt_revenue_summary, "Revenue Dashboard" (Lightdash), dbt model int_revenue__daily
2. Lineage
Column-level lineage: - Trace data from source → staging → marts → dashboards - Understand transformations at each step - Impact analysis: "What breaks if I change this column?"
Example: dim_customers.email ← stg_hubspot__contacts.email_address ← raw_hubspot.contacts.properties_email
3. Data Quality
Test results from dbt: - All dbt tests (unique, not_null, custom) - Test pass/fail history - Elementary anomaly detection results
Example: View that customer_id has a "unique" test with 100% pass rate over 30 days.
4. Ownership and Documentation
Data governance: - Assign owners to tables, dashboards, pipelines - Tag datasets by tier (Tier 1 = critical, Tier 2 = important, Tier 3 = nice-to-have) - Add descriptions and documentation
Example: fct_orders owned by "Data Engineering Team", Tier 1, description links to Confluence page.
5. Usage Analytics
Who queries what: - Top queries by user - Most-used tables - Unused tables (candidates for deprecation)
Example: dim_customers queried 1,250 times last week by 15 users; dim_products_v1 not queried in 90 days (safe to drop).
OpenMetadata vs Alternatives
| Feature | OpenMetadata | Atlan | Amundsen | DataHub | Marquez |
|---|---|---|---|---|---|
| Cost | Free (OSS) | $$$ Enterprise | Free (OSS) | Free (OSS) | Free (OSS) |
| dbt integration | ✅ Excellent | ✅ Excellent | ⚠️ Basic | ✅ Good | ❌ None |
| Snowflake integration | ✅ Native | ✅ Native | ✅ Via plugin | ✅ Native | ⚠️ Basic |
| Column lineage | ✅ Yes | ✅ Yes | ❌ No | ✅ Yes | ⚠️ Basic |
| UI polish | ⚠️ Good | ✅ Excellent | ⚠️ Basic | ⚠️ Good | ⚠️ Basic |
| Active development | ✅ Very active | ✅ Commercial | ⚠️ Slow | ✅ Active | ⚠️ Moderate |
| Self-hosting ease | ✅ Docker/K8s | ❌ Cloud only | ⚠️ Complex | ⚠️ Complex | ✅ Simple |
Why OpenMetadata for this project: - Free and open source - Excellent dbt integration (reads manifest.json natively) - Active development and community - Relatively easy to self-host - Good balance of features vs complexity
Prerequisites
Before deploying OpenMetadata:
- VPC with public and private subnets (from VPC Setup)
- dbt project with models and tests (from Data Transformation)
- Snowflake with data (from Data Warehouse)
- Prefect running pipelines (from Orchestration)
Step 1: Deploy OpenMetadata to ECS
OpenMetadata consists of three services: 1. OpenMetadata Server (backend API) 2. OpenMetadata UI (web interface) 3. OpenMetadata Ingestion (metadata extraction workers)
All three run as Docker containers on ECS.
Create RDS PostgreSQL for Metadata
OpenMetadata stores metadata in PostgreSQL.
# terraform/aws/openmetadata-rds.tf
resource "aws_db_subnet_group" "openmetadata" {
name = "openmetadata-db-subnet-group"
subnet_ids = module.vpc.private_subnet_ids
tags = {
Name = "OpenMetadata DB Subnet Group"
ManagedBy = "terraform"
}
}
resource "aws_security_group" "openmetadata_rds" {
name = "openmetadata-rds-sg"
description = "Security group for OpenMetadata RDS"
vpc_id = module.vpc.vpc_id
ingress {
description = "PostgreSQL from ECS"
from_port = 5432
to_port = 5432
protocol = "tcp"
security_groups = [aws_security_group.openmetadata_ecs.id]
}
egress {
from_port = 0
to_port = 0
protocol = "-1"
cidr_blocks = ["0.0.0.0/0"]
}
tags = {
Name = "OpenMetadata RDS SG"
ManagedBy = "terraform"
}
}
resource "random_password" "openmetadata_db_password" {
length = 32
special = true
}
resource "aws_secretsmanager_secret" "openmetadata_db_password" {
name = "openmetadata/database-password"
}
resource "aws_secretsmanager_secret_version" "openmetadata_db_password" {
secret_id = aws_secretsmanager_secret.openmetadata_db_password.id
secret_string = random_password.openmetadata_db_password.result
}
resource "aws_db_instance" "openmetadata" {
identifier = "openmetadata-db"
engine = "postgres"
engine_version = "15.4"
instance_class = "db.t3.small" # Larger than Lightdash (more metadata)
allocated_storage = 50
max_allocated_storage = 200
storage_type = "gp3"
storage_encrypted = true
db_name = "openmetadata"
username = "openmetadata"
password = random_password.openmetadata_db_password.result
db_subnet_group_name = aws_db_subnet_group.openmetadata.name
vpc_security_group_ids = [aws_security_group.openmetadata_rds.id]
backup_retention_period = 7
backup_window = "03:00-04:00"
maintenance_window = "sun:04:00-sun:05:00"
skip_final_snapshot = false
final_snapshot_identifier = "openmetadata-final-${formatdate("YYYY-MM-DD-hhmm", timestamp())}"
tags = {
Name = "OpenMetadata PostgreSQL"
ManagedBy = "terraform"
}
}
Deploy OpenMetadata Server (ECS)
# terraform/aws/openmetadata-ecs.tf
resource "aws_ecs_cluster" "openmetadata" {
name = "openmetadata-cluster"
setting {
name = "containerInsights"
value = "enabled"
}
}
resource "aws_ecs_task_definition" "openmetadata_server" {
family = "openmetadata-server"
network_mode = "awsvpc"
requires_compatibilities = ["FARGATE"]
cpu = "2048"
memory = "4096"
execution_role_arn = aws_iam_role.ecs_task_execution_role.arn
task_role_arn = aws_iam_role.openmetadata_task_role.arn
container_definitions = jsonencode([{
name = "openmetadata-server"
image = "openmetadata/server:1.3.0"
portMappings = [{
containerPort = 8585
protocol = "tcp"
}]
environment = [
{ name = "DB_DRIVER_CLASS", value = "org.postgresql.Driver" },
{ name = "DB_SCHEME", value = "postgresql" },
{ name = "DB_HOST", value = aws_db_instance.openmetadata.address },
{ name = "DB_PORT", value = "5432" },
{ name = "DB_DATABASE", value = "openmetadata" },
{ name = "DB_USER", value = "openmetadata" },
{ name = "AUTHENTICATION_PROVIDER", value = "basic" },
{ name = "AUTHORIZER_CLASS_NAME", value = "org.openmetadata.service.security.DefaultAuthorizer" }
]
secrets = [
{
name = "DB_USER_PASSWORD"
valueFrom = aws_secretsmanager_secret.openmetadata_db_password.arn
}
]
logConfiguration = {
logDriver = "awslogs"
options = {
"awslogs-group" = aws_cloudwatch_log_group.openmetadata.name
"awslogs-region" = data.aws_region.current.name
"awslogs-stream-prefix" = "server"
}
}
healthCheck = {
command = ["CMD-SHELL", "wget --quiet --tries=1 --spider http://localhost:8585/api/v1/health || exit 1"]
interval = 30
timeout = 5
retries = 3
startPeriod = 120
}
}])
}
resource "aws_ecs_service" "openmetadata_server" {
name = "openmetadata-server"
cluster = aws_ecs_cluster.openmetadata.id
task_definition = aws_ecs_task_definition.openmetadata_server.arn
desired_count = 1
launch_type = "FARGATE"
network_configuration {
subnets = module.vpc.private_subnet_ids
security_groups = [aws_security_group.openmetadata_ecs.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.openmetadata.arn
container_name = "openmetadata-server"
container_port = 8585
}
}
Create ALB for HTTPS Access
# terraform/aws/openmetadata-alb.tf
resource "aws_lb" "openmetadata" {
name = "openmetadata-alb"
internal = false
load_balancer_type = "application"
security_groups = [aws_security_group.openmetadata_alb.id]
subnets = module.vpc.public_subnet_ids
tags = {
Name = "OpenMetadata ALB"
ManagedBy = "terraform"
}
}
resource "aws_lb_target_group" "openmetadata" {
name = "openmetadata-tg"
port = 8585
protocol = "HTTP"
vpc_id = module.vpc.vpc_id
target_type = "ip"
health_check {
path = "/api/v1/health"
interval = 30
timeout = 5
healthy_threshold = 2
unhealthy_threshold = 3
matcher = "200"
}
}
resource "aws_lb_listener" "openmetadata_https" {
load_balancer_arn = aws_lb.openmetadata.arn
port = 443
protocol = "HTTPS"
ssl_policy = "ELBSecurityPolicy-TLS13-1-2-2021-06"
certificate_arn = aws_acm_certificate.openmetadata.arn
default_action {
type = "forward"
target_group_arn = aws_lb_target_group.openmetadata.arn
}
}
output "openmetadata_url" {
value = "https://${aws_lb.openmetadata.dns_name}"
}
Apply Terraform
cd terraform/aws
terraform plan
terraform apply
Once provisioning completes, OpenMetadata is accessible at the ALB URL.
Navigate to https://openmetadata-alb-xxxxx.eu-west-2.elb.amazonaws.com and log in with default credentials:
- Username: admin
- Password: admin
Change the password immediately in Settings → Users.
Step 2: Create SVC_OPENMETADATA Service Account
OpenMetadata needs read-only access to Snowflake's ACCOUNT_USAGE schema for metadata extraction.
# terraform/snowflake/service-accounts/openmetadata.tf
module "service_user_openmetadata" {
source = "../modules/snowflake_service_user"
username = "SVC_OPENMETADATA"
comment = "Service account for OpenMetadata data catalog"
email = "data-platform@yourcompany.com"
rsa_public_key = var.svc_openmetadata_public_key
user_create_dedicated_role = true
dedicated_role_grants = [
"ANALYTICS_REPORTER" # Read REPORTING schema for table metadata
]
default_warehouse = "REPORTING"
default_namespace = "ANALYTICS.REPORTING"
default_role = "SVC_OPENMETADATA"
tags = {
Service = "openmetadata"
ManagedBy = "terraform"
}
}
# Grant access to ACCOUNT_USAGE for query history and metadata
resource "snowflake_grant_privileges_to_account_role" "openmetadata_account_usage" {
account_role_name = module.service_user_openmetadata.dedicated_role_name
privileges = ["IMPORTED PRIVILEGES"]
on_account_object {
object_type = "DATABASE"
object_name = "SNOWFLAKE"
}
}
This grants SVC_OPENMETADATA:
- Read access to ANALYTICS.REPORTING (table metadata, column names, types)
- Read access to SNOWFLAKE.ACCOUNT_USAGE (query history, table usage statistics)
Step 3: Connect OpenMetadata to Snowflake
Add Snowflake Service
- Log in to OpenMetadata UI
- Navigate to Settings → Services → Databases
- Click Add Database Service
- Select Snowflake
Configure:
| Field | Value |
|---|---|
| Name | snowflake-analytics |
| Account | your-account.snowflakecomputing.com |
| Username | SVC_OPENMETADATA |
| Password | (leave blank, using key-pair) |
| Private Key | Paste contents of svc_openmetadata_rsa_key.pem |
| Warehouse | REPORTING |
| Database | ANALYTICS |
| Schema | REPORTING (or leave blank to ingest all schemas) |
Click Test Connection → Should succeed.
Click Save.
Configure Metadata Ingestion
- Click on the
snowflake-analyticsservice - Navigate to Ingestion tab
- Click Add Ingestion → Metadata
Configure:
| Field | Value |
|---|---|
| Name | snowflake-metadata-daily |
| Database Filter | ANALYTICS |
| Schema Filter | REPORTING, MARTS (comma-separated) |
| Include Views | ✅ Yes |
| Include Tables | ✅ Yes |
| Mark Deleted Tables | ✅ Yes (soft delete in OpenMetadata if table is dropped) |
Schedule: - Frequency: Daily - Time: 02:00 UTC (after dbt runs complete)
Click Deploy.
OpenMetadata will now extract metadata from Snowflake daily at 02:00 UTC.
Verify Snowflake Ingestion
Navigate to Explore → Tables.
You should see:
- analytics.reporting.fct_exchange_rates
- analytics.reporting.dim_products
- analytics.marts.fct_exchange_rates
- etc.
Click on a table to see: - Schema (column names, types, descriptions) - Sample Data (first 10 rows) - Queries (who queried this table, when) - Lineage (empty for now — added in next step with dbt)
Step 4: Connect OpenMetadata to dbt
OpenMetadata reads dbt's manifest.json, catalog.json, and run_results.json to extract:
- Model descriptions
- Column descriptions
- dbt tests
- Lineage (which models depend on which)
Option A: GitHub Integration (Recommended)
OpenMetadata can clone your dbt repository and read artifacts.
- Navigate to Settings → Services → Pipelines
- Click Add Pipeline Service
- Select dbt
Configure:
| Field | Value |
|---|---|
| Name | dbt-transform |
| dbt Config Source | GitHub |
| Repository URL | https://github.com/your-org/dbt-transform |
| Branch | main |
| dbt Project Path | . (root of repo) |
| GitHub Token | Personal access token with repo scope |
dbt Catalog Path: target/catalog.json
dbt Manifest Path: target/manifest.json
dbt Run Results Path: target/run_results.json
Click Test Connection → Save.
Configure dbt Metadata Ingestion
- Click on
dbt-transformservice - Navigate to Ingestion tab
- Click Add Ingestion → Metadata
Configure:
| Field | Value |
|---|---|
| Name | dbt-metadata-daily |
| dbt Update Descriptions | ✅ Yes (overwrite Snowflake descriptions with dbt descriptions) |
| Include dbt Tags | ✅ Yes |
Schedule: - Frequency: Daily - Time: 09:00 UTC (after dbt runs and uploads artifacts to GitHub)
Click Deploy.
Option B: Upload Artifacts Manually
If you don't want GitHub integration:
- After
dbt run, uploadtarget/manifest.json,target/catalog.json,target/run_results.jsonto S3 - Configure OpenMetadata to read from S3 (HTTP URL)
Verify dbt Ingestion
Navigate to Explore → Tables → fct_exchange_rates.
You should now see:
- Description from dbt YAML
- Column descriptions from dbt YAML
- Tests (unique, not_null, dbt_expectations tests)
- Lineage (upstream: stg_dlt__exchange_rates, downstream: dashboards)
Click Lineage tab to see the dbt DAG.
Step 5: Connect OpenMetadata to Prefect (Optional)
Extract pipeline metadata from Prefect (flows, tasks, runs).
Add Prefect Service
- Navigate to Settings → Services → Pipelines
- Click Add Pipeline Service
- Select Prefect
Configure:
| Field | Value |
|---|---|
| Name | prefect-orchestration |
| Host Port | https://api.prefect.cloud (or self-hosted URL) |
| API Key | Prefect API key (create in Prefect Cloud → API Keys) |
Click Test Connection → Save.
Configure Prefect Metadata Ingestion
- Click on
prefect-orchestrationservice - Ingestion → Add Ingestion → Metadata
Schedule: Daily at 10:00 UTC.
Now you can see Prefect flows in OpenMetadata and their relationship to dbt models.
Step 6: Search and Discovery
Global Search
Use the search bar at the top of OpenMetadata UI:
Search: customer
Results:
- dim_customers (table)
- stg_hubspot__contacts (table)
- customers column in fct_orders (column)
- "Customer Segmentation" (dashboard, if connected)
- ingest_hubspot_contacts (Prefect flow)
Advanced Filters
Filter search results by: - Service (Snowflake, dbt, Prefect) - Tier (Tier 1 = critical, Tier 2 = important, Tier 3 = experimental) - Owner (assign owners to datasets) - Tags (PII, GDPR, revenue, marketing)
Assign Ownership
- Navigate to a table (e.g.,
fct_exchange_rates) - Click Edit → Ownership
- Select Owner: "Data Engineering Team" (create teams in Settings → Teams)
- Select Tier: Tier 1 (critical)
- Add tags:
revenue,finance,daily
Now analysts can filter: "Show me all Tier 1 tables owned by Data Engineering."
Step 7: Automate Ingestion via CLI
Run OpenMetadata ingestion from the command line (for CI/CD integration).
Install OpenMetadata CLI
uv add "openmetadata-ingestion[snowflake,dbt]"
Create Ingestion Config
# ingestion/snowflake-config.yaml
source:
type: snowflake
serviceName: snowflake-analytics
serviceConnection:
config:
type: Snowflake
username: SVC_OPENMETADATA
privateKey: /path/to/svc_openmetadata_rsa_key.pem
snowflakeAccount: your-account.snowflakecomputing.com
warehouse: REPORTING
database: ANALYTICS
sourceConfig:
config:
type: DatabaseMetadata
schemaFilterPattern:
includes:
- REPORTING
- MARTS
sink:
type: metadata-rest
config:
workflowConfig:
openMetadataServerConfig:
hostPort: https://your-openmetadata-url.com
authProvider: openmetadata
securityConfig:
jwtToken: "YOUR_JWT_TOKEN"
workflowConfig:
loggerLevel: INFO
openMetadataServerConfig:
hostPort: https://your-openmetadata-url.com
Run Ingestion
metadata ingest -c ingestion/snowflake-config.yaml
This can be added to Prefect as a task to run after dbt.
Cost Summary
| Component | Monthly Cost |
|---|---|
| RDS PostgreSQL (db.t3.small) | ~$30 |
| ECS Fargate (2 vCPU, 4GB RAM) | ~$30 |
| ALB | ~$20 |
| Total | ~$80/month |
This provides a complete data catalog for your entire stack.
Summary
You've deployed OpenMetadata:
- OpenMetadata deployed to ECS with PostgreSQL backend
- Snowflake connected — metadata extraction from
ANALYTICSdatabase - dbt connected — lineage, tests, and documentation imported
- Prefect connected (optional) — pipeline metadata
- Search and discovery — find tables, columns, dashboards across all systems
- Ownership and tags — governance with teams, tiers, and tags
OpenMetadata is now the central catalog for your data platform, unifying metadata from Snowflake, dbt, and Prefect.
What's Next
Use OpenMetadata's lineage features to trace data flow and perform impact analysis.
Continue to Data Lineage →