S3 Data Lake
On this page, you will:
- Build the s3_bucket module with versioning, lifecycle, and IAM policies
- Create data lake buckets for dev, staging, and prod environments
- Understand the bucket configuration and access patterns
What is a Data Lake?
A data lake is a centralised storage repository that holds raw data in its native format until needed. Unlike a data warehouse (which stores processed, structured data), a data lake stores:
- Raw source data: Extracts from APIs, databases, and files
- Intermediate files: Staging data during transformations
- Exports: Query results and reports exported from Snowflake
For our data platform, S3 serves as the data lake, with Snowflake storage integrations providing secure access.
The S3 Bucket Module
This module creates S3 buckets with production-ready configuration including versioning, encryption, lifecycle policies, and pre-built IAM policies.
Create the module directory:
mkdir -p modules/s3_bucket
main.tf
Create modules/s3_bucket/main.tf:
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "~> 5.0"
    }
  }
}

# -----------------------------------------------------------------------------
# S3 Bucket
# -----------------------------------------------------------------------------
resource "aws_s3_bucket" "this" {
  bucket = var.bucket_name

  tags = merge(var.tags, {
    Name        = var.bucket_name
    Environment = var.environment
    ManagedBy   = "terraform"
  })
}

# -----------------------------------------------------------------------------
# Versioning
# -----------------------------------------------------------------------------
resource "aws_s3_bucket_versioning" "this" {
  bucket = aws_s3_bucket.this.id

  versioning_configuration {
    status = var.versioning_enabled ? "Enabled" : "Suspended"
  }
}

# -----------------------------------------------------------------------------
# Encryption
# -----------------------------------------------------------------------------
resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
    bucket_key_enabled = true
  }
}

# -----------------------------------------------------------------------------
# Public Access Block
# -----------------------------------------------------------------------------
resource "aws_s3_bucket_public_access_block" "this" {
  bucket = aws_s3_bucket.this.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# -----------------------------------------------------------------------------
# Lifecycle Rules
# -----------------------------------------------------------------------------
resource "aws_s3_bucket_lifecycle_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  # Apply versioning settings before the noncurrent-version rule
  depends_on = [aws_s3_bucket_versioning.this]

  # Clean up incomplete multipart uploads
  rule {
    id     = "abort-incomplete-multipart-uploads"
    status = "Enabled"

    # Empty filter: the rule applies to every object in the bucket
    filter {}

    abort_incomplete_multipart_upload {
      days_after_initiation = 7
    }
  }

  # Move old versions to cheaper storage, then delete
  dynamic "rule" {
    for_each = var.versioning_enabled ? [1] : []
    content {
      id     = "noncurrent-version-management"
      status = "Enabled"

      # Empty filter: the rule applies to every object in the bucket
      filter {}

      noncurrent_version_transition {
        noncurrent_days = var.noncurrent_version_transition_days
        storage_class   = "STANDARD_IA"
      }

      noncurrent_version_expiration {
        noncurrent_days = var.noncurrent_version_expiration_days
      }
    }
  }

  # Optional: expire objects under the temp/ prefix
  dynamic "rule" {
    for_each = var.temp_prefix_expiration_days != null ? [1] : []
    content {
      id     = "temp-prefix-expiration"
      status = "Enabled"

      filter {
        prefix = "temp/"
      }

      expiration {
        days = var.temp_prefix_expiration_days
      }
    }
  }
}

# -----------------------------------------------------------------------------
# IAM Policy Documents
# -----------------------------------------------------------------------------
# Read-only policy - for Snowflake read-only integrations
data "aws_iam_policy_document" "read" {
  statement {
    sid    = "AllowListBucket"
    effect = "Allow"
    actions = [
      "s3:ListBucket",
      "s3:GetBucketLocation"
    ]
    resources = [aws_s3_bucket.this.arn]
  }

  statement {
    sid    = "AllowReadObjects"
    effect = "Allow"
    actions = [
      "s3:GetObject",
      "s3:GetObjectVersion"
    ]
    resources = ["${aws_s3_bucket.this.arn}/*"]
  }
}

# Read-write policy - for Snowflake full integrations
data "aws_iam_policy_document" "write" {
  statement {
    sid    = "AllowListBucket"
    effect = "Allow"
    actions = [
      "s3:ListBucket",
      "s3:GetBucketLocation"
    ]
    resources = [aws_s3_bucket.this.arn]
  }

  statement {
    sid    = "AllowReadWriteObjects"
    effect = "Allow"
    actions = [
      "s3:GetObject",
      "s3:GetObjectVersion",
      "s3:PutObject",
      "s3:DeleteObject"
    ]
    resources = ["${aws_s3_bucket.this.arn}/*"]
  }
}

# -----------------------------------------------------------------------------
# IAM Policies (for attachment to roles)
# -----------------------------------------------------------------------------
resource "aws_iam_policy" "read" {
  count = var.create_iam_policies ? 1 : 0

  name        = "${var.bucket_name}-read"
  description = "Read-only access to ${var.bucket_name} S3 bucket"
  policy      = data.aws_iam_policy_document.read.json
  tags        = var.tags
}

resource "aws_iam_policy" "write" {
  count = var.create_iam_policies ? 1 : 0

  name        = "${var.bucket_name}-write"
  description = "Read-write access to ${var.bucket_name} S3 bucket"
  policy      = data.aws_iam_policy_document.write.json
  tags        = var.tags
}

variables.tf
Create modules/s3_bucket/variables.tf:
variable "bucket_name" {
  description = "Name of the S3 bucket (must be globally unique)"
  type        = string
}

variable "environment" {
  description = "Environment name (dev, staging, prod)"
  type        = string

  validation {
    condition     = contains(["dev", "staging", "prod"], var.environment)
    error_message = "Environment must be dev, staging, or prod."
  }
}

variable "versioning_enabled" {
  description = "Enable versioning on the bucket"
  type        = bool
  default     = true
}

variable "noncurrent_version_transition_days" {
  description = "Days before moving noncurrent versions to STANDARD_IA"
  type        = number
  default     = 30
}

variable "noncurrent_version_expiration_days" {
  description = "Days before deleting noncurrent versions"
  type        = number
  default     = 90
}

variable "temp_prefix_expiration_days" {
  description = "Days before expiring objects with temp/ prefix (null to disable)"
  type        = number
  default     = 7
}

variable "create_iam_policies" {
  description = "Create IAM policies for read and write access"
  type        = bool
  default     = true
}

variable "tags" {
  description = "Additional tags for the bucket"
  type        = map(string)
  default     = {}
}

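The lifecycle variables accept any number, so a bad value only surfaces when AWS rejects the lifecycle configuration at apply time. As an optional sketch, a validation block can catch it at plan time. This assumes S3's 30-day minimum storage period before a transition to STANDARD_IA also applies to noncurrent versions; check the current S3 documentation before relying on it. If you adopt it, it replaces the existing declaration of the same variable rather than being added alongside it.

```hcl
variable "noncurrent_version_transition_days" {
  description = "Days before moving noncurrent versions to STANDARD_IA"
  type        = number
  default     = 30

  # Assumption: S3 rejects STANDARD_IA transitions earlier than 30 days.
  validation {
    condition     = var.noncurrent_version_transition_days >= 30
    error_message = "Transitions to STANDARD_IA require at least 30 days."
  }
}
```

The expiration value should also be larger than the transition value; otherwise versions would be deleted before they ever move to STANDARD_IA.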
outputs.tf
Create modules/s3_bucket/outputs.tf:
output "bucket_id" {
  description = "The name of the bucket"
  value       = aws_s3_bucket.this.id
}

output "bucket_arn" {
  description = "The ARN of the bucket"
  value       = aws_s3_bucket.this.arn
}

output "bucket_domain_name" {
  description = "The bucket domain name"
  value       = aws_s3_bucket.this.bucket_domain_name
}

output "bucket_regional_domain_name" {
  description = "The bucket region-specific domain name"
  value       = aws_s3_bucket.this.bucket_regional_domain_name
}

# IAM policy outputs
output "read_policy_arn" {
  description = "ARN of the read-only IAM policy"
  value       = var.create_iam_policies ? aws_iam_policy.read[0].arn : null
}

output "read_policy_json" {
  description = "JSON of the read-only IAM policy document"
  value       = data.aws_iam_policy_document.read.json
}

output "write_policy_arn" {
  description = "ARN of the read-write IAM policy"
  value       = var.create_iam_policies ? aws_iam_policy.write[0].arn : null
}

output "write_policy_json" {
  description = "JSON of the read-write IAM policy document"
  value       = data.aws_iam_policy_document.write.json
}

Create Data Lake Buckets
Now use the module to create data lake buckets for each environment.
Add Variables
Add to variables.tf:
variable "project_name" {
  description = "Project name used for resource naming"
  type        = string
}

variable "data_lake_environments" {
  description = "Environments to create data lake buckets for"
  type        = list(string)
  default     = ["dev", "staging", "prod"]
}

Configure in terraform.tfvars:
project_name = "mycompany" # Replace with your project/company name
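Because project_name becomes part of every bucket name, it has to satisfy S3's bucket naming rules (lowercase letters, digits, and hyphens, 63 characters at most for the whole name). One possible guard is a validation block on the variable. The regex and the 45-character cap are assumptions of this sketch: the cap leaves room for the longest suffix used on this page, "-data-lake-staging" (18 characters). If you adopt it, it replaces the plain project_name declaration.

```hcl
variable "project_name" {
  description = "Project name used for resource naming"
  type        = string

  # Illustrative check: 2-45 characters, lowercase letters, digits, hyphens,
  # starting and ending with a letter or digit.
  validation {
    condition     = can(regex("^[a-z0-9][a-z0-9-]{0,43}[a-z0-9]$", var.project_name))
    error_message = "Project name must be 2-45 lowercase letters, digits, or hyphens, and must start and end with a letter or digit."
  }
}
```

With this in place, a value such as "MyCompany" fails at plan time instead of at bucket creation.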
Create the Buckets
Create s3_data_lake.tf:
# =============================================================================
# Data Lake S3 Buckets
# =============================================================================
# S3 buckets for Snowflake data loading and unloading.
# One bucket per environment (dev, staging, prod).

module "data_lake" {
  source   = "./modules/s3_bucket"
  for_each = toset(var.data_lake_environments)

  bucket_name = "${var.project_name}-data-lake-${each.value}"
  environment = each.value

  # Versioning protects against accidental deletions
  versioning_enabled = true

  # Lifecycle settings
  noncurrent_version_transition_days = 30 # Move old versions to IA after 30 days
  noncurrent_version_expiration_days = 90 # Delete old versions after 90 days
  temp_prefix_expiration_days        = 7  # Clean up temp/ files after 7 days

  # Create IAM policies for Snowflake integration roles
  create_iam_policies = true

  tags = {
    Purpose = "Snowflake data lake storage"
  }
}

# -----------------------------------------------------------------------------
# Outputs
# -----------------------------------------------------------------------------
output "data_lake_buckets" {
  description = "Data lake bucket details by environment"
  value = {
    for env, bucket in module.data_lake : env => {
      id               = bucket.bucket_id
      arn              = bucket.bucket_arn
      read_policy_arn  = bucket.read_policy_arn
      write_policy_arn = bucket.write_policy_arn
    }
  }
}

Understanding the Configuration
Versioning
Versioning is enabled by default. This means:
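- Every overwrite creates a new version (the original is preserved as a noncurrent version)
- Deleting an object adds a delete marker instead of removing data, so the object can be recovered until its noncurrent versions expire
- Lifecycle rules clean up old versions to control costs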
Encryption
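All objects are encrypted at rest using AES-256 (SSE-S3). This is AWS's default encryption and has no additional cost. For sensitive data, you can switch to SSE-KMS with a customer-managed key. The module also sets bucket_key_enabled = true; S3 Bucket Keys reduce KMS request costs and only take effect once the bucket uses SSE-KMS.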
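If you do switch to SSE-KMS, the change is confined to the encryption resource in the module. The following is a minimal sketch, not part of the module as written: kms_key_arn is a hypothetical variable you would add to modules/s3_bucket/variables.tf, and the key itself is assumed to exist already.

```hcl
# Hypothetical variable: ARN of an existing customer-managed KMS key.
variable "kms_key_arn" {
  description = "ARN of the KMS key used for default bucket encryption"
  type        = string
}

resource "aws_s3_bucket_server_side_encryption_configuration" "this" {
  bucket = aws_s3_bucket.this.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = "aws:kms"
      kms_master_key_id = var.kms_key_arn
    }

    # With SSE-KMS, a bucket key reduces the number of KMS requests
    bucket_key_enabled = true
  }
}
```

Principals that read or write objects would then also need permissions on the key (typically kms:Decrypt for reads and kms:GenerateDataKey for writes), which the read and write policies on this page do not grant.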
Public Access Block
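All four public access block settings are enabled:
- block_public_acls: Rejects requests that would add public ACLs to the bucket or its objects
- block_public_policy: Rejects bucket policies that grant public access
- ignore_public_acls: Ignores any public ACLs already on the bucket or its objects
- restrict_public_buckets: If a public bucket policy exists, limits access to AWS service principals and authorized users in the bucket owner's account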
Lifecycle Rules
The module includes three lifecycle rules:
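- Abort incomplete uploads: Cleans up failed multipart uploads after 7 days
- Noncurrent version management: Moves old versions to STANDARD_IA after 30 days, then deletes them after 90 days (created only when versioning is enabled)
- Temp prefix expiration: Automatically deletes files in temp/ after 7 days (created only when temp_prefix_expiration_days is not null)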
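Only the multipart cleanup rule is unconditional; the other two are dynamic blocks driven by module inputs. As a rough illustration, a call like the one below would create a bucket with just the multipart rule. The module label and the "-scratch-dev" bucket name are hypothetical and not part of this tutorial's setup.

```hcl
# Hypothetical extra bucket, shown only to illustrate the lifecycle toggles.
module "scratch_bucket" {
  source = "./modules/s3_bucket"

  bucket_name = "${var.project_name}-scratch-dev"
  environment = "dev"

  versioning_enabled          = false # omits the noncurrent-version rule
  temp_prefix_expiration_days = null  # omits the temp/ expiration rule
}
```

Note that versioning_enabled = false sets the versioning status to "Suspended" rather than removing the versioning resource.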
IAM Policies
The module creates two IAM policies:
| Policy | Permissions | Use Case |
|---|---|---|
| *-read | ListBucket, GetBucketLocation, GetObject, GetObjectVersion | Snowflake read-only integrations |
| *-write | ListBucket, GetBucketLocation, GetObject, GetObjectVersion, PutObject, DeleteObject | Snowflake full integrations |
These policies are attached to Snowflake IAM roles in the Storage Integrations section.
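As a preview of how the policy ARNs are consumed, attaching one to a role could look roughly like this. The role name variable is a placeholder introduced for this sketch; the actual Snowflake roles and their trust policies are defined in the Storage Integrations section.

```hcl
# Hypothetical input: name of an existing IAM role that Snowflake assumes.
variable "snowflake_dev_role_name" {
  description = "Name of the IAM role used by the dev Snowflake integration"
  type        = string
}

# Grant that role read-only access to the dev data lake bucket.
resource "aws_iam_role_policy_attachment" "snowflake_dev_read" {
  role       = var.snowflake_dev_role_name
  policy_arn = module.data_lake["dev"].read_policy_arn
}
```

Because the module is instantiated with for_each, each environment is addressed by its key, for example module.data_lake["dev"] or module.data_lake["prod"].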
Commit and Deploy
Commit your changes and push:
git add modules/s3_bucket/ s3_data_lake.tf variables.tf terraform.tfvars
git commit -m "Add S3 bucket module and data lake buckets"
git push
Verify in AWS
After deployment, verify the buckets:
# List buckets
aws s3 ls --profile data-engineer | grep data-lake
# Check bucket configuration
aws s3api get-bucket-versioning --profile data-engineer --bucket mycompany-data-lake-dev
aws s3api get-bucket-encryption --profile data-engineer --bucket mycompany-data-lake-dev
aws s3api get-public-access-block --profile data-engineer --bucket mycompany-data-lake-dev
# List IAM policies
aws iam list-policies --profile data-engineer --scope Local | grep data-lake
Summary
You've created the S3 infrastructure for your data platform:
- Built the s3_bucket module with versioning, encryption, and lifecycle policies
- Created data lake buckets for dev, staging, and prod environments
- Generated IAM policies for Snowflake integration roles
What's Next
The S3 buckets are ready for Snowflake to access. Next, create the storage integrations that connect Snowflake to these buckets.
Continue to Storage Integrations →