Data Storage Layers¶
Reference for Solutions Architects (SAs) discussing data organization, open table formats, and the medallion architecture with customers. Covers what lives where, why, and how to talk about it without getting lost in file format details.
The Medallion Architecture¶
The most common organizational pattern for lakehouse storage. Data flows through three progressively refined layers:
```mermaid
flowchart LR
    subgraph Sources
        S1[(Operational DBs)]
        S2[Files / APIs]
        S3[Streams]
    end
    subgraph BRONZE["🥉 Bronze — Raw"]
        B1[Exact copy of source\nNo transformation\nAppend-only or full load]
    end
    subgraph SILVER["🥈 Silver — Curated"]
        SI1[Cleaned & deduplicated\nSchema enforced\nJoins applied\nBusiness keys resolved]
    end
    subgraph GOLD["🥇 Gold — Serving"]
        G1[Aggregated & modeled\nOptimized for consumption\nBI-ready, feature-store-ready]
    end
    S1 & S2 & S3 --> BRONZE --> SILVER --> GOLD
    GOLD --> BI[BI Tools]
    GOLD --> SQL[SQL Analysts]
    GOLD --> ML[ML / AI]
```
Bronze Layer — Raw Zone¶
Purpose¶
Land data exactly as received from the source. No transformations, no business logic.
Characteristics¶
- Schema-on-read or schema-as-arrived — raw JSON, CSV, or database snapshots
- Append-only or full table dumps retained with ingestion timestamps
- The "time machine" — allows reprocessing from scratch if downstream logic changes
- Typically not exposed to business users
What Goes Here¶
- Raw CDC event streams
- API response payloads (JSON/XML)
- File drops (CSV, XML, Excel)
- Database table snapshots
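The land-as-received idea can be sketched in a few lines of Python. This is illustrative only — the field names (`source`, `ingested_at`, `raw`) and the in-memory list standing in for an object-store table are assumptions, not any product's API:

```python
import json
from datetime import datetime, timezone

def land_raw(bronze_log, source_name, payload):
    """Append a record exactly as received, plus ingestion metadata.
    No parsing, cleansing, or business logic happens at this layer."""
    bronze_log.append({
        "source": source_name,
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "raw": json.dumps(payload),  # stored verbatim; schema is applied on read
    })

bronze = []
land_raw(bronze, "crm_api", {"id": "42", "email": "A@ex.com "})
land_raw(bronze, "crm_api", {"id": "42", "email": "A@ex.com "})  # duplicates land as-is
```

Note that the duplicate is kept: deduplication is deliberately deferred to silver, which is what makes reprocessing from bronze possible.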
SA Talking Points¶
- "Do you ever need to reprocess historical data when business rules change?" — if yes, bronze is essential
- The cost of bronze storage is low (object storage); the cost of not having it is high (re-extraction from source)
- Compliance use case: bronze is the audit trail — raw data is immutable evidence
Silver Layer — Curated Zone¶
Purpose¶
Cleaned, validated, and joined data. Represents trusted business entities — customers, orders, products — not raw source artifacts.
Characteristics¶
- Schema enforced, data types validated
- Deduplication and null handling applied
- Business keys resolved (e.g., customer_id standardized across sources)
- Slowly Changing Dimensions (SCD) logic often lives here
- Read by data engineers, data scientists, and advanced analysts
What Goes Here¶
- Cleansed dimension tables (dim_customer, dim_product)
- Fact tables with business keys joined in
- Unified customer/product views merged from multiple source systems
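A minimal sketch of the bronze-to-silver step — type enforcement, null handling, and last-write-wins deduplication on a business key. The sample payloads and the `customer_id`/`email` schema are hypothetical:

```python
import json

raw_rows = [  # verbatim bronze payloads; duplicates and bad values included
    {"raw": json.dumps({"id": "42", "email": "A@ex.com "})},
    {"raw": json.dumps({"id": "42", "email": "a@ex.com"})},
    {"raw": json.dumps({"id": "7", "email": None})},
]

def to_silver(bronze_rows):
    """Enforce types, normalize values, and deduplicate on the business key."""
    by_key = {}
    for row in bronze_rows:
        rec = json.loads(row["raw"])
        cust_id = int(rec["id"])  # schema enforcement: non-numeric ids fail loudly
        email = (rec.get("email") or "").strip().lower()  # null handling + normalization
        by_key[cust_id] = {"customer_id": cust_id, "email": email}  # last write wins
    return list(by_key.values())

silver = to_silver(raw_rows)
```

In a real pipeline the same logic runs as a MERGE or windowed dedup over millions of rows, but the responsibilities — types, nulls, keys — are exactly these.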
SA Talking Points¶
- Silver is where "data quality" lives — this is the layer that needs the most governance investment
- Ask: "Who is responsible for the definition of a 'customer' in your organization?" — that answer defines who owns silver
Gold Layer — Serving Zone¶
Purpose¶
Aggregated, pre-computed, business-domain-specific datasets optimized for query performance and consumption.
Characteristics¶
- Star or snowflake schemas for BI tools
- Pre-aggregated metrics and KPIs
- Feature tables for ML models
- Optimized with Z-ordering, clustering, or partitioning for fast query access
- Exposed to business users, BI tools, dashboards
What Goes Here¶
- Sales summary tables (daily/weekly/monthly aggregates)
- Marketing attribution models
- Finance period-end snapshots
- ML feature store tables
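The pre-aggregation idea, sketched over hypothetical silver fact rows — gold stores one row per (date, region) so BI queries never touch the underlying facts:

```python
from collections import defaultdict

orders = [  # silver-layer fact rows (illustrative schema)
    {"order_date": "2024-01-01", "region": "EMEA", "amount": 100.0},
    {"order_date": "2024-01-01", "region": "EMEA", "amount": 50.0},
    {"order_date": "2024-01-02", "region": "AMER", "amount": 75.0},
]

def daily_sales(rows):
    """Pre-aggregate to one row per (date, region)."""
    totals = defaultdict(float)
    for r in rows:
        totals[(r["order_date"], r["region"])] += r["amount"]
    return dict(totals)

gold = daily_sales(orders)
```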
SA Talking Points¶
- Gold is what BI tools and business users actually query — performance here directly impacts analyst satisfaction
- "How many aggregation tables do you have?" — sprawl here usually means governance hasn't caught up with demand
- Gold tables in Delta/Iceberg play the same role as materialized views in a warehouse, with the caveat that refreshes are driven by your pipelines rather than by the engine
Open Table Formats¶
Open table formats bring warehouse-grade features (ACID, schema evolution, time travel) to files stored in object storage. They are the technical foundation of the lakehouse.
Delta Lake¶
Developed by Databricks, now open-source under the Linux Foundation.
| Feature | What It Does |
|---|---|
| ACID Transactions | Concurrent reads/writes without corruption |
| Time Travel | Query any prior version via VERSION AS OF or TIMESTAMP AS OF |
| Schema Evolution | Add/rename columns without rewriting the table |
| Schema Enforcement | Rejects writes that don't match the schema |
| Z-Order Clustering | Co-locates related data in files for faster filtering |
| Auto Optimize | Automatically compacts small files |
| Change Data Feed | Exposes row-level changes — useful for CDC and streaming |
Best for: Databricks-native workloads, streaming + batch unified pipelines
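Time travel is the feature customers most often want demystified. A toy model of the semantics — each commit produces an immutable snapshot, and readers can ask for any prior version, analogous to Delta's `VERSION AS OF`. This is a conceptual sketch, not how the Delta transaction log is implemented:

```python
class VersionedTable:
    """Toy time travel: commits append immutable snapshots; reads pick a version."""
    def __init__(self):
        self._versions = [[]]  # version 0 is the empty table

    def commit(self, rows):
        self._versions.append(self._versions[-1] + rows)
        return len(self._versions) - 1  # the new version number

    def read(self, version=None):
        """Latest snapshot by default, or any prior version on request."""
        return list(self._versions[-1 if version is None else version])

t = VersionedTable()
v1 = t.commit([{"id": 1}])
v2 = t.commit([{"id": 2}])
```

The real log stores file-level adds and removes rather than full copies, but the reader-facing contract — any version remains queryable until vacuumed — is the same.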
Apache Iceberg¶
Originated at Netflix, now an Apache top-level project. Widely supported across engines.
| Feature | What It Does |
|---|---|
| ACID Transactions | Full read/write isolation |
| Hidden Partitioning | Queries filter on raw columns while partition transforms are applied automatically; partition specs can evolve without rewriting data |
| Row-level Deletes | Efficient deletes without full file rewrites |
| Multi-engine | Spark, Flink, Trino, Snowflake, BigQuery all read Iceberg natively |
| Time Travel | Snapshot-based versioning |
Best for: Multi-engine environments, customers using Snowflake + Spark + Trino together
Apache Hudi¶
Originated at Uber, optimized for record-level upserts.
| Feature | What It Does |
|---|---|
| Upsert Performance | Optimized for high-frequency record-level updates (CDC) |
| Incremental Queries | Efficiently read only changed records since last query |
| ACID | Full transaction support |
Best for: High-volume CDC use cases, AWS-centric environments (DeltaStreamer on EMR)
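Incremental queries are Hudi's signature capability, and the idea is easy to show in miniature: writes carry a commit time, and a consumer reads only records committed after its last checkpoint. The `commit`/`key`/`val` fields are illustrative, not Hudi's actual metadata columns:

```python
# Toy incremental read: each record is tagged with the commit that wrote it.
commit_log = [
    {"commit": 1, "key": "a", "val": 10},
    {"commit": 2, "key": "b", "val": 20},
    {"commit": 3, "key": "a", "val": 11},  # upsert of key "a"
]

def incremental(log, since):
    """Return only records written after the consumer's last checkpoint."""
    return [r for r in log if r["commit"] > since]
```

A downstream job that checkpointed at commit 1 re-reads two records instead of the whole table — that delta is what makes high-frequency CDC pipelines affordable.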
Format Comparison¶
| | Delta Lake | Iceberg | Hudi |
|---|---|---|---|
| Best engine | Databricks / Spark | Any engine | Spark + CDC |
| Multi-engine support | Growing (via Delta Kernel) | Excellent | Moderate |
| Streaming support | Excellent | Good | Good |
| Upsert performance | Good | Good | Excellent |
| Community | Large (Databricks-backed) | Large (Apache) | Moderate |
SA Talking Points¶
- Customers don't usually pick a format — they inherit one from their platform choice (Databricks → Delta, Snowflake → Iceberg preferred)
- The important message is open vs. proprietary — all three formats let customers move to a different compute engine without re-ingesting data
- Delta-Iceberg interoperability is improving (UniForm) — less of a "pick one forever" decision than it used to be
Partitioning and Clustering¶
Why It Matters¶
Large tables need physical organization to avoid full table scans. The wrong strategy can make queries 10–100x slower.
Partitioning¶
Divides a table into sub-directories by a column value (e.g., year/month/day). Queries that filter on the partition column skip entire partitions.
- Good partition columns: Date/time, region, business unit — low-to-medium cardinality, commonly filtered
- Bad partition columns: User ID, transaction ID — too many partitions, small files problem
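Partition pruning can be sketched in a few lines — files are grouped under a partition value, and a query filtering on that column only opens matching groups. The file names and date layout are hypothetical:

```python
# Toy partition pruning: files grouped by a date partition.
partitions = {
    "2024-01-01": ["part-0001.parquet", "part-0002.parquet"],
    "2024-01-02": ["part-0003.parquet"],
    "2024-01-03": ["part-0004.parquet"],
}

def files_to_scan(partitions, date_filter):
    """Return only the files whose partition value passes the filter."""
    return [f for date, files in partitions.items() if date_filter(date) for f in files]

hit = files_to_scan(partitions, lambda d: d == "2024-01-02")  # 1 file, not 4
```

This is also why high-cardinality partition columns backfire: one partition per user ID means millions of tiny file groups and no meaningful pruning.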
Z-Ordering (Delta) / Sorting (Iceberg)¶
Co-locates rows with similar values in the same files. Enables efficient skipping within a partition when filtering on non-partition columns.
- Best for high-cardinality filter columns (customer_id, product_id)
- Complements partitioning — partition by date, Z-order by customer_id
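The mechanism behind Z-ordering is file skipping via per-file min/max statistics: after clustering, rows with similar values land in the same files, so a point lookup can rule most files out without reading them. A minimal sketch with assumed file names and stats:

```python
# Per-file min/max stats on customer_id after clustering (hypothetical values).
file_stats = [
    ("f1.parquet", 1, 1000),
    ("f2.parquet", 1001, 2000),
    ("f3.parquet", 2001, 3000),
]

def candidate_files(stats, wanted_id):
    """Keep only files whose min/max range could contain the value."""
    return [f for f, lo, hi in stats if lo <= wanted_id <= hi]
```

Without clustering, every file's min/max range tends to span the whole key space, so nothing can be skipped — the stats exist either way; clustering is what makes them selective.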
SA Talking Points¶
- "What columns does your team filter on most?" — that answer drives the partitioning and clustering strategy
- Small files are the most common performance problem in production lakehouses — auto-compaction (Delta) or scheduled compaction (Iceberg) is essential
SA Rule of Thumb: Medallion + Delta Lake covers the vast majority of Databricks-centric architectures. Iceberg is the answer when customers need true multi-engine access or are standardizing on Snowflake as a co-platform.