
Data Storage Layers

Reference for SAs discussing data organization, open table formats, and the medallion architecture with customers. Covers what lives where, why, and how to talk about it without getting lost in file format details.


The Medallion Architecture

The most common organizational pattern for lakehouse storage. Data flows through three progressively refined layers:

flowchart LR
    subgraph Sources
        S1[(Operational DBs)]
        S2[Files / APIs]
        S3[Streams]
    end
    subgraph BRONZE["🥉 Bronze — Raw"]
        B1[Exact copy of source\nNo transformation\nAppend-only or full load]
    end
    subgraph SILVER["🥈 Silver — Curated"]
        SI1[Cleaned & deduplicated\nSchema enforced\nJoins applied\nBusiness keys resolved]
    end
    subgraph GOLD["🥇 Gold — Serving"]
        G1[Aggregated & modeled\nOptimized for consumption\nBI-ready, feature-store-ready]
    end
    S1 & S2 & S3 --> BRONZE --> SILVER --> GOLD
    GOLD --> BI[BI Tools]
    GOLD --> SQL[SQL Analysts]
    GOLD --> ML[ML / AI]

Bronze Layer — Raw Zone

Purpose

Land data exactly as received from the source. No transformations, no business logic.

Characteristics

  • Schema-on-read or schema-as-arrived — raw JSON, CSV, or database snapshots
  • Append-only or full table dumps retained with ingestion timestamps
  • The "time machine" — allows reprocessing from scratch if downstream logic changes
  • Typically not exposed to business users

What Goes Here

  • Raw CDC event streams
  • API response payloads (JSON/XML)
  • File drops (CSV, XML, Excel)
  • Database table snapshots
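
A minimal PySpark sketch of what a bronze landing step might look like, assuming Delta Lake on object storage; the paths and column names are illustrative, not a prescribed layout:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Read the raw files exactly as they arrived -- no parsing of business
# fields, no filtering, no renaming. (Path is a hypothetical landing zone.)
raw = spark.read.json("/landing/orders/2024-06-01/")

# Stamp each row with ingestion metadata so downstream reprocessing can
# reconstruct "what did we know, and when did we know it".
bronze = (
    raw
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.input_file_name())
)

# Append-only write: bronze is the immutable record of what the source sent.
bronze.write.format("delta").mode("append").save("/lake/bronze/orders")
```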

SA Talking Points

  • "Do you ever need to reprocess historical data when business rules change?" — if yes, bronze is essential
  • The cost of bronze storage is low (object storage); the cost of not having it is high (re-extraction from source)
  • Compliance use case: bronze is the audit trail — raw data is immutable evidence

Silver Layer — Curated Zone

Purpose

Cleaned, validated, and joined data. Represents trusted business entities — customers, orders, products — not raw source artifacts.

Characteristics

  • Schema enforced, data types validated
  • Deduplication and null handling applied
  • Business keys resolved (e.g., customer_id standardized across sources)
  • Slowly Changing Dimensions (SCD) logic often lives here
  • Read by data engineers, data scientists, and advanced analysts

What Goes Here

  • Cleansed dimension tables (dim_customer, dim_product)
  • Fact tables with business keys joined in
  • Unified customer/product views merged from multiple source systems
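
A sketch of a typical bronze-to-silver step, again assuming Delta Lake; the table paths, the customer_id business key, and the _ingested_at column are illustrative:

```python
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Enforce types, drop rows without a business key, and keep only the latest
# record per customer (deduplication). Paths and columns are illustrative.
latest = (
    spark.read.format("delta").load("/lake/bronze/customers")
    .filter(F.col("customer_id").isNotNull())
    .withColumn("customer_id", F.col("customer_id").cast("string"))
    .withColumn("_rn", F.row_number().over(
        Window.partitionBy("customer_id").orderBy(F.col("_ingested_at").desc())))
    .filter("_rn = 1")
    .drop("_rn")
)

# Upsert into the curated dimension on the resolved business key.
dim = DeltaTable.forPath(spark, "/lake/silver/dim_customer")
(dim.alias("t")
    .merge(latest.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())
```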

SA Talking Points

  • Silver is where "data quality" lives — this is the layer that needs the most governance investment
  • Ask: "Who is responsible for the definition of a 'customer' in your organization?" — that answer defines who owns silver

Gold Layer — Serving Zone

Purpose

Aggregated, pre-computed, business-domain-specific datasets optimized for query performance and consumption.

Characteristics

  • Star or snowflake schemas for BI tools
  • Pre-aggregated metrics and KPIs
  • Feature tables for ML models
  • Optimized with Z-ordering, clustering, or partitioning for fast query access
  • Exposed to business users, BI tools, dashboards

What Goes Here

  • Sales summary tables (daily/weekly/monthly aggregates)
  • Marketing attribution models
  • Finance period-end snapshots
  • ML feature store tables
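
A sketch of a silver-to-gold aggregation, with an illustrative fact table and grain; the point is precomputation, not the specific metrics:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Build a daily sales summary from a curated silver fact table.
# Table paths, columns, and metrics are illustrative.
orders = spark.read.format("delta").load("/lake/silver/fct_orders")

daily_sales = (
    orders
    .groupBy(F.to_date("order_ts").alias("order_date"), "region")
    .agg(
        F.sum("order_amount").alias("gross_revenue"),
        F.countDistinct("customer_id").alias("active_customers"),
        F.count(F.lit(1)).alias("order_count"),
    )
)

# Overwrite the serving table so BI tools always see the latest aggregates.
daily_sales.write.format("delta").mode("overwrite").save("/lake/gold/daily_sales")
```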

SA Talking Points

  • Gold is what BI tools and business users actually query — performance here directly impacts analyst satisfaction
  • "How many aggregation tables do you have?" — sprawl here usually means governance hasn't caught up with demand
  • Gold tables in Delta/Iceberg play the same role as materialized views in a warehouse: precomputed results served to consumers. The difference is that your pipelines refresh them, not the engine

Open Table Formats

Open table formats bring warehouse-grade features (ACID, schema evolution, time travel) to files stored in object storage. They are the technical foundation of the lakehouse.

Delta Lake

Developed by Databricks, now open-source under the Linux Foundation.

Feature | What It Does
ACID Transactions | Concurrent reads/writes without corruption
Time Travel | Query any prior version via VERSION AS OF or TIMESTAMP AS OF
Schema Evolution | Add/rename columns without rewriting the table
Schema Enforcement | Rejects writes that don't match the schema
Z-Order Clustering | Co-locates related data in files for faster filtering
Auto Optimize | Automatically compacts small files
Change Data Feed | Exposes row-level changes — useful for CDC and streaming

Best for: Databricks-native workloads, streaming + batch unified pipelines
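
A quick time-travel illustration against a hypothetical gold.daily_sales table; both forms below are standard Delta Lake syntax:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# SQL form: query the table as of an earlier version or point in time.
v42 = spark.sql("SELECT * FROM gold.daily_sales VERSION AS OF 42")
june1 = spark.sql("SELECT * FROM gold.daily_sales TIMESTAMP AS OF '2024-06-01'")

# DataFrame form against a path-based table (path is illustrative).
v42_df = (
    spark.read.format("delta")
    .option("versionAsOf", 42)
    .load("/lake/gold/daily_sales")
)
```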

Apache Iceberg

Originated at Netflix, now an Apache top-level project. Widely supported across engines.

Feature | What It Does
ACID Transactions | Full read/write isolation
Hidden Partitioning | Partition values derived from column transforms; the partition spec can evolve without rewriting data
Row-level Deletes | Efficient deletes without full file rewrites
Multi-engine | Spark, Flink, Trino, Snowflake, and BigQuery all read Iceberg natively
Time Travel | Snapshot-based versioning

Best for: Multi-engine environments, customers using Snowflake + Spark + Trino together
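
A sketch of what access from Spark looks like, assuming an Iceberg catalog named lake is already configured and Spark 3.3 or later; catalog and table names are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Plain reads use normal table identifiers once the Iceberg catalog is
# configured -- there is no Iceberg-specific code in the query itself.
orders = spark.table("lake.silver.orders")

# Snapshot-based time travel via Spark SQL.
as_of = spark.sql(
    "SELECT * FROM lake.silver.orders TIMESTAMP AS OF '2024-06-01 00:00:00'"
)

# Trino, Flink, or Snowflake can read the same table through their own
# Iceberg connectors -- that multi-engine reach is the real selling point.
```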

Apache Hudi

Originated at Uber, optimized for record-level upserts.

Feature | What It Does
Upsert Performance | Optimized for high-frequency record-level updates (CDC)
Incremental Queries | Efficiently read only changed records since last query
ACID | Full transaction support

Best for: High-volume CDC use cases, AWS-centric environments (DeltaStreamer on EMR)
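
A minimal upsert sketch using the Hudi Spark datasource; the table name, record key, and precombine field are illustrative:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A batch of changed records, e.g. from a CDC feed (path and schema are
# illustrative).
updates = spark.read.parquet("/landing/cdc/orders/")

# Upsert by record key; Hudi uses the precombine field to keep the latest
# version when the same key appears more than once in a batch.
(updates.write.format("hudi")
    .option("hoodie.table.name", "orders")
    .option("hoodie.datasource.write.recordkey.field", "order_id")
    .option("hoodie.datasource.write.precombine.field", "updated_at")
    .option("hoodie.datasource.write.operation", "upsert")
    .mode("append")
    .save("/lake/silver/orders_hudi"))
```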

Format Comparison

 | Delta Lake | Iceberg | Hudi
Best engine | Databricks / Spark | Any engine | Spark + CDC
Multi-engine support | Growing (via Delta Kernel) | Excellent | Moderate
Streaming support | Excellent | Good | Good
Upsert performance | Good | Good | Excellent
Community | Large (Databricks-backed) | Large (Apache) | Moderate

SA Talking Points

  • Customers don't usually pick a format — they inherit one from their platform choice (Databricks → Delta, Snowflake → Iceberg preferred)
  • The important message is open vs. proprietary — all three formats let customers move to a different compute engine without re-ingesting data
  • Delta-Iceberg interoperability is improving (UniForm) — less of a "pick one forever" decision than it used to be

Partitioning and Clustering

Why It Matters

Large tables need physical organization to avoid full table scans. The wrong strategy can make queries 10–100x slower.

Partitioning

Divides a table into sub-directories by a column value (e.g., year/month/day). Queries that filter on the partition column skip entire partitions.

  • Good partition columns: Date/time, region, business unit — low-to-medium cardinality, commonly filtered
  • Bad partition columns: User ID, transaction ID — too many partitions, small files problem
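
A sketch of date partitioning on write, shown with Delta (Iceberg expresses the same idea through its partition spec); paths and columns are illustrative:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

events = spark.read.format("delta").load("/lake/silver/events")

# Partition by a low-cardinality, commonly filtered column. Queries with
# WHERE event_date = '...' skip every other partition entirely.
(events
    .withColumn("event_date", F.to_date("event_ts"))
    .write.format("delta")
    .mode("overwrite")
    .partitionBy("event_date")
    .save("/lake/silver/events_by_date"))
```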

Z-Ordering (Delta) / Sorting (Iceberg)

Co-locates rows with similar values in the same files. Enables efficient skipping within a partition when filtering on non-partition columns.

  • Best for high-cardinality filter columns (customer_id, product_id)
  • Complements partitioning — partition by date, Z-order by customer_id
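
In Delta this is a maintenance operation rather than a write option. A sketch against the hypothetical table from the partitioning example, using the documented OPTIMIZE ... ZORDER BY syntax:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Partition pruning handles the date filter; Z-ordering lets Delta skip
# files within each partition when filtering on customer_id.
spark.sql("""
    OPTIMIZE delta.`/lake/silver/events_by_date`
    ZORDER BY (customer_id)
""")
```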

SA Talking Points

  • "What columns does your team filter on most?" — that answer drives the partitioning and clustering strategy
  • Small files are the most common performance problem in production lakehouses — auto-compaction (Delta) or scheduled compaction (Iceberg) is essential
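
As a sketch of what "turning compaction on" can look like for a single Delta table; the delta.autoOptimize properties shown are the Databricks/Delta auto-optimize table properties, so verify they are available in the customer's runtime before quoting them:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Enable optimized writes and automatic compaction of small files for one
# table. On Iceberg, the equivalent is a scheduled data-file rewrite job.
spark.sql("""
    ALTER TABLE silver.events SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact'   = 'true'
    )
""")
```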

SA Rule of Thumb: Medallion + Delta Lake covers the vast majority of Databricks-centric architectures. Iceberg is the answer when customers need true multi-engine access or are standardizing on Snowflake as a co-platform.