
Data Architecture Patterns

An SA reference for understanding and positioning the major data platform architectural patterns. The focus is on helping customers choose the right pattern for their maturity level and workload mix, not on implementation details.


The Landscape at a Glance

graph TD
    subgraph Patterns
        DW[Data Warehouse]
        DL[Data Lake]
        LH[Lakehouse]
        DM[Data Mesh]
        DF[Data Fabric]
    end
    DW -->|"evolved to"| LH
    DL -->|"evolved to"| LH
    LH -->|"scaled with domain ownership"| DM
    DM & LH -->|"unified with metadata + AI"| DF

Data Warehouse

What It Is

A structured, schema-on-write repository purpose-built for SQL analytics. Data is cleaned, modeled (star/snowflake schema), and stored in a columnar format optimized for aggregation queries.
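
To make "schema-on-write" and "star schema" concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is a row store, not a columnar warehouse, so it only stands in for the modeling idea. The table and column names (dim_product, fact_sales) are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Star schema: one fact table referencing one dimension table.
# Table and column names are illustrative, not from any real warehouse.
conn.executescript("""
CREATE TABLE dim_product (
    product_id INTEGER PRIMARY KEY,
    category   TEXT NOT NULL
);
CREATE TABLE fact_sales (
    sale_id    INTEGER PRIMARY KEY,
    product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
    amount     REAL NOT NULL CHECK (amount >= 0)
);
""")

conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "books"), (2, "games")])
conn.executemany("INSERT INTO fact_sales (product_id, amount) VALUES (?, ?)",
                 [(1, 10.0), (1, 15.0), (2, 40.0)])

# Schema-on-write: a row that violates the model is rejected at load time.
try:
    conn.execute("INSERT INTO fact_sales (product_id, amount) VALUES (?, ?)",
                 (1, -5.0))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Typical BI query: aggregate the fact table by a dimension attribute.
totals = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d USING (product_id)
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
```

The bad row never reaches the table, which is the trade-off the warehouse makes: more work at load time in exchange for predictable, clean data at query time.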

Key players: Snowflake, Google BigQuery, Amazon Redshift, Azure Synapse Analytics, Teradata

Why Customers Use It

  • Predictable query performance with low operational overhead
  • Mature BI tooling integration (Tableau, Power BI, etc.)
  • Strong governance and access control primitives
  • ACID compliance out of the box

Limitations

  • Poor fit for unstructured data (logs, images, documents)
  • Schema changes are expensive and slow
  • ML/AI workloads require data to be copied out to separate systems
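  • In traditional warehouses, storage and compute scale together (less flexible); cloud-native warehouses such as Snowflake and BigQuery separate the two, but data stays in the vendor's managed storage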

SA Talking Points

  • Best for customers with mature, stable data models and heavy SQL BI workloads
  • Ask: "How much of your data is structured vs. semi-structured or unstructured?" A high unstructured volume is a red flag for a pure warehouse
  • "What happens when your data scientist needs the raw data?" — siloed raw stores signal warehouse limitations

Data Lake

What It Is

A centralized repository that stores data in its raw, native format (files on object storage — S3, ADLS, GCS). Schema is applied at read time (schema-on-read), enabling flexibility at ingestion but complexity at consumption.
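
Schema-on-read can be illustrated with a short sketch: raw records are stored exactly as they arrive, and each consumer applies its own schema when reading. The sample records and the read_with_schema helper are hypothetical, and a list of strings stands in for files on object storage.

```python
import json

# Raw events as they might land in the lake: nothing was validated on write.
RAW_EVENTS = [
    '{"user": "a", "amount": "12.50", "ts": "2024-01-01"}',
    '{"user": "b", "amount": 7}',
    '{"user": "c", "note": "no amount field"}',
    'not even json',
]

def read_with_schema(lines):
    """Apply a (user: str, amount: float) schema at read time.

    Returns the rows that fit the schema and a count of rows that do not.
    """
    good, rejected = [], 0
    for line in lines:
        try:
            rec = json.loads(line)
            good.append({"user": str(rec["user"]),
                         "amount": float(rec["amount"])})
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            rejected += 1
    return good, rejected

rows, rejected = read_with_schema(RAW_EVENTS)
```

Ingestion accepted all four records without complaint; the cost shows up at consumption, where every reader has to decide what to do with the two records that do not fit.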

Key players: AWS S3 + Glue, Azure Data Lake Storage, Google Cloud Storage, Hadoop HDFS (legacy)

Why Customers Build It

  • Cheap, elastic storage for all data types
  • Raw data retained for reprocessing as needs evolve
  • Single source of truth for data science and ML workloads

Limitations

  • "Data swamp" problem — without governance, becomes unusable
  • No ACID transactions — data consistency is the consumer's problem
  • Poor BI performance without a serving layer on top
  • Maintenance burden: small files problem, compaction, schema drift
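
The small files problem and compaction mentioned above can be sketched locally. This is a toy that uses a temporary directory in place of object storage and JSON Lines in place of Parquet; the compact function and the part-*.jsonl naming are assumptions for illustration only.

```python
import tempfile
from pathlib import Path

def compact(directory: Path, target: str = "compacted.jsonl") -> int:
    """Merge every part-*.jsonl file into one file, then delete the parts.

    Returns the number of small files that were merged.
    """
    parts = sorted(directory.glob("part-*.jsonl"))
    with (directory / target).open("w") as dst:
        for part in parts:
            dst.write(part.read_text())
    for part in parts:
        part.unlink()
    return len(parts)

with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    # Simulate a streaming job that wrote one tiny file per record.
    for i in range(100):
        (lake / f"part-{i:04d}.jsonl").write_text(f'{{"id": {i}}}\n')

    merged = compact(lake)
    remaining = sorted(p.name for p in lake.iterdir())
    line_count = len((lake / "compacted.jsonl").read_text().splitlines())
```

Note that this naive version is not safe if a reader lists the directory mid-compaction, which is exactly the consistency gap that open table formats close.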

SA Talking Points

  • Most customers have a data lake whether they planned one or not (S3/ADLS buckets accumulating data)
  • The question is not "should we build a lake" but "how do we make what we have usable"
  • A lake without a catalog is just a file system with a cloud bill

Lakehouse

What It Is

An architecture that combines the low-cost, flexible storage of a data lake with the ACID transactions, schema enforcement, and performance optimizations of a data warehouse — typically built on open table formats (Delta Lake, Apache Iceberg, Apache Hudi) sitting on top of object storage.

Key players: Databricks (Delta Lake), Apache Iceberg on any cloud, Snowflake (Iceberg external tables)

Architecture

flowchart LR
    subgraph Ingestion
        CDC[CDC / Streaming]
        BATCH[Batch / ELT]
    end
    subgraph Storage["Open Table Storage (S3 / ADLS / GCS)"]
        BRONZE[Bronze\nRaw]
        SILVER[Silver\nCleansed]
        GOLD[Gold\nAggregated]
    end
    subgraph Serving
        SQL[SQL Analytics\nDatabricks SQL / BigQuery]
        ML[ML / AI\nNotebooks / MLflow]
        BI[BI Tools\nTableau / Power BI]
    end
    CDC & BATCH --> BRONZE --> SILVER --> GOLD
    GOLD --> SQL & ML & BI
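
The Bronze, Silver, Gold flow in the diagram above can be sketched with plain Python lists standing in for tables. The sample orders and the to_silver and to_gold helpers are hypothetical; a real pipeline would run on an engine such as Spark against table-format storage.

```python
# Bronze: raw records exactly as ingested (strings, duplicates, gaps).
bronze = [
    {"order_id": "1", "region": "EU", "amount": "100.0"},
    {"order_id": "2", "region": "eu", "amount": "50.5"},
    {"order_id": "2", "region": "eu", "amount": "50.5"},  # duplicate delivery
    {"order_id": "3", "region": "US", "amount": None},     # missing amount
    {"order_id": "4", "region": "US", "amount": "20"},
]

def to_silver(rows):
    """Cleanse: drop incomplete rows, de-duplicate, normalize types and casing."""
    seen, out = set(), []
    for r in rows:
        if r["amount"] is None or r["order_id"] in seen:
            continue
        seen.add(r["order_id"])
        out.append({
            "order_id": int(r["order_id"]),
            "region": r["region"].upper(),
            "amount": float(r["amount"]),
        })
    return out

def to_gold(rows):
    """Aggregate: revenue per region, ready for BI tools."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["amount"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
```

Keeping Bronze untouched means Silver and Gold can always be rebuilt when cleansing rules change.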

Why It Matters

  • Single copy of data — no need to ETL data out to a separate warehouse for BI and a separate data lake for ML
  • Open formats — data is not locked to a vendor; multiple engines can read the same table
  • ACID on object storage — time travel, schema evolution, and concurrent writes without the warehouse price tag
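
How a table format can offer atomic commits, concurrent-write safety, and time travel on top of immutable files is easier to see in a toy model. The ToyTable class below is a deliberately simplified sketch, not the Delta Lake or Iceberg protocol: it keeps an in-memory list of versions, where each version is the set of data files that make up the table.

```python
class ToyTable:
    """Append-only commit log; each entry lists the files in that version."""

    def __init__(self):
        self._log = [[]]  # version 0 is the empty table

    def commit(self, add=(), remove=(), expected_version=None):
        """Atomically publish a new version.

        If expected_version is given and another writer has committed since,
        the commit is refused (optimistic concurrency control).
        """
        latest = len(self._log) - 1
        if expected_version is not None and expected_version != latest:
            raise RuntimeError("conflict: table changed since it was read")
        removed = set(remove)
        files = [f for f in self._log[latest] if f not in removed] + list(add)
        self._log.append(files)
        return latest + 1

    def snapshot(self, version=None):
        """Read the latest version, or an older one (time travel)."""
        return list(self._log[-1 if version is None else version])


table = ToyTable()
v1 = table.commit(add=["a.parquet", "b.parquet"])

# A compaction job rewrites two files into one, based on version 1.
v2 = table.commit(add=["ab.parquet"], remove=["a.parquet", "b.parquet"],
                  expected_version=v1)

# A second writer that also started from version 1 is rejected.
try:
    table.commit(add=["c.parquet"], expected_version=v1)
    conflict = False
except RuntimeError:
    conflict = True

current = table.snapshot()
as_of_v1 = table.snapshot(v1)
```

Readers only ever see a complete version, and old versions stay readable until they are cleaned up, which is the intuition behind time travel.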

SA Talking Points

  • Position as "the warehouse and the lake, unified" — resonates with customers tired of managing two systems
  • Open formats (Delta/Iceberg) are the key differentiator vs. proprietary warehouses — data is always accessible
  • Ask: "Do your data scientists work on the same data your BI team uses, or do they have their own copy?" — two copies = lakehouse conversation

Data Mesh

What It Is

A sociotechnical architecture — not a technology — that decentralizes data ownership to domain teams. Each domain (e.g. Sales, Finance, Logistics) owns, produces, and publishes its data as a product. A central platform team provides the infrastructure (the "data platform as a product").

Four principles (Zhamak Dehghani):

  1. Domain-oriented ownership
  2. Data as a product
  3. Self-serve data infrastructure
  4. Federated computational governance
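
"Data as a product" and "federated computational governance" can be made concrete with a small sketch: each domain publishes a contract for its data product, and the platform runs the same automated checks against every contract. The DataProduct fields and the three rules below are hypothetical examples, not a standard.

```python
from dataclasses import dataclass, field

@dataclass
class DataProduct:
    """A contract a domain team publishes alongside its data."""
    name: str
    domain: str
    owner: str                      # accountable team or person
    schema: dict                    # column name -> type
    freshness_hours: int            # promised maximum staleness
    pii_columns: set = field(default_factory=set)

def governance_violations(product: DataProduct) -> list:
    """Global rules, defined once and applied to every domain's product."""
    problems = []
    if not product.owner:
        problems.append("missing owner")
    if product.freshness_hours > 24:
        problems.append("freshness SLA looser than 24h")
    undeclared = product.pii_columns - set(product.schema)
    if undeclared:
        problems.append(f"PII columns not in schema: {sorted(undeclared)}")
    return problems

orders = DataProduct(
    name="orders", domain="Sales", owner="sales-data@example.com",
    schema={"order_id": "int", "email": "string"},
    freshness_hours=6, pii_columns={"email"},
)
shipments = DataProduct(
    name="shipments", domain="Logistics", owner="",
    schema={"shipment_id": "int"},
    freshness_hours=48, pii_columns={"driver_phone"},
)

orders_issues = governance_violations(orders)
shipments_issues = governance_violations(shipments)
```

The domains own the contracts; the platform owns the checks. That split is the "federated" part.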

When It Makes Sense

  • Large organizations with many distinct business domains
  • Central data team is a bottleneck — SLAs are broken, backlogs are months long
  • Domains have the engineering capacity to own their data pipelines

Limitations

  • High organizational maturity required — fails without domain buy-in
  • Governance is hard: federated means different teams define quality differently
  • Not a technology you buy — it is an operating model change

SA Talking Points

  • Data mesh is often misunderstood as a product — it is an org model
  • Ask: "Is your central data team a bottleneck?" — if yes, explore mesh principles even if a full mesh isn't the answer
  • Databricks Unity Catalog enables federated governance, which is the technical foundation for mesh — but the culture has to come first

Data Fabric

What It Is

A design concept where metadata, AI, and automation are used to create a unified, intelligent data management layer across disparate systems — regardless of where data lives. Think of it as a mesh of existing systems connected by an active metadata graph.
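
The "active metadata graph" idea can be sketched as a lineage graph that spans systems, queried to answer questions such as "what breaks if this source changes?". The asset names and the downstream helper are hypothetical; commercial fabric products build and maintain such graphs automatically.

```python
from collections import deque

# Edges point from an asset to the assets derived from it, across systems.
LINEAGE = {
    "postgres.orders": ["s3.raw_orders"],
    "s3.raw_orders": ["snowflake.fct_orders"],
    "snowflake.fct_orders": ["tableau.revenue_dashboard", "s3.ml_features"],
    "s3.ml_features": [],
    "tableau.revenue_dashboard": [],
}

def downstream(asset, graph):
    """Breadth-first walk returning every asset affected by a change to `asset`."""
    seen, queue = [], deque([asset])
    while queue:
        for child in graph.get(queue.popleft(), []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

impacted = downstream("postgres.orders", LINEAGE)
```

The data never moves in this model; only the metadata about it is unified, which is why fabric appeals to heterogeneous estates.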

Key players: Informatica, IBM, Talend, Microsoft Fabric

How It Differs from Data Mesh

| Dimension | Data Mesh | Data Fabric |
|---|---|---|
| Focus | Organizational/social | Technology/automation |
| Ownership | Decentralized to domains | Varies, often centralized |
| Glue | Operating model | Active metadata + AI |
| Buy vs. build | Mostly build | Often buy (platform vendors) |

SA Talking Points

  • Fabric is appealing to customers who want governance and discoverability without reorganizing teams
  • Often positioned by incumbent vendors (IBM, Informatica) as an upgrade path
  • Watch for vendor-specific "data fabric" definitions — it is a marketing term as much as an architecture term

Choosing the Right Pattern

| Customer Signal | Likely Pattern |
|---|---|
| Heavy SQL BI, stable schema, limited ML | Data Warehouse |
| Lots of raw/unstructured data, ML-first | Data Lake |
| Mix of BI + ML, cloud-native, cost-conscious | Lakehouse |
| Large org, central team is a bottleneck, domain autonomy needed | Data Mesh |
| Heterogeneous environments, existing systems to connect, metadata-driven | Data Fabric |
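
The signal-to-pattern mapping above can be turned into a rough discovery aid. The sketch below scores each pattern by the fraction of its signals observed in a customer conversation. The signal identifiers and the scoring rule are assumptions made for illustration, and the output is a conversation starter, not a recommendation engine.

```python
# Signals per pattern, transcribed from the table above into short identifiers.
SIGNALS = {
    "Data Warehouse": {"heavy_sql_bi", "stable_schema", "limited_ml"},
    "Data Lake": {"raw_unstructured_data", "ml_first"},
    "Lakehouse": {"bi_and_ml_mix", "cloud_native", "cost_conscious"},
    "Data Mesh": {"large_org", "central_team_bottleneck", "domain_autonomy"},
    "Data Fabric": {"heterogeneous_environment", "existing_systems",
                    "metadata_driven"},
}

def rank_patterns(observed):
    """Return (pattern, score) pairs, highest share of matched signals first."""
    scores = {
        pattern: len(signals & observed) / len(signals)
        for pattern, signals in SIGNALS.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Example: notes from a discovery call.
observed = {"bi_and_ml_mix", "cloud_native", "heavy_sql_bi"}
ranked = rank_patterns(observed)
best_pattern = ranked[0][0]
```

Patterns are not mutually exclusive, so a close second place (for example Lakehouse plus Data Mesh) is often the more useful finding than the top score alone.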

SA Rule of Thumb: Start with the lakehouse; as a rough heuristic, it covers about 80% of enterprise use cases. Only introduce mesh or fabric when the organizational or integration complexity clearly demands it.