Data Architecture Patterns¶

An SA reference for understanding and positioning the major data platform architectural patterns. The focus is on helping customers choose the right pattern for their maturity level and workload mix, not on implementation details.

The Landscape at a Glance¶

graph TD
    subgraph Patterns
        DW[Data Warehouse]
        DL[Data Lake]
        LH[Lakehouse]
        DM[Data Mesh]
        DF[Data Fabric]
    end
    DW -->|"evolved to"| LH
    DL -->|"evolved to"| LH
    LH -->|"scaled with domain ownership"| DM
    DM & LH -->|"unified with metadata + AI"| DF

Data Warehouse¶

What It Is¶

A structured, schema-on-write repository purpose-built for SQL analytics. Data is cleaned, modeled (star/snowflake schema), and stored in a columnar format optimized for aggregation queries.

Key players: Snowflake, Google BigQuery, Amazon Redshift, Azure Synapse Analytics, Teradata
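
As a rough illustration of schema-on-write and a star schema, the sketch below uses Python's built-in `sqlite3` module. SQLite is row-oriented rather than columnar, and the table and column names (`dim_product`, `fact_sales`) are hypothetical; the point is only to show that data is validated against a declared model at load time and then queried with a join-and-aggregate pattern.

```python
import sqlite3

# Hypothetical star schema: one fact table keyed to one dimension table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (
        product_id INTEGER PRIMARY KEY,
        category   TEXT NOT NULL
    );
    CREATE TABLE fact_sales (
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES dim_product(product_id),
        amount     REAL NOT NULL CHECK (amount >= 0)
    );
""")
conn.executemany("INSERT INTO dim_product VALUES (?, ?)",
                 [(1, "books"), (2, "games")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)",
                 [(1, 1, 10.0), (2, 1, 5.0), (3, 2, 20.0)])

# Schema-on-write: a row that violates the declared schema is rejected at load time.
try:
    conn.execute("INSERT INTO fact_sales VALUES (4, 1, NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Typical BI aggregation: join the fact table to its dimension, then group.
totals = conn.execute("""
    SELECT d.category, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS d USING (product_id)
    GROUP BY d.category
    ORDER BY d.category
""").fetchall()
```

The rejected insert is the trade-off in miniature: bad data never reaches analysts, but every upstream change has to be reconciled with the model before it can load.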

Why Customers Use It¶

Predictable query performance with low operational overhead
Mature BI tooling integration (Tableau, Power BI, etc.)
Strong governance and access control primitives
ACID compliance out of the box

Limitations¶

Poor fit for unstructured data (logs, images, documents)
Schema changes are expensive and slow
ML/AI workloads require data to be copied out to separate systems
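Storage and compute scale together in traditional, tightly coupled warehouses (less flexible); cloud-native services such as Snowflake and BigQuery decouple them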

SA Talking Points¶

Best for customers with mature, stable data models and heavy SQL BI workloads
Ask: "How much of your data is structured vs. semi-structured or unstructured?" A high unstructured volume is a red flag for a pure warehouse
Ask: "What happens when your data scientist needs the raw data?" Siloed raw stores signal warehouse limitations

Data Lake¶

What It Is¶

A centralized repository that stores data in its raw, native format (files on object storage — S3, ADLS, GCS). Schema is applied at read time (schema-on-read), enabling flexibility at ingestion but complexity at consumption.

Key players: Amazon S3 + AWS Glue, Azure Data Lake Storage, Google Cloud Storage, Hadoop HDFS (legacy)
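
To make schema-on-read concrete, here is a minimal sketch in plain Python. The records and the `read_with_schema` helper are hypothetical; in practice this logic would live in a query engine or a Spark job. It shows the trade-off described above: ingestion accepts anything, and every reader has to decide what to do with rows that do not fit.

```python
import json

# Raw events land in the lake exactly as produced; nothing is validated on write.
raw_lines = [
    '{"user": "a", "amount": "12.50", "ts": "2024-01-01"}',
    '{"user": "b", "amount": 7}',
    '{"user": "c", "total": 3.0}',   # upstream renamed the field (schema drift)
    'not json at all',
]

def read_with_schema(lines):
    """Apply a schema at read time: keep rows that fit, collect the rest."""
    good, bad = [], []
    for line in lines:
        try:
            record = json.loads(line)
            good.append({"user": str(record["user"]),
                         "amount": float(record["amount"])})
        except (json.JSONDecodeError, KeyError, TypeError, ValueError):
            bad.append(line)
    return good, bad

good, bad = read_with_schema(raw_lines)
```

Two of the four rows survive. A second consumer with a different reader could reach a different answer from the same files, which is the consumption-side complexity in question.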

Why Customers Build It¶

Cheap, elastic storage for all data types
Raw data retained for reprocessing as needs evolve
Single source of truth for data science and ML workloads

Limitations¶

"Data swamp" problem — without governance, becomes unusable
No ACID transactions — data consistency is the consumer's problem
Poor BI performance without a serving layer on top
Maintenance burden: small files problem, compaction, schema drift
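
The small files problem and compaction can be sketched with nothing more than the standard library. This is a toy under stated assumptions: the `part-*.jsonl` naming and the `compact` helper are hypothetical, real lakes usually hold Parquet or ORC, and production compaction is run by the engine or table format and has to be safe against concurrent readers, which this version is not.

```python
import tempfile
from pathlib import Path

def compact(directory: Path, target_name: str = "compacted.jsonl") -> Path:
    """Merge every small part file into one larger file, then delete the parts."""
    parts = sorted(directory.glob("part-*.jsonl"))
    target = directory / target_name
    with target.open("w", encoding="utf-8") as out:
        for part in parts:
            out.write(part.read_text(encoding="utf-8"))
    for part in parts:
        part.unlink()
    return target

with tempfile.TemporaryDirectory() as tmp:
    lake = Path(tmp)
    # Simulate a streaming job that wrote one tiny file per event.
    for i in range(100):
        (lake / f"part-{i:04d}.jsonl").write_text(f'{{"id": {i}}}\n', encoding="utf-8")
    files_before = len(list(lake.iterdir()))
    compacted = compact(lake)
    files_after = len(list(lake.iterdir()))
    rows_after = len(compacted.read_text(encoding="utf-8").splitlines())
```

The row count is unchanged while the file count drops from 100 to 1. On object storage, fewer and larger files generally mean fewer list and open calls per query.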

SA Talking Points¶

Most customers have a data lake whether they planned one or not (S3/ADLS buckets accumulating data)
The question is not "should we build a lake" but "how do we make what we have usable"
A lake without a catalog is just a file system with a cloud bill

Lakehouse¶

What It Is¶

An architecture that combines the low-cost, flexible storage of a data lake with the ACID transactions, schema enforcement, and performance optimizations of a data warehouse — typically built on open table formats (Delta Lake, Apache Iceberg, Apache Hudi) sitting on top of object storage.

Key players: Databricks (Delta Lake), Apache Iceberg on any cloud, Snowflake (Iceberg tables)

Architecture¶

flowchart LR
    subgraph Ingestion
        CDC[CDC / Streaming]
        BATCH[Batch / ELT]
    end
    subgraph Storage["Open Table Storage (S3 / ADLS / GCS)"]
        BRONZE[Bronze\nRaw]
        SILVER[Silver\nCleansed]
        GOLD[Gold\nAggregated]
    end
    subgraph Serving
        SQL[SQL Analytics\nDatabricks SQL / BigQuery]
        ML[ML / AI\nNotebooks / MLflow]
        BI[BI Tools\nTableau / Power BI]
    end
    CDC & BATCH --> BRONZE --> SILVER --> GOLD
    GOLD --> SQL & ML & BI
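
The Bronze, Silver, and Gold flow in the diagram can be sketched in a few lines of plain Python. The order records and the `to_silver` / `to_gold` helpers are hypothetical, and a real pipeline would read and write tables in an open table format rather than in-memory lists; the sketch only shows what each layer is responsible for.

```python
from collections import defaultdict

# Bronze: raw records kept as ingested, including bad and duplicate rows.
bronze = [
    {"order_id": "1", "region": "EU", "amount": "10.0"},
    {"order_id": "1", "region": "EU", "amount": "10.0"},   # duplicate delivery
    {"order_id": "2", "region": "us", "amount": "5.5"},
    {"order_id": "3", "region": "US", "amount": "oops"},   # unparseable amount
]

def to_silver(rows):
    """Cleanse: cast types, normalize values, drop duplicates and invalid rows."""
    seen, out = set(), []
    for row in rows:
        try:
            clean = {"order_id": int(row["order_id"]),
                     "region": row["region"].upper(),
                     "amount": float(row["amount"])}
        except (KeyError, ValueError):
            continue
        if clean["order_id"] not in seen:
            seen.add(clean["order_id"])
            out.append(clean)
    return out

def to_gold(rows):
    """Aggregate: revenue per region, ready for BI tools."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)

silver = to_silver(bronze)
gold = to_gold(silver)
```

Because Bronze is never modified, the Silver and Gold steps can be rerun with new rules later, which is the reprocessing benefit carried over from the data lake.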

Why It Matters¶

Single copy of data — no need to ETL data out to a separate warehouse for BI and a separate data lake for ML
Open formats — data is not locked to a vendor; multiple engines can read the same table
ACID on object storage — time travel, schema evolution, and concurrent writes without the warehouse price tag
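
One way to picture how ACID and time travel can work on top of immutable files is a log of snapshots with optimistic concurrency. The `ToyTable` class below is a deliberately simplified, hypothetical model; it is not how Delta Lake, Iceberg, or Hudi are implemented, and it ignores durability, schema, and data files entirely.

```python
from dataclasses import dataclass, field

@dataclass
class ToyTable:
    """Toy log-based table: each commit is an immutable snapshot of file names."""
    snapshots: list = field(default_factory=lambda: [frozenset()])

    def commit(self, add=(), remove=(), expected_version=None):
        """Publish a new snapshot; reject writers that raced with another commit."""
        current = len(self.snapshots) - 1
        if expected_version is not None and expected_version != current:
            raise RuntimeError(
                f"conflict: table is at v{current}, writer read v{expected_version}")
        self.snapshots.append(
            (self.snapshots[-1] - frozenset(remove)) | frozenset(add))
        return current + 1

    def files(self, version=None):
        """Read the latest snapshot, or an older one (time travel)."""
        return self.snapshots[-1 if version is None else version]

table = ToyTable()
v1 = table.commit(add={"a.parquet", "b.parquet"})
v2 = table.commit(add={"c.parquet"}, remove={"a.parquet"}, expected_version=v1)

# A second writer that still believes the table is at v1 is rejected.
try:
    table.commit(add={"d.parquet"}, expected_version=v1)
    conflict = False
except RuntimeError:
    conflict = True
```

Readers always see a complete snapshot, `table.files(v1)` still returns the older file set, and the stale writer has to retry against the current version instead of silently overwriting it.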

SA Talking Points¶

Position as "the warehouse and the lake, unified" — resonates with customers tired of managing two systems
Open formats (Delta/Iceberg) are the key differentiator vs. proprietary warehouses: the data stays readable by other engines even if the customer changes vendors
Ask: "Do your data scientists work on the same data your BI team uses, or do they have their own copy?" — two copies = lakehouse conversation

Data Mesh¶

What It Is¶

A sociotechnical architecture — not a technology — that decentralizes data ownership to domain teams. Each domain (e.g. Sales, Finance, Logistics) owns, produces, and publishes its data as a product. A central platform team provides the infrastructure (the "data platform as a product").

Four principles (Zhamak Dehghani):

1. Domain-oriented ownership
2. Data as a product
3. Self-serve data infrastructure
4. Federated computational governance
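
"Data as a product" and "federated computational governance" are easier to discuss with a concrete shape in hand. The sketch below is one hypothetical way to express them: the `DataProduct` fields and the rules in `governance_violations` are assumptions chosen for illustration, not part of any standard or vendor API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DataProduct:
    """Hypothetical descriptor a domain team publishes alongside its data."""
    name: str
    owner_domain: str
    schema: dict                 # column name -> type name
    freshness_sla_hours: int
    contains_pii: bool = False
    access_policy: str = ""

def governance_violations(product: DataProduct) -> list:
    """Global rules, agreed centrally and checked automatically for every domain."""
    problems = []
    if not product.owner_domain:
        problems.append("missing owner")
    if not product.schema:
        problems.append("missing schema")
    if product.freshness_sla_hours <= 0:
        problems.append("missing freshness SLA")
    if product.contains_pii and not product.access_policy:
        problems.append("PII without access policy")
    return problems

orders = DataProduct("orders", "Sales",
                     {"order_id": "int", "email": "string"},
                     freshness_sla_hours=24, contains_pii=True)
violations = governance_violations(orders)
```

The domain team owns the descriptor and the data behind it; the platform team owns only the check. That split is the operating model, and the code is the smallest part of it.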

When It Makes Sense¶

Large organizations with many distinct business domains
Central data team is a bottleneck: SLAs are routinely missed and backlogs are months long
Domains have the engineering capacity to own their data pipelines

Limitations¶

High organizational maturity required — fails without domain buy-in
Governance is hard: federated means different teams define quality differently
Not a technology you buy — it is an operating model change

SA Talking Points¶

Data mesh is often misunderstood as a product — it is an org model
Ask: "Is your central data team a bottleneck?" — if yes, explore mesh principles even if a full mesh isn't the answer
Databricks Unity Catalog can supply the governance tooling a mesh needs (a shared catalog, access control, lineage), but the culture has to come first

Data Fabric¶

What It Is¶

A design concept where metadata, AI, and automation are used to create a unified, intelligent data management layer across disparate systems — regardless of where data lives. Think of it as a mesh of existing systems connected by an active metadata graph.

Key players: Informatica, IBM, Talend (now part of Qlik), Microsoft Fabric
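
A small sketch may help explain what "active" metadata means in practice. The asset names, systems, and the `propagate_tag` helper below are hypothetical and do not correspond to any vendor's product; the idea shown is that metadata harvested from separate systems is linked into a graph, and automation acts on that graph rather than on the data itself.

```python
from collections import deque

# Hypothetical metadata harvested from four separate systems.
assets = {
    "crm.contacts":       {"system": "Postgres",  "tags": {"pii"}},
    "lake.contacts_raw":  {"system": "S3",        "tags": set()},
    "dw.customer_dim":    {"system": "Snowflake", "tags": set()},
    "bi.churn_dashboard": {"system": "Tableau",   "tags": set()},
}
# Lineage edges: upstream asset -> downstream assets.
lineage = {
    "crm.contacts": ["lake.contacts_raw"],
    "lake.contacts_raw": ["dw.customer_dim"],
    "dw.customer_dim": ["bi.churn_dashboard"],
}

def propagate_tag(tag):
    """Push a tag from every tagged asset to everything downstream of it."""
    queue = deque(name for name, meta in assets.items() if tag in meta["tags"])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if tag not in assets[child]["tags"]:
                assets[child]["tags"].add(tag)
                queue.append(child)

propagate_tag("pii")
pii_assets = sorted(name for name, meta in assets.items() if "pii" in meta["tags"])
```

Only the source table was tagged by a person; the other three assets inherit the tag through lineage. No data moved and no team was reorganized, which is the core of the fabric pitch.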

How It Differs from Data Mesh¶

Dimension	Data Mesh	Data Fabric
Focus	Organizational/social	Technology/automation
Ownership	Decentralized to domains	Varies — often centralized
Glue	Operating model	Active metadata + AI
Buy vs. build	Mostly build	Often buy (platform vendors)

SA Talking Points¶

Fabric is appealing to customers who want governance and discoverability without reorganizing teams
Often positioned by incumbent vendors (IBM, Informatica) as an upgrade path
Watch for vendor-specific "data fabric" definitions — it is a marketing term as much as an architecture term

Choosing the Right Pattern¶

Customer Signal	Likely Pattern
Heavy SQL BI, stable schema, limited ML	Data Warehouse
Lots of raw/unstructured data, ML-first	Data Lake
Mix of BI + ML, cloud-native, cost-conscious	Lakehouse
Large org, central team is a bottleneck, domain autonomy needed	Data Mesh
Heterogeneous environments, existing systems to connect, metadata-driven	Data Fabric

SA Rule of Thumb: Start with the lakehouse, which as a working heuristic covers roughly 80% of enterprise use cases. Only introduce mesh or fabric when the organizational or integration complexity clearly demands it.
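
For discovery-call prep, the signal table and rule of thumb can be written down as a crude scoring helper. The signal keywords are hypothetical paraphrases of the table rows, and simple overlap counting is an assumption made for illustration; it is a conversation aid, not a substitute for judgment.

```python
# Signals paraphrased from the table above; the keyword names are hypothetical.
PATTERN_SIGNALS = {
    "Data Warehouse": {"heavy_sql_bi", "stable_schema", "limited_ml"},
    "Data Lake":      {"raw_unstructured_data", "ml_first"},
    "Lakehouse":      {"bi_and_ml_mix", "cloud_native", "cost_conscious"},
    "Data Mesh":      {"large_org", "central_team_bottleneck", "domain_autonomy"},
    "Data Fabric":    {"heterogeneous_systems", "existing_systems_to_connect",
                       "metadata_driven"},
}

def suggest_pattern(observed: set) -> str:
    """Pick the pattern whose signals overlap most with what was observed.

    Falls back to the lakehouse when nothing matches or when there is a tie,
    mirroring the rule of thumb.
    """
    scores = {name: len(signals & observed)
              for name, signals in PATTERN_SIGNALS.items()}
    best = max(scores.values())
    winners = [name for name, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return "Lakehouse"
    return winners[0]
```

For example, `suggest_pattern({"heavy_sql_bi", "stable_schema"})` returns `"Data Warehouse"`, while an empty or ambiguous set of signals falls back to `"Lakehouse"`.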