
Snowflake Architecture and Core Components

Section at a Glance

What you'll learn:

  • The fundamental "Multi-cluster Shared Data" architecture of Snowflake.
  • How the separation of Storage, Compute, and Cloud Services drives business value.
  • The mechanics of Micro-partitions and their impact on query performance.
  • The difference between Scaling Up (Vertical) and Scaling Out (Horizontal).
  • How to architect for cost-efficiency and workload isolation.

Key terms: Multi-cluster Shared Data · Micro-partitions · Virtual Warehouse · Cloud Services · Scaling Up · Scaling Out

TL;DR: Snowflake uses a unique architecture that decouples storage from compute, allowing you to scale processing power independently of data volume, ensuring that heavy ETL processes never compete with critical BI reporting.


Overview

In traditional Data Warehouse architectures (like legacy on-premise appliances or early-generation cloud MPP), compute and storage were tightly coupled. If you needed more processing power to handle a heavy end-of-month report, you often had to add more nodes, which meant also paying for more storage you didn't need. This "monolithic" scaling creates a massive business bottleneck: you are forced to over-provision for peak loads, leading to wasted capital.

Snowflake solves this via a Multi-cluster Shared Data architecture. By decoupling these layers, Snowflake allows a business to scale storage infinitely and cheaply on cloud object storage (like AWS S3), while spinning up or down "Virtual Warehouses" (compute) only when needed.

For a Data Engineer, this means the architecture is no longer a constraint to design around. For the business, it means cost transparency and the ability to provide "Performance on Demand." This section provides the foundational knowledge required to understand how Snowflake manages data persistence, query execution, and metadata-driven optimization.


Core Concepts

1. The Three Layers

Snowflake’s architecture is strictly divided into three distinct layers:

  • Database Storage (The Foundation): When data is loaded into Snowflake, it is reorganized into a proprietary, optimized, columnar format. This layer is not just a "dump" of files; it is a highly structured, compressed, and immutable collection of Micro-partitions.

    > 📌 Must Know: Micro-partitions are the fundamental unit of storage in Snowflake. They are immutable, meaning they cannot be changed; instead, updates create new versions of partitions. This is the "secret sauce" behind Snowflake's Time Travel and Zero-Copy Cloning features.

  • Query Processing (The Muscle): This is where the Virtual Warehouses live. These are clusters of compute resources, and each warehouse is independent. You can have one warehouse dedicated to "Data Ingestion" and another for "Executive Dashboards."

    > ⚠️ Warning: Because warehouses are independent, they do not share compute resources. If your Ingestion warehouse is running at 100% CPU, your BI warehouse remains completely unaffected. However, if both warehouses attempt to write to the same table simultaneously, Snowflake's Cloud Services layer manages the concurrency and locking.

  • Cloud Services (The Brain): This layer coordinates the entire system. It handles authentication, security, metadata management, query parsing, and the Query Optimizer. It is the "traffic cop" that ensures every query knows exactly which micro-partitions to access.

2. Micro-partitions and Metadata

Unlike traditional databases that rely on indexes (which require manual maintenance), Snowflake uses Micro-partitions.

  • Columnar Format: Data is stored by column, allowing the engine to skip columns not needed for a query (columnar pruning).
  • Metadata-Driven: The Cloud Services layer maintains metadata about the range of values (Min/Max) within every micro-partition.

> 💡 Tip: When a query includes a WHERE clause, Snowflake uses this metadata to "prune" (ignore) micro-partitions that don't contain the relevant data. This is why Snowflake is fast even without manual indexing; the query sketch below shows the idea.
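To make pruning concrete, here is a minimal sketch. The SALES table, its columns, and the date filter are hypothetical; the point is that a selective filter on a well-ordered column lets Snowflake skip micro-partitions, which you can verify in the query profile ("Partitions scanned" vs. "Partitions total").

-- Hypothetical table and columns, shown only to illustrate pruning.
-- The Min/Max metadata on SALE_DATE lets Snowflake skip micro-partitions
-- that cannot contain January 2024 rows, so only a fraction of the table is scanned.
SELECT region,
       SUM(amount) AS total_sales
FROM   sales
WHERE  sale_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY region;

-- Optional: check how well the table is physically ordered on the filter column.
SELECT SYSTEM$CLUSTERING_INFORMATION('SALES', '(SALE_DATE)');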


Architecture / How It Works

graph TD
    subgraph "Cloud Services Layer (The Brain)"
        CS[Metadata / Security / Query Optimization]
    end

    subgraph "Query Processing Layer (The Muscle)"
        VW1[Virtual Warehouse: ETL]
        VW2[Virtual Warehouse: BI/Reporting]
        VW3[Virtual Warehouse: Data Science]
    end

    subgraph "Database Storage Layer (The Foundation)"
        S3[(Cloud Object Storage: S3/Azure/GCS)]
        MP[Micro-partitions: Columnar & Compressed]
    end

    VW1 --> CS
    VW2 --> CS
    VW3 --> CS
    VW1 --> S3
    VW2 --> S3
    VW3 --> S3
    S3 --- MP
  1. Cloud Services Layer: Processes the SQL, checks permissions, and consults metadata to identify which files to read.
  2. Virtual Warehouses: Execute the actual heavy lifting (joins, aggregations, scans) using the compute resources provided.
  3. Database Storage: Holds the physical, compressed micro-partitions in highly durable cloud object storage.

Comparison: When to Use What

| Option | Best For | Trade-offs | Approx. Cost Signal |
| --- | --- | --- | --- |
| Scaling Up (Resize) | Complex queries with massive joins/shuffles. | Does not increase concurrency; only helps single-query speed. | Increases cost per second (higher T-shirt size). |
| Scaling Out (Multi-cluster) | High concurrency (many users running small queries). | Does not make a single query faster; prevents queuing. | Increases cost only when additional clusters are active. |
| Single Warehouse | Development, testing, or low-priority ad-hoc tasks. | Subject to queuing if multiple users hit it at once. | Lowest cost (minimal resource consumption). |

How to choose: If your query is "running slow" (long duration), Scale Up. If your users are "waiting for a query to start" (high queuing), Scale Out.
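As a sketch of the two levers (the warehouse name BI_WH is hypothetical, and multi-cluster warehouses require Enterprise Edition or higher):

-- Scaling UP: resize so a single heavy query gets more compute and memory.
ALTER WAREHOUSE BI_WH SET WAREHOUSE_SIZE = 'LARGE';

-- Scaling OUT: allow extra clusters to spin up under concurrency pressure
-- and spin back down when the queue clears.
ALTER WAREHOUSE BI_WH SET
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY = 'STANDARD';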


Cost Cheat Sheet

| Scenario | Recommended Option | Key Cost Driver | Watch Out For |
| --- | --- | --- | --- |
| Continuous Data Ingestion | Small warehouse (X-Small) running at high frequency. | Warehouse uptime (minutes). | 💰 Cost Note: Forgetting to set AUTO_SUSPEND can leave a warehouse running 24/7, draining credits. |
| Large-scale Daily ETL | Large/X-Large Warehouse. | Compute duration and warehouse size. | Using a massive warehouse for a tiny dataset; you pay for the "size" even if it finishes in 1 minute. |
| Concurrent BI Users | Multi-cluster Warehouse (Auto-scale). | Number of active clusters. | Setting MAX_CLUSTER_COUNT too high, causing "cluster sprawl" during peak hours. |
| Data Science/ML | Large Warehouse (for memory-intensive joins). | Memory/compute availability. | Long-running queries that prevent the warehouse from auto-suspending. |

💰 Cost Note: The single biggest cost mistake is failing to configure the Auto-Suspend setting. In Snowflake, you are billed by the second (with a 1-minute minimum). A warehouse left "idle" but active can incur thousands of dollars in unnecessary costs over a month.
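One way to catch idle burn early is to review per-warehouse credit consumption. This is a minimal sketch assuming access to the SNOWFLAKE.ACCOUNT_USAGE share (its views can lag real time by a few hours):

-- Credits consumed per warehouse over the last 7 days, highest spenders first.
SELECT warehouse_name,
       SUM(credits_used) AS credits_last_7_days
FROM   snowflake.account_usage.warehouse_metering_history
WHERE  start_time >= DATEADD('day', -7, CURRENT_TIMESTAMP())
GROUP BY warehouse_name
ORDER BY credits_last_7_days DESC;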


Service & Tool Integrations

  1. Cloud Object Stores (S3/Azure/GCS):
    • Used as the "External Stage" for data ingestion (see the stage sketch after this list).
    • Snowflake acts as a compute engine that reads directly from these buckets.
  2. BI Tools (Tableau, PowerBI, Looker):
    • Connect via standard ODBC/JDBC drivers.
    • The integration pattern relies on a dedicated "BI Warehouse" to ensure dashboard refreshes don't impact ETL.
  3. Data Pipelines (dbt, Airflow):
    • Orchestrate the execution of SQL transformations.
    • Pattern: Airflow triggers a Snowflake task $\rightarrow$ Snowflake Warehouse spins up $\rightarrow$ SQL runs $\rightarrow$ Warehouse auto-suspends.
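A minimal sketch of the external-stage pattern from item 1; the bucket, storage integration, table, and file layout are all hypothetical:

-- Hypothetical bucket, integration, and table names; adjust to your environment.
CREATE STAGE raw_events_stage
  URL = 's3://my-company-raw-events/'
  STORAGE_INTEGRATION = my_s3_integration
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- Load the staged files into a table; a running warehouse supplies the compute.
COPY INTO raw_events
FROM @raw_events_stage
PATTERN = '.*events_2024.*[.]csv';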

Security Considerations

Snowflake provides "Security by Default." Encryption is always on, and the Cloud Services layer manages the keys.

| Control | Default State | How to Enable / Strengthen |
| --- | --- | --- |
| Encryption at Rest | Always Enabled | Managed by Snowflake (AES-256); no user action required. |
| Encryption in Transit | Always Enabled | Uses TLS 1.2+ for all connections. |
| Network Isolation | Open to Snowflake endpoints | Use Network Policies to restrict access to specific IP ranges or VPCs (see the sketch below). |
| Access Control | Role-Based (RBAC) | Implement the principle of least privilege by creating custom functional roles. |
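A sketch of tightening the two controls you own; the IP range, policy name, role, and object names are hypothetical:

-- Restrict logins to a corporate IP range with a network policy.
CREATE NETWORK POLICY corp_only_policy
  ALLOWED_IP_LIST = ('203.0.113.0/24');

ALTER ACCOUNT SET NETWORK_POLICY = corp_only_policy;

-- Least-privilege RBAC: a functional role limited to read-only access on one schema.
CREATE ROLE analyst_read_only;
GRANT USAGE ON DATABASE analytics TO ROLE analyst_read_only;
GRANT USAGE ON SCHEMA analytics.reporting TO ROLE analyst_read_only;
GRANT SELECT ON ALL TABLES IN SCHEMA analytics.reporting TO ROLE analyst_read_only;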

Performance & Cost

The Tuning Trade-off: In Snowflake, performance tuning is often a trade-off between Latency and Credit Consumption.

  • Example Scenario:
    • Task: A massive monthly aggregation of 10TB of sales data.
    • Option A (X-Small Warehouse): Takes 10 hours. Cost: 10 hours $\times$ 1 credit/hr = 10 Credits.
    • Option B (Large Warehouse): Takes 30 minutes. Cost: 0.5 hours $\times$ 8 credits/hr = 4 Credits.
    • Result: In this case, scaling up actually saved the company money because the work finished much faster, reducing total "warehouse uptime." Memory-heavy jobs often finish disproportionately faster on a bigger warehouse because they avoid spilling to disk.

Common Bottlenecks:

  • Data Spilling: When a warehouse runs out of RAM, it "spills" to local disk (fast) or remote storage (very slow). This is a signal you need to Scale Up (the query below shows one way to detect it).
  • Queueing: When many queries hit one warehouse and exceed its capacity. This is a signal you need to Scale Out.
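To spot spilling in practice, this sketch queries ACCOUNT_USAGE.QUERY_HISTORY (again assuming access to the SNOWFLAKE share); remote spilling is the strongest signal that a warehouse is undersized:

-- Recent queries that spilled to remote storage, worst offenders first.
SELECT query_id,
       warehouse_name,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage,
       total_elapsed_time / 1000 AS elapsed_seconds
FROM   snowflake.account_usage.query_history
WHERE  start_time >= DATEADD('day', -1, CURRENT_TIMESTAMP())
  AND  bytes_spilled_to_remote_storage > 0
ORDER BY bytes_spilled_to_remote_storage DESC
LIMIT 20;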


Hands-On: Key Operations

First, let's create a warehouse specifically for our ingestion workload.

-- Create a dedicated warehouse for ETL with auto-suspend enabled to save costs.
CREATE WAREHOUSE ETL_WH 
WITH WAREHOUSE_SIZE = 'XSMALL' 
AUTO_SUSPEND = 60 
AUTO_RESUME = TRUE 
INITIALLY_SUSPENDED = TRUE;

💡 Tip: Always set INITIALLY_SUSPENDED = TRUE for new warehouses used for specialized tasks to ensure you aren't burning credits immediately upon creation.

Next, let's see how we scale a warehouse vertically to handle a larger workload.

-- Scaling UP: Increasing the size of the warehouse to handle a heavy transformation.
ALTER WAREHOUSE ETL_WH SET WAREHOUSE_SIZE = 'LARGE';

Then, let's check the status of our warehouses to verify the change.

-- Retrieve the current state of all warehouses in the account.
SHOW WAREHOUSES;
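Finally, a sketch of a cost guardrail for this warehouse. The quota and monitor name are illustrative, and creating resource monitors requires the ACCOUNTADMIN role.

-- Cap monthly spend: notify at 80% of the quota, suspend the warehouse at 100%.
CREATE RESOURCE MONITOR etl_monthly_monitor
  WITH CREDIT_QUOTA = 100
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS ON 80 PERCENT DO NOTIFY
           ON 100 PERCENT DO SUSPEND;

-- Attach the monitor to the warehouse created above.
ALTER WAREHOUSE ETL_WH SET RESOURCE_MONITOR = etl_monthly_monitor;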


Customer Conversation Angles

Q: "We have a huge ETL job and a heavy BI dashboard. If we run them both at the same time, will the ETL job slow down our executives' dashboards?" A: Not if you architect it correctly. You should use separate Virtual Warehouses for ETL and BI; since compute is decoupled, they will run in complete isolation.

Q: "Is Snowflake's storage more expensive than just leaving files in an S3 bucket?" A: While Snowflake storage has a small markup for management, the cost is largely the same as S3. The real value is that the data is stored in a highly compressed, optimized format that significantly reduces the compute costs needed to query it.

Q: "If I scale my warehouse to a 'Large' size, will my existing queries be interrupted?" A: No. When you scale up, Snowflake creates a new cluster of the new size. Existing queries continue on the old cluster, and new queries are routed to the new, larger cluster.

Q: "How do I prevent my team from accidentally running up a massive bill with huge warehouses?"
A: You can implement Resource Monitors at the account or warehouse level to automatically send alerts or even shut down warehouses when they hit a pre-defined credit quota.

Q: "We have many users running small queries. Should we use a larger warehouse or more clusters?"
A: You should use Multi-cluster scaling (Scaling Out). A larger warehouse helps one big query run faster, but more clusters allow more users to run queries simultaneously without waiting in a queue.


Common FAQs and Misconceptions

Q: Does Snowflake use indexes like SQL Server or Oracle? A: No. Snowflake uses micro-partition metadata and columnar pruning.

⚠️ Warning: Thinking you need to "create indexes" is a common pitfall that leads to wasted effort; focus instead on cluster keys and partition pruning.

Q: Can I delete data and immediately reclaim the storage cost? A: Not immediately. Because of "Time Travel," the data persists for the duration of your retention period (default 1 day, up to 90).

Q: Is Snowflake a 'Serverless' database? A: It is "Serverless-style" in management (no servers to patch), but you are still managing "Virtual Warehouses" (compute resources).

Q: Does scaling a warehouse change the underlying data? A: No. The storage layer remains untouched; only the compute power accessing it changes.

Q: Can I use Snowflake for unstructured data like PDFs or Images? A: Yes. Snowflake can store and manage metadata for unstructured data, though the "processing" happens via specialized functions or external integration.

Q: If I stop my warehouse, does my data disappear? A: Absolutely not. The data lives in the permanent Storage Layer, which is independent of the compute layer.


Exam & Certification Focus

  • The Three Layers: You must be able to identify which component handles which task (e.g., "Which layer handles query optimization?" $\rightarrow$ Cloud Services). 📌 High Frequency
  • Scaling Up vs. Scaling Out: Expect questions distinguishing between vertical scaling (size) and horizontal scaling (multi-cluster). 📌 High Frequency
  • Micro-partitioning: Understand the concept of immutability and metadata-driven pruning.
  • Warehouse Management: Knowledge of AUTO_SUSPEND and AUTO_RESUME is critical for both exam and real-world cost management.
  • Multi-cluster Shared Data: The fundamental definition of the architecture is a core exam topic.

Quick Recap

  • Snowflake utilizes a Multi-cluster Shared Data architecture.
  • Storage, Compute, and Cloud Services are completely decoupled.
  • Micro-partitions provide high performance through columnar storage and metadata pruning.
  • Scaling Up (Vertical) improves single-query performance; Scaling Out (Horizontal) improves concurrency.
  • Cost Management depends heavily on managing Warehouse uptime via AUTO_SUSPEND.

Further Reading

  • Snowflake Documentation — The primary source of truth for all architectural details and SQL syntax.
  • Snowflake Whitepaper: Architecture — A deep dive into the technical specifics of the multi-cluster shared data design.
  • Snowflake Best Practices: Performance — Guidance on using micro-partitions and warehouse sizing effectively.
  • Snowflake Best Practices: Cost Management — Essential reading for managing credit consumption and resource monitors.
  • Snowflake Security Documentation — Detailed breakdown of encryption, RBAC, and network security.