Introduction to Databricks Architecture on AWS — SA Quick Reference¶
What It Is¶
A "Lakehouse" architecture that combines the performance of a data warehouse with the low cost of a data lake. It runs high-performance processing directly on your existing Amazon S3 data without needing to move or duplicate it.
Why Customers Care¶
- Eliminate Data Silos: Stop maintaining separate, expensive copies of data for BI and Machine Learning.
- Reduced Operational Tax: No more complex ETL pipelines just to move data from a "landing" zone to a "reporting" zone.
- Enhanced Security & Sovereignty: Your actual business data stays within your AWS VPC and S3 buckets, under your direct control.
Key Differentiators vs Alternatives¶
- Decoupled Architecture: Separates the "brains" (Databricks Control Plane) from the "muscle" (your AWS Data Plane) for maximum security.
- Unified Governance: Unity Catalog replaces messy, folder-level S3 permissions with simple, SQL-based object-level access control.
- Single Source of Truth: Delta Lake adds ACID transactions and "Time Travel" to S3, giving your raw files the reliability of a high-end warehouse.
When to Recommend It¶
Target customers currently struggling with "architectural tax"—specifically those managing separate AWS Redshift (for BI) and S3/EMR (for ML) environments. It is ideal for organizations migrating from AWS Glue or EMR who are ready to move from fragmented data silos to a unified, governed Lakehouse.
Top 3 Objections & Responses¶
"Is my data being moved into Databricks' cloud?" → No. Databricks manages the orchestration (Control Plane), but the actual data remains in your S3 buckets within your AWS account.
"We already have Amazon Redshift for our BI workloads." → You can use both, but Databricks allows you to run BI directly on your S3 data, eliminating the cost and complexity of moving data into a separate warehouse.
"Managing permissions in S3 is a headache for us."
→ That's exactly why we use Unity Catalog; it moves you away from complex IAM/folder-level policies to simple, centralized SQL commands like GRANT SELECT.
5 Things to Know Before the Call¶
- Data Sovereignty: The "Data Plane" (EC2 and S3) lives in the customer's AWS account, not Databricks'.
- Cost Optimization: Advise customers to use "Job Clusters" for automated production to significantly lower compute costs compared to "All-Purpose Clusters."
- The "Brain" vs. "Muscle": The Control Plane handles the UI and metadata; the Data Plane handles the heavy lifting of Spark processing.
- Delta Lake is Key: The magic happens because Delta Lake adds reliability (ACID, schema enforcement) to standard S3 files.
- Governance Shift: Moving to Databricks means moving from managing "files and folders" to managing "tables and rows."
Competitive Snapshot¶
| vs | Advantage |
|---|---|
| Amazon Redshift | Avoids the "Data Warehouse vs. Data Lake" tax by using one architecture for both BI and ML. |
| AWS Glue / EMR | Provides a unified, user-friendly interface with integrated governance (Unity Catalog) and easier scaling. |
Source: Introduction to Databricks Architecture on AWS course section