Skip to content

Introduction to Databricks Architecture on AWS — SA Quick Reference

What It Is

A "Lakehouse" architecture that combines the performance of a data warehouse with the low cost of a data lake. It runs high-performance processing directly on your existing Amazon S3 data without needing to move or duplicate it.

Why Customers Care

  • Eliminate Data Silos: Stop maintaining separate, expensive copies of data for BI and Machine Learning.
  • Reduced Operational Tax: No more complex ETL pipelines just to move data from a "landing" zone to a "reporting" zone.
  • Enhanced Security & Sovereignty: Your actual business data stays within your AWS VPC and S3 buckets, under your direct control.

Key Differentiators vs Alternatives

  • Decoupled Architecture: Separates the "brains" (Databricks Control Plane) from the "muscle" (your AWS Data Plane) for maximum security.
  • Unified Governance: Unity Catalog replaces messy, folder-level S3 permissions with simple, SQL-based object-level access control.
  • Single Source of Truth: Delta Lake adds ACID transactions and "Time Travel" to S3, giving your raw files the reliability of a high-end warehouse.

When to Recommend It

Target customers currently struggling with "architectural tax"—specifically those managing separate AWS Redshift (for BI) and S3/EMR (for ML) environments. It is ideal for organizations migrating from AWS Glue or EMR who are ready to move from fragmented data silos to a unified, governed Lakehouse.

Top 3 Objections & Responses

"Is my data being moved into Databricks' cloud?" → No. Databricks manages the orchestration (Control Plane), but the actual data remains in your S3 buckets within your AWS account.

"We already have Amazon Redshift for our BI workloads." → You can use both, but Databricks allows you to run BI directly on your S3 data, eliminating the cost and complexity of moving data into a separate warehouse.

"Managing permissions in S3 is a headache for us." → That's exactly why we use Unity Catalog; it moves you away from complex IAM/folder-level policies to simple, centralized SQL commands like GRANT SELECT.

5 Things to Know Before the Call

  1. Data Sovereignty: The "Data Plane" (EC2 and S3) lives in the customer's AWS account, not Databricks'.
  2. Cost Optimization: Advise customers to use "Job Clusters" for automated production to significantly lower compute costs compared to "All-Purpose Clusters."
  3. The "Brain" vs. "Muscle": The Control Plane handles the UI and metadata; the Data Plane handles the heavy lifting of Spark processing.
  4. Delta Lake is Key: The magic happens because Delta Lake adds reliability (ACID, schema enforcement) to standard S3 files.
  5. Governance Shift: Moving to Databricks means moving from managing "files and folders" to managing "tables and rows."

Competitive Snapshot

vs Advantage
Amazon Redshift Avoids the "Data Warehouse vs. Data Lake" tax by using one architecture for both BI and ML.
AWS Glue / EMR Provides a unified, user-friendly interface with integrated governance (Unity Catalog) and easier scaling.

Source: Introduction to Databricks Architecture on AWS course section