Data Governance & Cataloging
Reference for SAs discussing data governance, metadata management, and access control with customers. Focuses on the "why it matters" and how to position governance as a business enabler, not a compliance checkbox.
Why Governance Matters (The SA Pitch)
Without governance, data platforms fail in predictable ways:
- Analysts can't find data: they don't know what exists or if it's trustworthy
- Security teams can't audit access: who can see PII and who actually did
- Compliance teams can't demonstrate lineage: where did this number come from?
- Platform costs balloon: duplicate datasets because teams don't know what already exists
The business impact: Governance is not about restricting data — it is about making data usable at scale. The larger the organization, the more critical this becomes.
Governance Architecture
flowchart TD
    subgraph Catalog["Metadata Catalog"]
        DISC[Discovery\nSearch & Browse]
        LINE[Lineage\nSource → Transformation → Consumer]
        QUAL[Data Quality\nRules & Scores]
        GLOSSARY[Business Glossary\nDefinitions & Owners]
    end
    subgraph Access["Access Control"]
        RBAC[Role-Based Access Control]
        ABAC[Attribute-Based Access Control]
        RLS[Row & Column Level Security]
        MASKING[Dynamic Data Masking]
    end
    subgraph Compliance["Compliance & Audit"]
        AUDIT[Query Audit Logs]
        LINEAGE[Data Lineage for Compliance]
        PII[PII Classification & Tagging]
    end
    DATA[(Data Assets\nTables / Files / Models)] --> Catalog
    Catalog --> Access
    Access --> USERS[Users / Tools / APIs]
    Catalog & Access --> Compliance
Unity Catalog (Databricks)
What It Is
Databricks' unified governance layer for data and AI assets. Provides a single control plane for access control, auditing, lineage, and discovery across all Databricks workspaces.
Three-Level Namespace
catalog.schema.table
│       │      └── Table, view, or volume
│       └── Schema (database equivalent)
└── Catalog (top-level container, often one per environment or business unit)
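To make the three-level name concrete, here is a minimal Python sketch that splits a fully qualified name into its parts. `TableRef` and `parse_table_ref` are illustrative names, not part of any Databricks API, and the sketch assumes plain identifiers with no backtick quoting.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    """Hypothetical holder for the three parts of a fully qualified name."""
    catalog: str
    schema: str
    table: str


def parse_table_ref(name: str) -> TableRef:
    """Split 'catalog.schema.table' into its parts; reject anything else."""
    parts = name.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected catalog.schema.table, got {name!r}")
    return TableRef(*parts)


ref = parse_table_ref("prod.sales.orders")
print(ref.catalog, ref.schema, ref.table)  # prod sales orders
```

A two-part name such as `sales.orders` raises `ValueError` here, which is a useful prompt in customer conversations: legacy two-level references have to be mapped to a catalog during migration.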
Key Capabilities
| Capability | What It Does |
|---|---|
| Unified access control | One policy for tables, views, files (volumes), models, and functions |
| Column-level security | Grant/revoke access to specific columns |
| Row filters | Apply SQL predicates to limit visible rows per user/group |
| Dynamic data masking | Mask PII values (e.g., show ***-**-1234 for SSN) for non-privileged users |
| Automated lineage | Tracks column-level lineage across notebooks, workflows, and SQL queries without instrumentation |
| Data discovery | Search across all catalogs by name, tag, or business glossary term |
| Audit logs | Full query and access audit log delivered to cloud storage |
| Delta Sharing | Securely share live Delta tables across organizations without data movement |
SA Talking Points
- Unity Catalog replaces the legacy per-workspace Hive Metastore — one catalog for all workspaces is the key message
- "How do you currently control who can see PII data?" — the answer reveals governance maturity
- Automated lineage without any code changes is a strong differentiator: many competing tools require manual lineage instrumentation
- Delta Sharing enables B2B data sharing without building custom APIs or copying files
Data Cataloging
What a Catalog Does
A data catalog is the search engine and encyclopedia of your data platform. It makes data findable, understandable, and trustworthy.
Core Catalog Features
| Feature | Purpose |
|---|---|
| Search & Discovery | Find tables, files, dashboards by keyword or tag |
| Technical Metadata | Schema, data types, row count, size, last updated |
| Business Metadata | Descriptions, owners, data domain, sensitivity tags |
| Business Glossary | Canonical definitions of business terms (what is "active customer"?) |
| Data Lineage | Where did this table come from? What consumes it? |
| Data Quality | Quality scores, freshness metrics, rule violations |
| Collaboration | Comments, questions, ratings on data assets |
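As a rough illustration of how search sits on top of business metadata, the sketch below models a catalog entry and a keyword/tag search in plain Python. The `CatalogEntry` fields, the sample entries, and the owner names are all assumptions made for the example; real catalogs index far more metadata and rank results.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical catalog record combining technical and business metadata."""
    name: str
    description: str
    owner: str
    tags: set[str] = field(default_factory=set)


def search(entries: list[CatalogEntry], term: str) -> list[str]:
    """Return names of entries whose name, description, or tags match the term."""
    needle = term.lower()
    return [
        e.name
        for e in entries
        if needle in e.name.lower()
        or needle in e.description.lower()
        or needle in {t.lower() for t in e.tags}
    ]


entries = [
    CatalogEntry("sales.gold.orders", "One row per customer order", "sales-data", {"pii"}),
    CatalogEntry("finance.gold.revenue", "Recognized revenue by month", "finance-data", {"certified"}),
]
print(search(entries, "revenue"))  # ['finance.gold.revenue']
print(search(entries, "PII"))      # ['sales.gold.orders']
```

The point for customers is that search is only as good as the descriptions, owners, and tags people maintain, which is why the business metadata rows in the table matter as much as the technical ones.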
Key Catalog Tools
| Tool | Notes |
|---|---|
| Unity Catalog (Databricks) | Integrated with Databricks platform |
| Microsoft Purview | Azure-native, integrates across Azure services + Power BI |
| Google Dataplex | GCP-native, integrates with BigQuery |
| AWS Glue Data Catalog | AWS-native, used by Athena, EMR, Glue |
| Alation | Best-of-breed third-party, heavy on collaboration features |
| Collibra | Enterprise governance platform, strong in regulated industries |
| Apache Atlas | Open-source, common in Hadoop/Cloudera environments |
SA Talking Points
- "Can any analyst in your company find the data they need without emailing someone?" — if no, a catalog is the answer
- The business glossary is often more valuable than the technical catalog — defining "revenue" consistently across Finance and Sales is a governance problem, not a technology problem
- Platform-native catalogs (Purview, Dataplex, Unity) reduce integration friction; third-party tools (Alation, Collibra) offer richer features at the cost of complexity
Data Lineage
What It Is
A record of how data flows from its source through transformations to its consumers — showing upstream dependencies and downstream impact for every column, table, and report.
Why It Matters
For compliance: "Show me where this customer's personal data comes from and who has accessed it" — lineage answers this.
For impact analysis: "If I change this source table, what dashboards will break?" — lineage answers this.
For trust: "Why does this number differ from last month's report?" — lineage shows which transformation caused it.
Lineage Levels
| Level | What It Shows | Example |
|---|---|---|
| Table lineage | Which tables feed which tables | orders → daily_sales_summary |
| Column lineage | Which source columns feed which target columns | orders.amount → sales_summary.total_revenue |
| Report lineage | Which tables feed which BI reports | daily_sales_summary → Tableau dashboard |
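Impact analysis is essentially a graph walk over lineage edges. The following sketch stores the table-level and report-level examples from the table as a plain dictionary and finds everything downstream of a given asset. The `LINEAGE` structure and the `downstream` function are illustrative assumptions, not how any particular catalog stores lineage.

```python
from collections import deque

# Edges point from a source asset to the assets that consume it.
LINEAGE = {
    "orders": ["daily_sales_summary"],
    "daily_sales_summary": ["Tableau dashboard"],
}


def downstream(asset: str, edges: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every asset that depends on `asset`."""
    seen: set[str] = set()
    order: list[str] = []
    queue = deque(edges.get(asset, []))
    while queue:
        node = queue.popleft()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        queue.extend(edges.get(node, []))
    return order


print(downstream("orders", LINEAGE))
# ['daily_sales_summary', 'Tableau dashboard']
```

Column-level lineage uses the same traversal with `table.column` nodes; the hard part is extracting the edges, not walking them.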
SA Talking Points
- Lineage is almost always a compliance request first ("our auditors need to see data flows") but becomes a productivity tool once teams use it
- Column-level lineage is much harder to implement than table lineage — Unity Catalog provides it automatically for Databricks SQL and notebooks
- Ask: "What happens when a source system changes a column name?" — if the answer is "we find out when dashboards break," lineage is a gap
Access Control Patterns
Role-Based Access Control (RBAC)
Users are assigned to roles, and permissions are granted to roles. Simple, manageable at scale.
SA_Team Role → READ access on sales.gold.*
Finance Role → READ access on finance.gold.*, WRITE access on finance.silver.*
Data Engineer Role → WRITE access on *.bronze.*, *.silver.*
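The role mapping above can be sketched as a small permission check. This is a simplified model for discussion, not a real grant system: the `GRANTS` table and `is_allowed` function are hypothetical, wildcard matching uses Python's `fnmatch`, and WRITE does not imply READ.

```python
from fnmatch import fnmatchcase

# Role -> list of (privilege, object pattern), mirroring the mapping above.
GRANTS = {
    "SA_Team": [("READ", "sales.gold.*")],
    "Finance": [("READ", "finance.gold.*"), ("WRITE", "finance.silver.*")],
    "Data Engineer": [("WRITE", "*.bronze.*"), ("WRITE", "*.silver.*")],
}


def is_allowed(roles: list[str], privilege: str, obj: str) -> bool:
    """True if any of the user's roles holds the privilege on a matching object."""
    return any(
        granted == privilege and fnmatchcase(obj, pattern)
        for role in roles
        for granted, pattern in GRANTS.get(role, [])
    )


print(is_allowed(["Finance"], "READ", "finance.gold.revenue"))            # True
print(is_allowed(["Finance"], "WRITE", "finance.gold.revenue"))           # False
print(is_allowed(["Data Engineer"], "WRITE", "sales.bronze.raw_orders"))  # True
```

Because permissions attach to roles rather than people, onboarding a new analyst is a single role assignment, which is the manageability argument for RBAC.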
Attribute-Based Access Control (ABAC)
Access is granted based on attributes of the user and the data asset — more flexible but more complex.
Users with attribute country=US can see US rows
Users with attribute clearance=PII can see unmasked SSN column
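A minimal way to picture ABAC is a list of policies, each pairing required user attributes with required asset tags. The sketch below assumes a permit-if-any-policy-matches model; the `Policy` class, the tag names, and `abac_allows` are hypothetical, and production engines typically also support deny rules and policy precedence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Hypothetical rule: users holding `user_attrs` may access assets tagged `asset_tags`."""
    user_attrs: dict
    asset_tags: dict


POLICIES = [
    Policy(user_attrs={"country": "US"}, asset_tags={"region": "US"}),
    Policy(user_attrs={"clearance": "PII"}, asset_tags={"sensitivity": "PII"}),
]


def abac_allows(user: dict, asset: dict) -> bool:
    """Allow when some policy's user and asset conditions are all satisfied."""
    return any(
        p.user_attrs.items() <= user.items() and p.asset_tags.items() <= asset.items()
        for p in POLICIES
    )


analyst = {"country": "US", "clearance": "none"}
print(abac_allows(analyst, {"region": "US"}))        # True
print(abac_allows(analyst, {"sensitivity": "PII"}))  # False
```

No role appears anywhere in the check: access follows from attributes alone, which is where both the flexibility and the added complexity come from.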
Row-Level Security (RLS)
A SQL predicate filters visible rows based on the current user's identity or group membership.
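-- Only show rows matching the user's region (Unity Catalog row filter syntax).
-- current_user_region() is a hypothetical helper that maps the caller to a region.
CREATE FUNCTION sales.region_filter(region STRING)
  RETURN region = current_user_region();
ALTER TABLE sales.orders SET ROW FILTER sales.region_filter ON (region);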
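For readers who want to see the effect rather than the DDL, the same idea can be modelled in a few lines of Python: the filter is a predicate evaluated per row against the caller's identity. The sample rows, the `USER_REGION` lookup, and the email addresses are made up for illustration.

```python
ORDERS = [
    {"order_id": 1, "region": "EMEA", "amount": 120.0},
    {"order_id": 2, "region": "AMER", "amount": 75.5},
    {"order_id": 3, "region": "EMEA", "amount": 40.0},
]

# Hypothetical lookup standing in for an identity provider or mapping table.
USER_REGION = {"alice@example.com": "EMEA", "bob@example.com": "AMER"}


def visible_rows(user: str, rows: list[dict]) -> list[dict]:
    """Apply the row filter: keep only rows in the caller's region."""
    region = USER_REGION.get(user)  # unknown users match no rows
    return [r for r in rows if region is not None and r["region"] == region]


print([r["order_id"] for r in visible_rows("alice@example.com", ORDERS)])  # [1, 3]
print(visible_rows("mallory@example.com", ORDERS))                         # []
```

Both users query the same table; the platform applies the predicate, so no application or BI tool needs its own filtering logic.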
Dynamic Data Masking
Sensitive columns are masked at query time based on the user's privileges — the underlying data is unchanged.
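A small sketch of the query-time behavior, assuming a hypothetical `pii_readers` group decides who sees the raw value. The function names are illustrative; in a governed platform the masking function is attached to the column rather than called by the application.

```python
def mask_ssn(ssn: str) -> str:
    """Keep only the last four digits, e.g. ***-**-1234."""
    return "***-**-" + ssn[-4:]


def read_ssn(ssn: str, user_groups: set[str]) -> str:
    """Return the raw value only to members of the (hypothetical) pii_readers group."""
    return ssn if "pii_readers" in user_groups else mask_ssn(ssn)


stored = "123-45-6789"  # the stored value is never modified
print(read_ssn(stored, {"analysts"}))     # ***-**-6789
print(read_ssn(stored, {"pii_readers"}))  # 123-45-6789
```

Masking happens on read, so there is one copy of the data and no separate "scrubbed" dataset to keep in sync.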
SA Talking Points
- RBAC is the right starting point — ABAC adds complexity that most organizations aren't ready for
- RLS and masking are the most common PII governance requests — Unity Catalog implements both without application changes
- "Do your contractors and full-time employees see the same data?" If they do and they shouldn't, RLS is the fix
Data Quality
Key Dimensions of Data Quality
| Dimension | Question It Answers |
|---|---|
| Completeness | Are required fields populated? |
| Accuracy | Does the data match the real-world truth? |
| Consistency | Does the same fact have the same value across systems? |
| Timeliness | Is the data fresh enough for its use case? |
| Uniqueness | Are there duplicate records? |
| Validity | Do values conform to expected formats and ranges? |
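Several of these dimensions reduce to simple ratios over a table. The sketch below computes completeness, uniqueness, and validity on a tiny in-memory sample. The sample rows, the loose email pattern, and the function names are assumptions for illustration; tools such as those listed in the next section express the same checks declaratively.

```python
import re

ROWS = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
    {"id": 2, "email": "not-an-email"},
]

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # deliberately loose pattern


def completeness(rows, column):
    """Share of rows where the column is populated."""
    return sum(r[column] is not None for r in rows) / len(rows)


def uniqueness(rows, column):
    """Share of distinct values among all rows (1.0 means no duplicates)."""
    return len({r[column] for r in rows}) / len(rows)


def validity(rows, column, pattern):
    """Share of populated values that match the expected format."""
    values = [r[column] for r in rows if r[column] is not None]
    if not values:
        return 1.0  # nothing populated, so nothing is invalid
    return sum(bool(pattern.match(v)) for v in values) / len(values)


print(f"completeness(email) = {completeness(ROWS, 'email'):.2f}")        # 0.67
print(f"uniqueness(id)      = {uniqueness(ROWS, 'id'):.2f}")             # 0.67
print(f"validity(email)     = {validity(ROWS, 'email', EMAIL_RE):.2f}")  # 0.50
```

Scores like these are what a catalog surfaces as quality metrics; accuracy, consistency, and timeliness need reference data or timestamps and are not covered by this sketch.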
Data Quality Tools
| Tool | Notes |
|---|---|
| Databricks Expectations (DLT) | Native quality constraints in Delta Live Tables pipelines |
| Great Expectations | Open-source, widely used, generates quality reports |
| dbt Tests | SQL-based quality tests for dbt transformation models |
| Monte Carlo | Observability platform — detects anomalies automatically |
| Soda | Cloud-based quality monitoring, integrates with Slack/PagerDuty alerting |
SA Talking Points
- Data quality problems surface as bad decisions, rework, and lost trust in reports, so frame quality as a business risk, not a technical nicety
- "How do you know when data quality degrades?" — if the answer is "when someone complains," monitoring is missing
- Start with the most business-critical tables (Gold layer) and work backwards — don't try to quality-gate everything at once
SA Rule of Thumb: Governance wins are often organizational, not technical — the hardest part is getting agreement on who owns a definition, not implementing the policy. Start there.