# Data Architect Glossary
Exhaustive reference of terms, acronyms, and concepts across all Data Architecture domains.
## 1. Storage & Architecture Patterns
| Term | Full Form / Description |
|---|---|
| EDW | Enterprise Data Warehouse — centralized repository for integrated, historical data from across the organization, optimized for reporting and analytics |
| ODS | Operational Data Store — near-real-time integration layer that consolidates data from source systems for operational reporting; not a replacement for EDW |
| Data Lake | Storage repository holding vast amounts of raw data in its native format (structured, semi-structured, unstructured) until needed; schema-on-read |
| Data Lakehouse | Architecture combining the low-cost storage of a data lake with the structure and ACID transactions of a data warehouse (e.g., Delta Lake, Apache Iceberg) |
| Data Warehouse | Subject-oriented, integrated, time-variant, non-volatile collection of data in support of management decisions (Inmon's definition) |
| Data Mart | A subset of a data warehouse focused on a single subject area or business unit (e.g., Sales Mart, Finance Mart) |
| Data Hub | Central integration point for data exchange between systems; often includes a metadata registry and routing logic |
| Data Mesh | Decentralized sociotechnical architecture where domain teams own and serve their data as products, with federated governance |
| Data Fabric | Architecture that provides consistent, automated data management across hybrid and multi-cloud environments using metadata and AI |
| Lambda Architecture | Hybrid batch + speed processing architecture; combines a batch layer (high latency, accurate) with a speed layer (low latency, approximate) |
| Kappa Architecture | Simplified streaming-only architecture that eliminates the batch layer; all data processed as streams |
| Medallion Architecture | Layered data organization: Bronze (raw), Silver (cleaned/conformed), Gold (business-level aggregates); popularized by Databricks (see the sketch after this table) |
| Bronze Layer | Raw ingestion layer in medallion architecture; data lands as-is from source systems |
| Silver Layer | Cleaned, deduplicated, and conformed data layer; applies data quality rules and joins |
| Gold Layer | Business-level aggregates and curated datasets ready for BI consumption |
| HTAP | Hybrid Transactional/Analytical Processing — systems that support both OLTP and OLAP workloads simultaneously |
| OLTP | Online Transactional Processing — systems optimized for high-volume, low-latency read/write operations (e.g., order entry systems) |
| MPP | Massively Parallel Processing — distributes query execution across many nodes simultaneously; used in Redshift, Snowflake, BigQuery |
| SMP | Symmetric Multiprocessing — multiple processors sharing a single memory space; limits scalability compared to MPP |
| Shared-Nothing Architecture | Each node in a cluster has its own CPU, memory, and disk; no resource contention; basis for most MPP systems |
| Shared-Disk Architecture | Nodes share a common storage layer but have independent CPUs/memory (e.g., Oracle RAC); cloud warehouses such as Snowflake combine shared storage with shared-nothing compute clusters |
| Data Vault | Modeling methodology for enterprise data warehouses; highly scalable, auditable, and adaptable (see Data Modeling section) |
| Operational Analytics | Running analytics directly on operational/transactional data with minimal latency |
| Logical Data Warehouse | Virtual data warehouse layer that federates queries across multiple physical data stores without moving data |
| Virtual Data Warehouse | Query federation layer that presents a unified schema across disparate sources |
| Data Virtualization | Technology that provides real-time, unified data access across disparate systems without physical data movement |
| CDP | Customer Data Platform — system that creates a unified, persistent customer database accessible to other systems |
| MDM | Master Data Management — discipline for defining and managing critical data assets (customers, products, locations) to ensure single authoritative version |
| Reference Data | Data that defines valid values for other data fields (e.g., country codes, currency codes, status values) |
| Landing Zone | Initial staging area where raw data is deposited before processing; often synonymous with Bronze layer |
| Staging Area | Temporary storage area used during ETL for intermediate transformations before loading to target |
| Cold Storage | Low-cost, high-latency storage tier for infrequently accessed archival data |
| Hot Storage | High-cost, low-latency storage tier for frequently accessed, active data |
| Warm Storage | Middle-tier storage with moderate cost and access latency |
| Polyglot Persistence | Using multiple database technologies (relational, document, graph, key-value) best suited to each workload within one system |
| Event Store | Append-only log of all state-changing events in a system; basis for event sourcing pattern |
| Time-Series Database | Optimized for storing and querying timestamped data points (e.g., InfluxDB, TimescaleDB) |
| Graph Database | Stores data as nodes and edges to represent and traverse relationships (e.g., Neo4j, Amazon Neptune) |
| Document Store | Stores data as JSON/BSON documents with flexible schema (e.g., MongoDB, Couchbase) |
| Key-Value Store | Simplest NoSQL model; stores data as key-value pairs (e.g., Redis, DynamoDB) |
| Wide-Column Store | Stores data in rows with dynamic columns; suited for time-series and sparse data (e.g., Cassandra, HBase) |
| NewSQL | Databases providing ACID guarantees of traditional RDBMS with horizontal scalability of NoSQL (e.g., CockroachDB, Spanner) |
| In-Memory Database | Stores data primarily in RAM for ultra-low latency (e.g., Redis, SAP HANA, SingleStore (formerly MemSQL)) |
| Data Swamp | A poorly governed data lake where data quality, lineage, and discoverability have broken down |
| Unified Namespace | Single logical namespace that abstracts all data sources for query purposes |
| Reverse ETL | Process of syncing data from a data warehouse back into operational systems (CRM, marketing tools) |
| Zero-Copy Cloning | Creating a metadata-only copy of a dataset without duplicating physical storage (Snowflake feature) |
| Data Sharing | Capability to share live data across accounts or organizations without copying (Snowflake, Databricks Delta Sharing) |
| External Table | Table definition pointing to data stored outside the database (e.g., files in S3) |
| Federated Query | Querying data across multiple heterogeneous systems from a single SQL interface |
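
The Medallion Architecture entries above describe a layered flow from raw to curated data. Below is a minimal sketch of that Bronze/Silver/Gold progression using PySpark with Delta Lake; the Spark session configuration, storage paths, and column names are illustrative assumptions, not prescribed by any particular platform.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes a Spark session with the Delta Lake extension available.
spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw source data as-is (hypothetical landing path).
bronze = spark.read.json("s3://landing/orders/")
bronze.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: clean, deduplicate, and conform the raw data.
silver = (
    spark.read.format("delta").load("/lake/bronze/orders")
    .dropDuplicates(["order_id"])
    .filter(F.col("order_amount") > 0)
    .withColumn("order_date", F.to_date("order_ts"))
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: business-level aggregate ready for BI consumption.
gold = silver.groupBy("order_date").agg(F.sum("order_amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_revenue")
```

Each layer is written as its own Delta table, so downstream consumers can pick the level of refinement they need.
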
## 2. Data Modeling
| Term | Full Form / Description |
|---|---|
| Kimball | Ralph Kimball's dimensional modeling methodology; centers on fact and dimension tables optimized for business user queries |
| Inmon | Bill Inmon's enterprise data warehouse approach; normalized 3NF central warehouse with subject-area data marts fed from it |
| Data Vault 2.0 | DV2.0 — Dan Linstedt's hybrid modeling approach with Hubs (business keys), Links (relationships), and Satellites (context); agile and auditable |
| Anchor Modeling | Ultra-normalized modeling approach using anchors, attributes, ties, and knots; handles change well but complex to query |
| Activity Schema | Modern analytics modeling pattern organizing all data around a single entity timeline |
| Dimensional Model | Fact-and-dimension schema design for analytical databases; optimized for query performance and business usability |
| Fact Table | Central table in a dimensional model storing numeric measures and foreign keys to dimensions (e.g., Sales Fact) |
| Dimension Table | Descriptive attributes table providing context for facts (e.g., Date, Customer, Product dimensions) |
| Star Schema | Dimensional model with a central fact table surrounded by denormalized dimension tables; simple and fast to query |
| Snowflake Schema | Normalized variant of star schema where dimension tables are further normalized into sub-dimensions |
| Galaxy Schema | Multiple fact tables sharing dimension tables; also called Fact Constellation |
| Bridge Table | Resolves many-to-many relationships between fact and dimension tables |
| Degenerate Dimension | Dimension attribute stored directly in the fact table rather than a separate dimension table (e.g., invoice number) |
| Junk Dimension | Combines low-cardinality miscellaneous flags/indicators into a single dimension to reduce fact table width |
| Role-Playing Dimension | Single dimension table used multiple times in a fact table for different purposes (e.g., Date as Order Date, Ship Date) |
| Outrigger | Secondary dimension table attached to a primary dimension table (not directly to fact); used sparingly |
| Conformed Dimension | Dimension shared across multiple fact tables or data marts with consistent meaning and values |
| Conformed Fact | Fact measure with consistent definition and granularity across data marts |
| Grain | The level of detail represented by a single row in a fact table; must be declared before design begins |
| Factless Fact Table | Fact table with no numeric measures; captures events or coverage relationships (e.g., student enrollment) |
| Accumulating Snapshot | Fact table pattern tracking the lifecycle of a process with multiple date stamps updated as milestones are reached |
| Periodic Snapshot | Fact table capturing state at regular intervals (daily, weekly, monthly) |
| Transaction Fact Table | Records individual business events or transactions at the lowest grain |
| SCD | Slowly Changing Dimension — technique for managing changes to dimension attributes over time |
| SCD Type 0 | Dimension attributes never change; historical value is retained forever |
| SCD Type 1 | Overwrite old value; no history kept; current state only |
| SCD Type 2 | Add a new row for each change; full history preserved with effective dates and current flag (see the sketch after this table) |
| SCD Type 3 | Add a new column for the previous value; limited history (typically tracks only one prior value) |
| SCD Type 4 | Separate history table: current values stay in the main dimension table while full change history is kept in a dedicated history table; a related variant splits rapidly changing attributes into a mini-dimension |
| SCD Type 6 | Hybrid combining Types 1, 2, and 3; adds current value column to Type 2 rows |
| Hub | Data Vault component storing a unique list of business keys with metadata (load date, source) |
| Link | Data Vault component capturing relationships between two or more Hubs |
| Satellite | Data Vault component storing descriptive context and history for a Hub or Link |
| PIT Table | Point-in-Time table in Data Vault — pre-joins satellites at specific snapshots for query performance |
| Bridge Table (DV) | Data Vault construct that pre-joins a chain of links and hubs for performance |
| Business Vault | Data Vault layer containing business rules and calculations applied to Raw Vault data |
| Raw Vault | Data Vault layer containing data as-received from sources, no business rules applied |
| 3NF | Third Normal Form — relational design in which every non-key attribute depends on the key, the whole key, and nothing but the key (no transitive dependencies); reduces redundancy |
| 1NF | First Normal Form — all column values are atomic (indivisible), no repeating groups |
| 2NF | Second Normal Form — 1NF plus all non-key attributes fully depend on the entire composite primary key |
| BCNF | Boyce-Codd Normal Form — stronger version of 3NF; every determinant is a candidate key |
| Denormalization | Intentionally adding redundancy to a normalized schema to improve read query performance |
| ERD | Entity-Relationship Diagram — visual representation of entities and their relationships in a data model |
| Conceptual Model | High-level model showing key entities and relationships; no technical detail; used for stakeholder communication |
| Logical Model | Detailed data model with entities, attributes, and relationships; technology-agnostic |
| Physical Model | Implementation-specific model with tables, columns, data types, indexes, and constraints |
| Surrogate Key | System-generated artificial primary key (integer or UUID) assigned to dimension rows |
| Natural Key | Business-assigned identifier that has meaning outside the database (e.g., customer ID, SSN) |
| Composite Key | Primary key made up of two or more columns |
| Foreign Key | Column(s) referencing the primary key of another table to enforce referential integrity |
| Business Key | Identifier used in the business domain to uniquely identify an entity; basis for Data Vault Hubs |
| Cardinality | The number of unique values in a column; also describes relationship types (1:1, 1:M, M:N) |
| Granularity | Level of detail in a dataset; fine grain = more rows, each representing a smaller unit of measurement |
| Normalization | Process of organizing data to reduce redundancy and improve data integrity |
| Data Type | Classification of a column's values (INTEGER, VARCHAR, DATE, BOOLEAN, etc.) |
| Null Handling | How missing/unknown values are represented and treated in queries and aggregations |
| Temporal Table | Table that automatically tracks row history with valid-time or transaction-time columns; ISO SQL:2011 standard |
| Bi-Temporal Modeling | Tracking both valid time (when something was true in reality) and transaction time (when it was recorded) |
| Polymorphic Association | Single table stores relationships to multiple entity types; common anti-pattern in relational modeling |
| Anti-Pattern | A modeling or design choice that seems reasonable but causes problems (e.g., EAV for structured data) |
| EAV | Entity-Attribute-Value — stores data as rows of key-value pairs; flexible but hard to query and validate |
| Wide Table | Denormalized table with many columns; common in analytics/columnar stores for query performance |
| Schema Evolution | The ability to change a data schema (add/remove columns) without breaking existing consumers |
| Semantic Key | A meaningful business key embedded in a surrogate; combines auditability and performance |
| Hash Key | MD5 or SHA-1 hash of business key fields used as surrogate in Data Vault for deterministic, parallel loading |
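
As referenced in the SCD Type 2 entry, the pattern expires the superseded current row and appends a new current row for each change. Below is a minimal pandas sketch of that logic, assuming a dimension table with `valid_from`, `valid_to`, and `is_current` housekeeping columns; the key and attribute names are placeholders.

```python
import pandas as pd

def apply_scd2(dim: pd.DataFrame, updates: pd.DataFrame, key: str,
               attrs: list, as_of: pd.Timestamp) -> pd.DataFrame:
    """Return a new dimension frame with Type 2 history applied."""
    current = dim[dim["is_current"]]
    cmp = updates.merge(current[[key] + attrs], on=key, how="left",
                        suffixes=("", "_cur"))

    # A row counts as changed if any tracked attribute differs from the current
    # version (brand-new keys compare against NaN and are therefore included).
    changed_mask = pd.Series(False, index=cmp.index)
    for a in attrs:
        changed_mask |= cmp[a].ne(cmp[f"{a}_cur"])
    changed = cmp.loc[changed_mask, [key] + attrs]

    # Expire the superseded rows: close valid_to and clear the current flag.
    dim = dim.copy()
    expire = dim["is_current"] & dim[key].isin(changed[key])
    dim.loc[expire, "valid_to"] = as_of
    dim.loc[expire, "is_current"] = False

    # Append the new current versions with open-ended validity.
    new_rows = changed.assign(valid_from=as_of, valid_to=pd.NaT, is_current=True)
    return pd.concat([dim, new_rows], ignore_index=True)
```

In a warehouse this is typically expressed as a MERGE statement plus an insert of the new current rows rather than dataframe logic.
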
## 3. Integration & Processing
| Term | Full Form / Description |
|---|---|
| ETL | Extract, Transform, Load — data integration pattern where transformation occurs before loading into target |
| ELT | Extract, Load, Transform — data is loaded raw into target system and transformed there using its compute power |
| CDC | Change Data Capture — technique for identifying and capturing data changes in source systems (insert, update, delete) |
| Log-Based CDC | CDC using database transaction logs (WAL, binlog) to capture changes without impacting source performance |
| Query-Based CDC | CDC using timestamps or watermarks in SQL queries to detect changed rows; higher source impact |
| Trigger-Based CDC | CDC using database triggers to capture changes; high overhead, generally avoided |
| Batch Processing | Processing data in discrete, scheduled chunks; high latency, high throughput |
| Micro-Batch | Near-real-time processing of small batches at short intervals (e.g., Spark Structured Streaming) |
| Stream Processing | Continuous processing of data as it arrives with very low latency (e.g., Apache Flink, Kafka Streams) |
| Real-Time Processing | Data processing with sub-second latency; used for alerts, fraud detection, live dashboards |
| Near-Real-Time | Processing with latency of seconds to minutes; acceptable for many operational analytics use cases |
| Data Pipeline | Series of processing steps that move and transform data from source to destination |
| Ingestion | Process of bringing data into a storage system from external sources |
| Message Queue | Asynchronous communication buffer between systems (e.g., RabbitMQ, Amazon SQS) |
| Event Streaming | Durable, ordered log of events accessible for replay by multiple consumers (e.g., Apache Kafka) |
| Kafka | Apache Kafka — distributed event streaming platform; uses topics, partitions, producers, and consumers |
| Pub/Sub | Publish-Subscribe messaging pattern where producers publish to topics and consumers subscribe independently |
| Topic | Named channel in a messaging system where events are published and consumed |
| Partition | Subdivision of a Kafka topic for parallelism and ordering guarantees |
| Consumer Group | Set of consumers that collectively read all partitions of a topic; enables parallel consumption |
| Exactly-Once Semantics | Processing guarantee that each message is processed exactly once, even in failure scenarios |
| At-Least-Once | Processing guarantee that a message will be processed at minimum once; duplicates possible |
| At-Most-Once | Processing guarantee that a message is processed no more than once; data loss possible |
| Idempotency | Property of an operation that can be applied multiple times without changing the result beyond the first application |
| Backpressure | Mechanism for a downstream system to signal an upstream system to slow data production |
| Watermark | In stream processing, a threshold indicating how late events can arrive and still be included in a window |
| Event Time | The time when an event actually occurred in the real world |
| Processing Time | The time when an event is processed by the stream processor |
| Tumbling Window | Fixed-size, non-overlapping time window for stream aggregations |
| Sliding Window | Overlapping time windows that advance by a step smaller than the window size |
| Session Window | Dynamic window that groups events within a period of activity, closing after a gap of inactivity |
| Late Arriving Data | Events that arrive after the expected processing window; requires special handling strategies |
| Upsert | Operation that inserts a new record or updates an existing one based on a key match (UPDATE + INSERT) |
| Merge | SQL/DML operation combining INSERT, UPDATE, and DELETE in a single statement based on match conditions |
| Full Refresh | Loading strategy that truncates and reloads an entire table; simple but expensive for large datasets |
| Incremental Load | Loading only new or changed records since the last extraction; requires reliable watermarking (see the sketch after this table) |
| Delta Load | Synonym for incremental load; loading only the "delta" (changes) since the last run |
| Data Replication | Copying data from one system to another to ensure availability, redundancy, or geographic distribution |
| Webhook | HTTP callback that pushes data to a URL when an event occurs; event-driven integration pattern |
| API Integration | Connecting systems via REST, SOAP, or GraphQL APIs to exchange data |
| REST | Representational State Transfer — stateless HTTP-based API design style; uses GET, POST, PUT, DELETE |
| GraphQL | Query language for APIs that allows clients to request exactly the data they need |
| gRPC | Google Remote Procedure Call — high-performance RPC framework using Protocol Buffers |
| Protocol Buffers | Protobuf — Google's binary serialization format; more efficient than JSON/XML |
| Avro | Apache Avro — compact binary data serialization format with schema evolution support; common in Kafka |
| Parquet | Apache Parquet — columnar binary storage format; efficient for analytical queries; default in many lake formats |
| ORC | Optimized Row Columnar — columnar format developed for Hive; includes built-in indexes and statistics |
| JSON | JavaScript Object Notation — lightweight, human-readable data interchange format |
| CSV | Comma-Separated Values — plain-text tabular format; ubiquitous but lacks schema enforcement |
| XML | Extensible Markup Language — hierarchical text-based format; verbose but self-describing |
| DAG | Directed Acyclic Graph — representation of pipeline dependencies; used by Airflow, dbt, and other orchestrators |
| Orchestration | Coordinating and scheduling the execution of pipeline tasks and dependencies |
| Airflow | Apache Airflow — open-source workflow orchestration platform using Python DAGs |
| Prefect | Modern Python-native orchestration platform with dynamic workflows |
| Dagster | Data-aware orchestration platform with built-in asset tracking and lineage |
| dbt | Data Build Tool — SQL-based transformation framework for the ELT pattern; version-controls SQL models |
| Fivetran | Managed connector service for automated data ingestion from SaaS and database sources |
| Airbyte | Open-source data integration platform with a large connector catalog |
| Singer | Open-source data integration specification using taps (sources) and targets (destinations) |
| Debezium | Open-source CDC platform that captures database changes via transaction logs |
| Flink | Apache Flink — distributed stream processing framework with exactly-once guarantees |
| Spark | Apache Spark — unified analytics engine for large-scale batch and streaming data processing |
| Spark Streaming | Legacy DStream-based micro-batch streaming API in Apache Spark |
| Structured Streaming | Spark's newer streaming API built on Spark SQL/DataFrames; micro-batch by default, with an experimental continuous mode |
| Beam | Apache Beam — unified programming model for batch and streaming pipelines; runs on multiple runners |
| Dataflow | Google Cloud Dataflow — managed Apache Beam service |
| Kinesis | Amazon Kinesis — managed real-time data streaming service |
| Event Hub | Azure Event Hubs — fully managed event ingestion service; Kafka-compatible |
| Pub/Sub (GCP) | Google Cloud Pub/Sub — managed messaging and ingestion service |
| SQS | Amazon Simple Queue Service — managed message queuing service |
| SNS | Amazon Simple Notification Service — managed pub/sub messaging for fan-out patterns |
| Data Contracts | Formal agreement between data producers and consumers defining schema, quality, and SLA expectations |
| Schema-on-Read | Schema is applied when data is read, not when it is written; enables flexible ingestion |
| Schema-on-Write | Schema is enforced when data is written; ensures data quality at the point of ingestion |
| Data Serialization | Converting data structures to a format suitable for storage or transmission |
| Compression | Reducing data size using algorithms (GZIP, Snappy, LZ4, ZSTD); critical for storage cost and I/O performance |
| Partitioning | Dividing data into logical segments (by date, region, etc.) to improve query performance and data management |
| Bucketing | Sub-partitioning data into fixed buckets by hash of a column; improves join and aggregation performance |
| Z-Ordering | Multi-dimensional clustering technique (Delta Lake) that co-locates related data to reduce query I/O |
| Clustering | Physically organizing data on disk by one or more columns to improve range query performance |
| Predicate Pushdown | Filtering applied at the storage layer before data reaches the compute layer; reduces the amount of data scanned |
| Data Skew | Uneven distribution of data across partitions causing some tasks to run much longer than others |
| Shuffle | Redistribution of data across partitions during operations like joins and aggregations in distributed systems |
| Broadcast Join | Join optimization where a small table is replicated to all nodes to avoid shuffling a large table |
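
The incremental-load and watermark entries above describe extracting only what changed since the last run and applying it safely. The sketch below shows that pattern with an idempotent upsert, using in-memory pandas frames as stand-ins for the source and target tables; the `order_id` and `updated_at` columns are illustrative.

```python
import pandas as pd

def incremental_load(source: pd.DataFrame, target: pd.DataFrame,
                     last_watermark: pd.Timestamp, key: str = "order_id"):
    """Watermark-driven incremental load with an idempotent upsert."""
    # Extract: only rows changed since the previous run (the watermark).
    changed = source[source["updated_at"] > last_watermark]
    if changed.empty:
        return target, last_watermark          # nothing new; keep the old watermark

    # Load (upsert): rows with an existing key replace the old version, new keys append.
    merged = (pd.concat([target, changed])
                .drop_duplicates(subset=[key], keep="last")
                .reset_index(drop=True))

    # Advance the watermark so the next run picks up only newer changes.
    return merged, changed["updated_at"].max()
```

Re-running the function with the same inputs leaves the target unchanged, which is the idempotency property described above.
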
## 4. Metadata & Data Governance
| Term | Full Form / Description |
|---|---|
| DAMA | Data Management Association — professional organization that publishes the DMBOK framework |
| DMBOK | Data Management Body of Knowledge — DAMA's comprehensive framework for data management disciplines |
| Data Governance | Framework of policies, processes, standards, and roles that ensure data is managed as a strategic asset |
| Data Steward | Person responsible for the quality and fitness of a specific data domain or dataset |
| Data Owner | Business executive accountable for the quality, security, and appropriate use of a data asset |
| Data Custodian | IT role responsible for technical management and storage of data assets |
| Data Catalog | Metadata repository providing searchable inventory of data assets with descriptions, lineage, and quality info |
| Business Glossary | Curated dictionary of business terms with agreed definitions; foundation for data governance |
| Data Dictionary | Technical documentation of datasets, tables, and columns including types, constraints, and descriptions |
| Data Lineage | End-to-end tracking of data origin, movement, transformations, and consumption |
| Impact Analysis | Using lineage to assess downstream effects of a proposed schema or pipeline change |
| Active Metadata | Metadata that drives automated decisions and actions (e.g., triggering quality checks, routing data) |
| Passive Metadata | Metadata used for documentation and discovery but not for automation |
| Technical Metadata | Information about data structure, format, storage, and access (table schemas, file sizes, partitions) |
| Business Metadata | Context describing business meaning, ownership, and usage policies |
| Operational Metadata | Information about data pipelines, job runs, data volumes, and processing history |
| Data Classification | Categorizing data by sensitivity level (Public, Internal, Confidential, Restricted) |
| PII | Personally Identifiable Information — any data that can identify a specific individual (name, SSN, email) |
| PHI | Protected Health Information — health data protected under HIPAA regulations |
| PCI DSS | Payment Card Industry Data Security Standard — security standard for handling cardholder data |
| GDPR | General Data Protection Regulation — EU regulation governing personal data collection and processing |
| CCPA | California Consumer Privacy Act — California law giving consumers rights over personal data |
| HIPAA | Health Insurance Portability and Accountability Act — US law protecting medical information |
| Data Residency | Requirement that data be stored and processed within specific geographic boundaries |
| Data Sovereignty | Legal concept that data is subject to the laws of the country where it is collected or stored |
| Right to be Forgotten | GDPR right allowing individuals to request deletion of their personal data |
| Data Minimization | Principle of collecting only the minimum data necessary for a stated purpose |
| Purpose Limitation | Principle that data collected for one purpose should not be used for a different purpose |
| Consent Management | Systems and processes for capturing, storing, and enforcing user consent for data processing |
| Data Retention Policy | Rules governing how long data is kept before archival or deletion |
| Data Lifecycle Management | DLM — governing data from creation through archival and deletion |
| Data Access Control | Policies and mechanisms controlling who can read, write, or modify specific data assets |
| RBAC | Role-Based Access Control — granting data access based on user roles rather than individual identities |
| ABAC | Attribute-Based Access Control — access decisions based on attributes of users, resources, and environment (contrasted with RBAC in the sketch after this table) |
| Column-Level Security | Restricting access to specific columns in a table based on user role or attribute |
| Row-Level Security | RLS — filtering rows returned to a user based on their identity or role |
| Dynamic Data Masking | Masking sensitive data in query results without changing the stored data |
| Tokenization | Replacing sensitive data values with non-sensitive tokens; original value stored in a secure vault |
| Encryption at Rest | Encrypting data while stored on disk; protects against physical media theft |
| Encryption in Transit | Encrypting data as it moves over networks (TLS/SSL); protects against interception |
| Key Management | Managing cryptographic keys for encryption; critical for key rotation and access control |
| Audit Trail | Immutable log of all data access and modification events for compliance and forensic purposes |
| Data Trust Score | Metric quantifying the reliability and quality of a dataset based on multiple quality dimensions |
| Data Policy | Formal rules governing data collection, use, storage, sharing, and disposal |
| Data Standard | Agreed specifications for data formats, definitions, coding, and quality rules |
| Interoperability | Ability of different systems to exchange and use data without special integration effort |
| Data Portability | Ability to transfer data from one system to another in a usable format |
| Metadata Management | Discipline for collecting, storing, maintaining, and using metadata to improve data usability |
| Apache Atlas | Open-source metadata management and governance platform for Hadoop ecosystems |
| Collibra | Enterprise data governance and catalog platform |
| Alation | AI-based data catalog and governance platform |
| Atlan | Modern collaborative data catalog with active metadata capabilities |
| DataHub | LinkedIn's open-source metadata platform for data discovery and lineage |
| OpenMetadata | Open-source metadata platform and data catalog |
| Unity Catalog | Databricks' unified governance layer for data and AI assets |
| Informatica IDMC | Informatica Intelligent Data Management Cloud — enterprise data management platform |
| Waterline Data | AI-driven data discovery and cataloging tool (acquired by Hitachi Vantara) |
| Data Contract | Formal schema and quality agreement between data producers and consumers |
| Schema Registry | Central repository for managing and versioning event schemas (Confluent, AWS Glue) |
| Data Mesh Governance | Federated computational governance in Data Mesh; policies enforced via automated platforms, not central teams |
| FAIR Principles | Findable, Accessible, Interoperable, Reusable — principles for scientific data management |
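
The RBAC and ABAC entries differ in what drives the access decision: the user's role versus attributes of the user, the resource, and the environment. The toy functions below contrast the two; the roles, grants, and policy rule are invented purely for illustration.

```python
# Role-based: the grant depends only on the caller's role.
ROLE_GRANTS = {"analyst": {"read"}, "steward": {"read", "update"}}

def rbac_allows(role: str, action: str) -> bool:
    return action in ROLE_GRANTS.get(role, set())

# Attribute-based: the decision combines user, resource, and action attributes.
def abac_allows(user: dict, resource: dict, action: str) -> bool:
    # Example policy: anyone may read non-restricted data from their own region.
    return (
        action == "read"
        and resource["classification"] != "Restricted"
        and user["region"] == resource["region"]
    )

print(rbac_allows("analyst", "read"))   # True
print(abac_allows({"region": "EU"},
                  {"classification": "Internal", "region": "EU"}, "read"))  # True
```

In a warehouse the same ideas surface as role grants (RBAC) and as row-level security or masking policies driven by user and data attributes (ABAC-style rules).
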
## 5. Semantic Layer & Ontology
| Term | Full Form / Description |
|---|---|
| Semantic Layer | Abstraction that translates business terms into technical data queries; sits between data and BI tools |
| Metric Layer | Centralized definition of business metrics (revenue, churn) ensuring consistent calculation everywhere |
| Headless BI | Metric definitions and business logic decoupled from any specific BI tool; accessible via API |
| Metrics Store | Repository of defined, versioned, and governed business metrics (e.g., dbt Metrics, Cube.js) |
| Ontology | Formal representation of knowledge including concepts, categories, and relationships in a domain |
| Knowledge Graph | Graph structure that represents real-world entities and semantic relationships between them |
| RDF | Resource Description Framework — W3C standard for representing information as subject-predicate-object triples |
| RDFS | RDF Schema — vocabulary extension for RDF providing class and property hierarchies |
| OWL | Web Ontology Language — W3C standard for creating rich ontologies on top of RDF |
| SPARQL | SPARQL Protocol and RDF Query Language — query language for RDF data stores |
| Triple | Atomic unit of RDF data: subject (entity) + predicate (relationship) + object (entity or value); see the sketch after this table |
| Named Graph | RDF graph with an associated URI; allows managing and querying subsets of a triple store |
| Linked Data | Practice of using RDF and HTTP URIs to publish and connect structured data on the web |
| SKOS | Simple Knowledge Organization System — W3C standard for expressing controlled vocabularies |
| Taxonomy | Hierarchical classification system for organizing concepts into parent-child relationships |
| Thesaurus | Vocabulary of preferred terms with synonyms, broader/narrower terms, and related terms |
| Controlled Vocabulary | Standardized set of terms used for consistent tagging and classification |
| URI | Uniform Resource Identifier — globally unique identifier for resources in linked data and the web |
| Property Graph | Graph model where nodes and edges can have properties; used by Neo4j, TigerGraph |
| Cypher | Declarative query language for property graphs; used by Neo4j |
| Gremlin | Graph traversal language for property graphs; part of Apache TinkerPop |
| Triple Store | Database optimized for storing and querying RDF triples (e.g., Stardog, GraphDB, Amazon Neptune) |
| Inference / Reasoning | Deriving new facts from existing knowledge using ontology rules (e.g., OWL reasoning) |
| Semantic Search | Search that understands the meaning and context of queries rather than just keyword matching |
| Entity Resolution | Process of identifying when records from different sources refer to the same real-world entity |
| Entity Extraction | NLP technique for identifying named entities (people, places, organizations) in unstructured text |
| Knowledge Representation | Formal encoding of domain knowledge for automated reasoning |
| dbt Semantic Layer | dbt's implementation of a metric layer allowing consistent metric definitions across BI tools |
| LookML | Looker's proprietary data modeling language for defining metrics, dimensions, and explores |
| AtScale | Semantic layer and universal data model platform |
| Cube.js | Open-source analytical API platform with a metric/semantic layer |
| Superset Semantic Layer | Apache Superset's virtual dataset layer for metric definitions |
| Open Metadata Standards | Efforts like OpenLineage, OpenMetadata, and dbt Contracts to standardize metadata exchange |
| Conceptual Schema | Technology-agnostic representation of business concepts and their relationships |
| Domain Model | Model representing concepts, relationships, and rules within a specific business domain |
| Fact vs. Dimension (semantic) | Semantic layer concepts mapping to numerical metrics (facts) vs. descriptive attributes (dimensions) |
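
The RDF triple and SPARQL entries above can be made concrete with a few lines of rdflib (assuming that Python package is available); the `http://example.org/` namespace and the facts asserted here are made up.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# Each add() asserts one triple: subject, predicate, object.
g.add((EX.acme, RDF.type, EX.Customer))
g.add((EX.acme, EX.locatedIn, Literal("Berlin")))

# SPARQL query over the small graph: find customers and their locations.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?c ?city WHERE { ?c a ex:Customer ; ex:locatedIn ?city . }
""")
for row in results:
    print(row.c, row.city)
```
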
## 6. BI & Analytics
| Term | Full Form / Description |
|---|---|
| OLAP | Online Analytical Processing — multi-dimensional analysis of business data for decision support |
| ROLAP | Relational OLAP — OLAP implemented on relational databases using SQL; no pre-aggregated cubes |
| MOLAP | Multidimensional OLAP — data stored in pre-aggregated multidimensional arrays; fast queries, less flexible |
| HOLAP | Hybrid OLAP — combines ROLAP and MOLAP; aggregates in multidimensional store, details in relational |
| Cube | Multi-dimensional data structure with pre-calculated aggregates along dimensions and hierarchies |
| MDX | MultiDimensional Expressions — query language for OLAP cubes (Microsoft Analysis Services, Essbase) |
| DAX | Data Analysis Expressions — formula language used in Power BI, Power Pivot, and SSAS Tabular models |
| Power Query / M | Microsoft's data transformation language and engine embedded in Power BI and Excel |
| Measure | Numeric value calculated in a BI context (sum, average, count); can be implicit or explicit |
| Dimension (BI) | Categorical attribute used to slice and filter measures (Time, Geography, Product) |
| Hierarchy | Ordered levels within a dimension enabling drill-down (Year > Quarter > Month > Day) |
| KPI | Key Performance Indicator — quantifiable metric reflecting critical business objectives |
| OKR | Objectives and Key Results — goal-setting framework pairing aspirational goals with measurable outcomes |
| Drill-Down | Navigating from a summary level to more detailed data (e.g., Year → Quarter → Month) |
| Drill-Up / Roll-Up | Aggregating detail data to a higher summary level |
| Drill-Through | Accessing underlying transaction-level detail records from a summarized BI view |
| Slice | Filtering a cube on one dimension to a specific value |
| Dice | Filtering a cube on multiple dimensions simultaneously |
| Pivot | Rotating a data table to exchange rows and columns for different analytical perspectives |
| Cross-Tab | Cross-tabulation — matrix display with row and column totals; pivot table equivalent |
| Scorecard | Dashboard presenting KPIs against targets, typically using RAG (Red/Amber/Green) status indicators |
| Dashboard | Visual display of key metrics and data visualizations for at-a-glance monitoring |
| Report | Structured presentation of data with filtering and formatting; less interactive than a dashboard |
| Ad Hoc Query | Unscheduled, user-defined query created on-demand to answer a specific business question |
| LOD | Level of Detail — Tableau expression type (FIXED, INCLUDE, EXCLUDE) for controlling aggregation scope |
| Table Calculation | Tableau calculations performed on the result set after aggregation, not on underlying data |
| Calculated Field | User-defined computed column or measure within a BI tool |
| Aggregation | Summarizing multiple rows into a single value (SUM, COUNT, AVG, MIN, MAX) |
| Running Total | Cumulative sum of a measure over an ordered dimension (e.g., year-to-date revenue); see the sketch after this table |
| YTD | Year-to-Date — cumulative value from the start of the fiscal/calendar year to current date |
| MTD | Month-to-Date — cumulative value from the start of the current month |
| QTD | Quarter-to-Date — cumulative value from the start of the current quarter |
| Period-over-Period | Comparing a metric in one period to the same metric in a prior period (YoY, MoM, WoW) |
| Time Intelligence | BI functions for date-based calculations (SAMEPERIODLASTYEAR, DATEADD in DAX) |
| Parameterized Report | Report with user-defined input parameters that filter or modify the output |
| Paginated Report | Fixed-layout, print-optimized report; suitable for invoices, statements, pixel-perfect output |
| Self-Service BI | BI tools enabling business users to create their own reports and analyses without IT assistance |
| Augmented Analytics | AI/ML-powered analytics that automates insight discovery, data preparation, and explanation |
| Natural Language Query | NLQ — querying data using plain language questions; enabled by NLP in BI tools |
| Embedded Analytics | BI capabilities integrated directly into business applications rather than standalone tools |
| Pixel-Perfect Reporting | Reports with precise layout control for printing; contrasted with interactive dashboards |
| Semantic Model | In Power BI, the dataset layer that defines measures, hierarchies, relationships, and row-level security |
| Tabular Model | Microsoft SSAS model type storing data in-memory in columnar format for fast DAX queries |
| Multidimensional Model | Microsoft SSAS model using MDX cubes with pre-aggregated measures |
| Composite Model | Power BI model combining DirectQuery and import sources in the same dataset |
| DirectQuery | Power BI mode that sends live queries to the source system instead of importing data (Tableau's equivalent is a live connection) |
| Import Mode | Power BI mode that loads data into an in-memory columnar engine (VertiPaq) |
| VertiPaq | In-memory columnar storage engine powering Power BI's import mode |
| Tableau | Popular BI and data visualization platform known for drag-and-drop exploration |
| Looker | Google's data platform for BI with LookML semantic modeling |
| Power BI | Microsoft's BI suite including Desktop, Service, and Mobile components |
| MicroStrategy | Enterprise BI platform known for large-scale deployments and HyperIntelligence |
| Qlik | BI platform with associative data model engine (QlikView, Qlik Sense) |
| Superset | Apache Superset — open-source BI and data visualization platform |
| Metabase | Open-source BI tool designed for simplicity and self-service |
| Redash | Open-source data visualization and querying tool |
| Grafana | Open-source observability and visualization platform; strong for time-series and operational metrics |
| A/B Testing | Controlled experiment comparing two variants (A and B) to determine which performs better |
| Cohort Analysis | Analyzing behavior of groups of users who share a common characteristic at a point in time |
| Funnel Analysis | Tracking sequential steps users take toward a conversion goal |
| Retention Analysis | Measuring how many users return over time after initial engagement |
| Attribution Modeling | Assigning credit for a conversion to different marketing touchpoints |
| Segmentation | Dividing users or data into groups based on shared attributes for targeted analysis |
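
The running-total, YTD, and period-over-period entries describe standard time-intelligence calculations. The pandas sketch below reproduces them outside any BI tool; the monthly revenue series is fabricated.

```python
import pandas as pd

df = pd.DataFrame({
    "month": pd.date_range("2023-01-01", periods=24, freq="MS"),
    "revenue": range(100, 124),
})

# Running total that resets each year (YTD revenue).
df["ytd_revenue"] = df.groupby(df["month"].dt.year)["revenue"].cumsum()

# Period-over-period: compare against the same month one year earlier (YoY).
df["revenue_prior_year"] = df["revenue"].shift(12)
df["yoy_growth"] = df["revenue"] / df["revenue_prior_year"] - 1

print(df.tail(3))
```

DAX time-intelligence functions such as SAMEPERIODLASTYEAR express the same logic inside Power BI.
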
## 7. Cloud & Infrastructure
| Term | Full Form / Description |
|---|---|
| Columnar Storage | Data stored by column rather than row; dramatically improves analytical query performance and compression (see the sketch after this table) |
| Row-Based Storage | Traditional RDBMS storage; efficient for OLTP (full-row access) but poor for analytical queries that scan a few columns across many rows |
| Compression Ratio | Measure of how much data is reduced by compression; columnar formats achieve much higher ratios than row-based |
| Data Skipping | Using metadata (min/max statistics, bloom filters) to skip irrelevant data files during query execution |
| Bloom Filter | Probabilistic data structure used to test whether an element is in a set; used in Parquet and Iceberg |
| Zone Maps | Column-level min/max statistics stored in file metadata for data skipping optimization |
| Open Table Format | Specification defining how data lake files are organized and queried with ACID support |
| Delta Lake | Linux Foundation open table format providing ACID transactions and schema enforcement on data lakes |
| Apache Iceberg | Open table format for huge analytic datasets with schema evolution and time travel |
| Apache Hudi | Hadoop Upserts Deletes and Incrementals — open table format with CDC and incremental processing |
| ACID | Atomicity, Consistency, Isolation, Durability — properties guaranteeing reliable database transactions |
| Atomicity | Transaction property: all operations succeed or none do; no partial updates |
| Consistency (ACID) | Transaction property: database moves from one valid state to another |
| Isolation | Transaction property: concurrent transactions don't interfere with each other |
| Durability | Transaction property: committed transactions survive system failures |
| MVCC | Multi-Version Concurrency Control — allows concurrent reads and writes by maintaining multiple data versions |
| Time Travel | Ability to query data as it existed at a past point in time; supported by Delta Lake, Iceberg, Snowflake |
| Data Versioning | Tracking changes to datasets over time, enabling rollback and historical queries |
| Snapshot Isolation | Transaction isolation level where reads see a consistent snapshot of data at the transaction start time |
| Write-Ahead Log | WAL — transaction log written before data changes; enables crash recovery and CDC |
| Optimistic Concurrency | Allows multiple transactions to proceed without locking; checks for conflicts at commit time |
| Pessimistic Concurrency | Locks data resources during a transaction to prevent conflicts; reduces throughput |
| Manifest File | Metadata file in Iceberg/Delta that tracks which data files are part of a table snapshot |
| Transaction Log | Ordered record of all changes to a Delta Lake table; basis for ACID properties and time travel |
| Compaction | Process of merging many small files into fewer larger files to improve read performance |
| Small File Problem | Performance issue where too many small files cause excessive metadata overhead and slow queries |
| Auto-Optimize | Databricks feature that automatically compacts small files |
| Liquid Clustering | Databricks' dynamic, incremental replacement for static partition-based clustering |
| Copy-on-Write | Table format update strategy: rewrites entire data files on every update; better for read-heavy workloads |
| Merge-on-Read | Table format update strategy: writes delta files on update; merges at read time; better for write-heavy workloads |
| Serverless | Computing model where infrastructure management is abstracted away; auto-scales and billed per use |
| Elastic Scaling | Automatically adding or removing compute resources based on workload demand |
| Separation of Storage and Compute | Architecture where data storage and query compute scale independently; foundational to Snowflake, BigQuery |
| Virtual Warehouse | Snowflake's compute cluster abstraction; each warehouse is sized, scaled, and suspended independently of storage and of other warehouses |
| Slot | BigQuery's unit of computational capacity; allocated from a shared pool under on-demand pricing or purchased via capacity reservations |
| Redshift | Amazon Redshift — managed MPP data warehouse on AWS |
| BigQuery | Google BigQuery — serverless, multi-cloud data warehouse |
| Snowflake | Cloud data platform with separated storage and compute; supports multi-cloud deployment |
| Synapse Analytics | Azure Synapse Analytics — integrated analytics service combining data warehousing and big data |
| Databricks | Unified analytics platform built on Apache Spark; pioneered the Lakehouse concept |
| S3 | Amazon Simple Storage Service — object storage; de facto standard for data lake storage |
| ADLS | Azure Data Lake Storage — Microsoft's scalable object storage for analytics workloads |
| GCS | Google Cloud Storage — Google's object storage service |
| Object Storage | Storage model using objects with metadata and unique IDs; highly scalable, no hierarchy |
| Block Storage | Raw storage volumes (EBS, Azure Disks); low latency, used for databases |
| File Storage | Hierarchical filesystem-based storage (EFS, Azure Files, NFS) |
| Data Transfer Cost | Cloud charges for moving data between regions, availability zones, or out of cloud |
| Egress Cost | Charges for data transferred out of a cloud provider's network |
| Reserved Capacity | Pre-purchasing cloud compute/storage at discounted rates vs. on-demand pricing |
| Spot/Preemptible Instance | Unused cloud capacity sold at steep discount; can be interrupted; used for fault-tolerant batch jobs |
| VPC | Virtual Private Cloud — isolated network environment within a cloud provider |
| Private Link | Direct private network connection between services without traversing the public internet |
| IAM | Identity and Access Management — cloud service for managing authentication and authorization |
| Service Account | Non-human identity used by applications and services for authentication to cloud APIs |
| Encryption Key Management | KMS — managed service for creating and controlling encryption keys (AWS KMS, Azure Key Vault, GCP KMS) |
| Data Plane | The infrastructure layer that processes and stores actual data |
| Control Plane | The infrastructure layer managing metadata, configuration, and orchestration |
| Multi-Cloud | Strategy of using services from multiple cloud providers to avoid lock-in or optimize cost/capability |
| Hybrid Cloud | Combining on-premises infrastructure with public cloud services |
| Data Gravity | Tendency for data to attract applications and services; moving large datasets is expensive |
| Vendor Lock-In | Dependency on a single vendor's proprietary technology that makes migration costly |
| Open Standards | Non-proprietary specifications enabling interoperability (Parquet, Iceberg, OpenTelemetry) |
| Cost Allocation Tags | Cloud resource tags used for billing attribution by team, project, or environment |
| FinOps | Financial Operations — discipline for managing and optimizing cloud spend through collaboration between engineering, finance, and business teams |
| Infrastructure as Code | IaC — managing infrastructure through version-controlled configuration files (Terraform, Pulumi) |
| Terraform | HashiCorp's open-source IaC tool for provisioning cloud infrastructure |
| Docker | Container platform enabling consistent application packaging and deployment |
| Kubernetes | K8s — container orchestration system for automating deployment, scaling, and management |
| Helm | Package manager for Kubernetes applications |
| CI/CD | Continuous Integration/Continuous Delivery — automated build, test, and deployment pipelines |
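
Columnar storage, column pruning, and predicate pushdown can be demonstrated with Apache Arrow and Parquet. This sketch assumes the pyarrow package and writes a throwaway local file; the columns and values are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Write a tiny columnar file; Parquet stores per-column chunks with min/max stats.
table = pa.table({"region": ["EU", "US", "EU"], "amount": [10, 20, 30]})
pq.write_table(table, "sales.parquet")

# Read it back through the dataset API, pushing the filter down and pruning columns.
dataset = ds.dataset("sales.parquet", format="parquet")
eu_only = dataset.to_table(
    filter=ds.field("region") == "EU",   # predicate pushdown
    columns=["amount"],                  # column pruning
)
print(eu_only.to_pandas())
```

The filter is evaluated against Parquet row-group statistics, so row groups whose min/max values rule out a match can be skipped entirely (the data-skipping and zone-map entries above).
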
## 8. Data Quality & Observability
| Term | Full Form / Description |
|---|---|
| Data Quality | Degree to which data is fit for its intended use; measured across multiple dimensions |
| DQ Dimensions | Standard aspects of data quality: completeness, accuracy, consistency, timeliness, validity, uniqueness (see the sketch after this table) |
| Completeness | DQ dimension: percentage of required fields that contain values; no unexpected nulls |
| Accuracy | DQ dimension: data correctly reflects the real-world entity or event it describes |
| Consistency (DQ) | DQ dimension: data values are consistent across systems and over time |
| Timeliness | DQ dimension: data is available when needed and reflects current reality |
| Validity | DQ dimension: data conforms to defined formats, ranges, and rules |
| Uniqueness | DQ dimension: no unintended duplicate records exist |
| Integrity | DQ dimension: referential and relational integrity is maintained across datasets |
| Conformity | DQ dimension: data adheres to specified data standards and formats |
| Data Anomaly | Unexpected deviation from normal data patterns; may indicate quality issues or real events |
| Anomaly Detection | Automated identification of unusual patterns in data (statistical, ML-based approaches) |
| Data Drift | Gradual change in data distribution over time that degrades model or report accuracy |
| Schema Drift | Unexpected changes to a data source's schema that break downstream pipelines |
| Data Observability | End-to-end visibility into data health including freshness, volume, schema, distribution, and lineage |
| Data Reliability | Consistent availability of high-quality, trustworthy data to consumers |
| Freshness | How recently data was updated; critical SLA for operational and real-time analytics |
| Volume Anomaly | Unexpected spike or drop in row counts, signaling ingestion or source issues |
| Distribution Shift | Change in statistical distribution of column values over time |
| Null Rate | Percentage of null values in a column; tracked as a quality and freshness indicator |
| Duplicate Rate | Percentage of duplicate records in a dataset |
| Data SLA | Service Level Agreement for data products defining freshness, completeness, and availability targets |
| Great Expectations | Open-source Python library for defining, documenting, and validating data quality expectations |
| dbt Tests | dbt's built-in and extensible data testing framework (not_null, unique, accepted_values, relationships) |
| Soda | Data quality platform for defining and running data checks across SQL and file sources |
| Monte Carlo | Data observability platform using ML to detect and alert on data quality issues |
| Bigeye | Data observability and monitoring platform with automated anomaly detection |
| Acceldata | Data observability and pipeline intelligence platform |
| MoD | Metrics on Data — tracking operational metrics (row count, null %, freshness) on datasets over time |
| Data Testing | Automated validation of data against defined expectations or business rules |
| Unit Test (data) | Testing individual transformation logic with known inputs and expected outputs |
| Integration Test (data) | End-to-end test validating that the complete pipeline produces correct results |
| Data Reconciliation | Comparing data between source and target systems to verify completeness and accuracy of transfers |
| Checksum | Hash value computed over data to verify integrity during transfer or storage |
| Row Count Validation | Comparing record counts between source and target as a basic completeness check |
| Referential Integrity | Constraint ensuring foreign key values exist in the referenced primary key table |
| Business Rule Validation | Testing data against domain-specific rules (e.g., order amount must be positive) |
| Statistical Process Control | SPC — applying statistical methods to monitor and control data quality over time |
| Control Chart | Visualization plotting quality metrics over time with upper/lower control limits |
| Data Profiling | Automated analysis of data to understand structure, completeness, distribution, and relationships |
| Column Statistics | Min, max, mean, median, standard deviation, null count computed per column |
| Cardinality Check | Validating the number of distinct values in a column against expectations |
| Pattern Matching | Validating that values conform to expected formats using regex (e.g., email, phone) |
| Range Check | Validating that numeric or date values fall within expected bounds |
| Cross-Field Validation | Validating relationships between multiple columns (e.g., end_date >= start_date) |
| Golden Record | The authoritative, trusted version of a master data entity after deduplication |
| Data Deduplication | Process of identifying and removing duplicate records from a dataset |
| Fuzzy Matching | Identifying similar (not exact) records using string distance algorithms |
| Record Linkage | Linking records across systems that refer to the same entity without a common identifier |
| Probabilistic Matching | Matching records using statistical likelihood scores across multiple attributes |
| Data Quality Score | Composite metric summarizing overall quality of a dataset across multiple dimensions |
| Data Health Dashboard | Centralized view of data quality metrics and SLA compliance across data assets |
| Alerting | Automated notifications when data quality metrics breach defined thresholds |
| Root Cause Analysis | Systematic process for identifying the underlying cause of a data quality issue |
| Incident Management | Process for tracking, escalating, and resolving data quality incidents |
| Data Debugging | Tracing data issues through pipelines using lineage and observability tools |
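
The data quality dimensions listed above translate directly into checks. Below is a minimal pandas sketch of completeness, uniqueness, range, and cross-field validations; the column names and thresholds are illustrative.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    return {
        "null_rate_email": df["email"].isna().mean(),                   # completeness
        "duplicate_rate": df.duplicated(subset=["order_id"]).mean(),    # uniqueness
        "amount_in_range": bool(df["amount"].between(0, 1_000_000).all()),      # range check
        "dates_consistent": bool((df["end_date"] >= df["start_date"]).all()),   # cross-field validation
    }

orders = pd.DataFrame({
    "order_id": [1, 2, 2],
    "email": ["a@x.com", None, "c@x.com"],
    "amount": [10.0, 20.0, -5.0],
    "start_date": pd.to_datetime(["2024-01-01"] * 3),
    "end_date": pd.to_datetime(["2024-01-02"] * 3),
})
print(quality_report(orders))
```

Frameworks such as Great Expectations, Soda, or dbt tests package the same kinds of checks with scheduling, documentation, and alerting.
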
## 9. Emerging Concepts
| Term | Full Form / Description |
|---|---|
| Data Product | A self-contained, discoverable, addressable, trustworthy, and interoperable data asset served by a domain team |
| Domain Ownership | Data Mesh principle: teams closest to the data own its quality, availability, and access |
| Data as a Product | Data Mesh principle: applying product thinking (SLAs, discoverability, documentation) to data assets |
| Self-Serve Data Platform | Data Mesh principle: infrastructure platform enabling domain teams to build and serve data products independently |
| Federated Computational Governance | Data Mesh principle: global policies enforced through automated platforms rather than central data teams |
| DataOps | Agile methodology applying DevOps principles to data engineering; emphasizes automation, collaboration, and quality |
| MLOps | Machine Learning Operations — practices for deploying, monitoring, and maintaining ML models in production |
| LLMOps | Practices for deploying and managing Large Language Model applications in production |
| ModelOps | Operationalizing all types of AI/ML models including statistical, ML, and deep learning models |
| Feature Store | Centralized repository for storing, sharing, and serving ML features for training and inference |
| Feature Engineering | Transforming raw data into features (input variables) that improve ML model performance |
| Feature Serving | Providing low-latency access to ML features for real-time model inference |
| Online Store | Feature store layer for low-latency feature retrieval for real-time inference (e.g., Redis) |
| Offline Store | Feature store layer for high-throughput feature retrieval for batch training (e.g., S3, BigQuery) |
| Training-Serving Skew | Discrepancy between features used in model training vs. what's served in production; major ML risk |
| Model Registry | Versioned repository for storing, tracking, and managing ML model artifacts |
| Experiment Tracking | Recording hyperparameters, metrics, and artifacts from ML training runs (MLflow, W&B) |
| MLflow | Open-source platform for managing the end-to-end ML lifecycle |
| Kubeflow | Kubernetes-native platform for deploying and managing ML workflows |
| Vector Database | Database optimized for storing and querying high-dimensional embedding vectors (Pinecone, Weaviate, Qdrant) |
| Embedding | Dense numerical vector representation of data (text, images, audio) capturing semantic meaning |
| RAG | Retrieval-Augmented Generation — LLM pattern that retrieves relevant context from a knowledge base before generating responses (see the sketch after this table) |
| ANN | Approximate Nearest Neighbor — algorithms for finding similar vectors efficiently (HNSW, IVF) |
| HNSW | Hierarchical Navigable Small World — graph-based algorithm for fast ANN search in vector databases |
| Cosine Similarity | Metric measuring the angle between two vectors; common similarity measure for embeddings |
| Fine-Tuning | Adapting a pre-trained LLM to a specific task by training on domain-specific data |
| Prompt Engineering | Designing effective prompts to get desired outputs from LLMs without changing model weights |
| LLM | Large Language Model — deep learning model trained on massive text datasets for NLP tasks (GPT-4, Claude) |
| Foundation Model | Large pre-trained model serving as a base that can be fine-tuned for specific tasks |
| Open Lakehouse | Lakehouse built on open standards (Iceberg, Delta, Parquet) without vendor lock-in |
| Streaming Lakehouse | Real-time streaming ingestion and processing directly on lakehouse table formats |
| Knowledge Fabric | Evolution of data fabric incorporating semantic knowledge graphs and ontologies |
| Semantic Data Fabric | Data fabric with rich semantic metadata enabling context-aware data discovery and access |
| Augmented Data Management | Using AI/ML to automate metadata management, data quality, and governance tasks |
| Generative AI for Data | Using LLMs for automated SQL generation, data documentation, anomaly explanation, and data exploration |
| Text-to-SQL | LLM capability to convert natural language questions into executable SQL queries |
| Data Contract (modern) | Machine-readable schema + quality + SLA agreement between producers and consumers; versioned and enforced |
| Open Data Contract Standard | ODCS — open specification for data contracts enabling interoperability between governance tools |
| Soda Core | Open-source data quality CLI tool supporting data contracts and checks |
| Streaming SQL | Writing streaming data pipelines using familiar SQL syntax (Flink SQL, ksqlDB, Apache Calcite) |
| ksqlDB | SQL streaming engine built on top of Apache Kafka |
| Materialize | Streaming database maintaining incrementally-updated materialized views from streaming sources |
| RisingWave | Cloud-native streaming database with PostgreSQL-compatible SQL |
| Unified Data Platform | Single platform combining data engineering, warehousing, analytics, and ML (Databricks, Snowflake) |
| Iceberg REST Catalog | Open REST API specification for interacting with Iceberg catalogs; enables multi-engine access |
| Apache Polaris | Open-source Iceberg catalog supporting the Iceberg REST Catalog spec |
| Apache Gravitino | Open-source unified metadata lake for managing metadata across heterogeneous data sources |
| Nessie | Open-source catalog with Git-like version control for Iceberg and Delta tables |
| Table Format Wars | Industry competition between Delta Lake, Apache Iceberg, and Apache Hudi for open table format dominance |
| UniForm | Databricks feature allowing Delta Lake tables to be read as Iceberg or Hudi tables |
| Delta Sharing | Open protocol for securely sharing live data across organizations without copying |
| Apache Arrow | In-memory columnar data format enabling zero-copy reads across different runtimes |
| Apache Arrow Flight | RPC framework for high-speed data transfer using Arrow format |
| ADBC | Arrow Database Connectivity — Arrow-native replacement for JDBC/ODBC for analytical workloads |
| Ibis | Python dataframe library with multiple backends (DuckDB, BigQuery, Spark) using the same API |
| DuckDB | Embeddable in-process OLAP database; fast analytics on local files without a server |
| MotherDuck | Serverless cloud DuckDB with collaboration features |
| Data Clean Room | Secure environment where multiple parties can analyze combined data without exposing raw data |
| Privacy-Enhancing Technologies | PETs — techniques (differential privacy, homomorphic encryption, secure MPC) for analyzing sensitive data safely |
| Differential Privacy | Mathematical framework adding calibrated noise to query results to protect individual privacy |
| Federated Learning | ML training across distributed data without centralizing the data; preserves privacy |
| Synthetic Data | Artificially generated data statistically similar to real data; used for testing and privacy compliance |
| Data Tokenomics | Emerging concept of data as an economic asset with pricing, ownership, and exchange mechanisms |
| DataOps Manifesto | Published principles for applying DevOps practices to data management |
| Continuous Integration (Data) | Automatically testing data pipeline code changes and data quality before merging to production |
| Continuous Delivery (Data) | Automatically deploying validated data pipeline changes to production environments |
| Infrastructure as Code (Data) | Managing data infrastructure, pipelines, and schemas through version-controlled code |
| Data Version Control | DVC — open-source tool for versioning datasets and ML models alongside code |
| dbt Cloud | Managed cloud service for running dbt transformations with scheduling and CI/CD |
| Git-based Workflows | Managing data pipeline code with Git for version control, branching, and code review |
| Blue-Green Deployment (Data) | Maintaining two identical environments to enable zero-downtime pipeline deployments |
| Canary Deployment (Data) | Gradually rolling out pipeline changes to a subset of users/data before full rollout |
| Schema Migration | Versioned, automated process for evolving database schemas in a controlled way |
| Data Incident | Event where data quality, availability, or security degrades below acceptable thresholds |
| SRE for Data | Applying Site Reliability Engineering principles to data systems for improved reliability |
| Data Engineering | Discipline focused on designing, building, and maintaining data infrastructure and pipelines |
| Analytics Engineering | Discipline at the intersection of data engineering and analysis; owns the transformation layer (dbt) |
| Data Platform Engineering | Building self-serve internal data infrastructure and tooling for data teams |
| Data as Code | Treating data assets (schemas, pipelines, quality rules) with software engineering rigor |
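
The RAG, embedding, and cosine-similarity entries fit together as a retrieval step in front of an LLM. The toy sketch below uses a seeded random vector as a stand-in for a real embedding model, so the ranking itself is meaningless; it only illustrates the shape of the pattern.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=8)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieval: rank candidate documents by similarity to the question vector.
docs = ["Bronze layer holds raw data", "SCD Type 2 keeps full history"]
doc_vecs = [embed(d) for d in docs]
question = "How is dimension history kept?"
q_vec = embed(question)
best = max(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]))

# Augmentation: the retrieved context is prepended to the prompt sent to the LLM.
prompt = f"Context: {docs[best]}\n\nQuestion: {question}"
print(prompt)
```

In a real pipeline the embeddings come from an embedding model and live in a vector database, with ANN indexes such as HNSW doing the ranking at scale.
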