# Data Architect Glossary
Exhaustive reference of terms, acronyms, and concepts across all Data Architecture domains.
## 1. Storage & Architecture Patterns
| Term | Full Form / Description |
|---|---|
| EDW | Enterprise Data Warehouse — centralized repository for integrated, historical data from across the organization, optimized for reporting and analytics |
| ODS | Operational Data Store — near-real-time integration layer that consolidates data from source systems for operational reporting; not a replacement for EDW |
| Data Lake | Storage repository holding vast amounts of raw data in its native format (structured, semi-structured, unstructured) until needed; schema-on-read |
| Data Lakehouse | Architecture combining the low-cost storage of a data lake with the structure and ACID transactions of a data warehouse (e.g., Delta Lake, Apache Iceberg) |
| Data Warehouse | Subject-oriented, integrated, time-variant, non-volatile collection of data in support of management decisions (Inmon's definition) |
| Data Mart | A subset of a data warehouse focused on a single subject area or business unit (e.g., Sales Mart, Finance Mart) |
| Data Hub | Central integration point for data exchange between systems; often includes a metadata registry and routing logic |
| Data Mesh | Decentralized sociotechnical architecture where domain teams own and serve their data as products, with federated governance |
| Data Fabric | Architecture that provides consistent, automated data management across hybrid and multi-cloud environments using metadata and AI |
| Lambda Architecture | Hybrid batch + speed processing architecture; combines a batch layer (high latency, accurate) with a speed layer (low latency, approximate) |
| Kappa Architecture | Simplified streaming-only architecture that eliminates the batch layer; all data processed as streams |
| Medallion Architecture | Layered data organization: Bronze (raw), Silver (cleaned/conformed), Gold (business-level aggregates); popularized by Databricks (see the sketch after this table) |
| Bronze Layer | Raw ingestion layer in medallion architecture; data lands as-is from source systems |
| Silver Layer | Cleaned, deduplicated, and conformed data layer; applies data quality rules and joins |
| Gold Layer | Business-level aggregates and curated datasets ready for BI consumption |
| HTAP | Hybrid Transactional/Analytical Processing — systems that support both OLTP and OLAP workloads simultaneously |
| OLTP | Online Transactional Processing — systems optimized for high-volume, low-latency read/write operations (e.g., order entry systems) |
| MPP | Massively Parallel Processing — distributes query execution across many nodes simultaneously; used in Redshift, Snowflake, BigQuery |
| SMP | Symmetric Multiprocessing — multiple processors sharing a single memory space; limits scalability compared to MPP |
| Shared-Nothing Architecture | Each node in a cluster has its own CPU, memory, and disk; no resource contention; basis for most MPP systems |
| Shared-Disk Architecture | Nodes share a common storage layer but have independent CPUs/memory (e.g., Oracle RAC); cloud warehouses such as Snowflake combine shared storage with shared-nothing compute clusters |
| Data Vault | Modeling methodology for enterprise data warehouses; highly scalable, auditable, and adaptable (see Data Modeling section) |
| Operational Analytics | Running analytics directly on operational/transactional data with minimal latency |
| Logical Data Warehouse | Virtual data warehouse layer that federates queries across multiple physical data stores without moving data |
| Virtual Data Warehouse | Query federation layer that presents a unified schema across disparate sources |
| Data Virtualization | Technology that provides real-time, unified data access across disparate systems without physical data movement |
| CDP | Customer Data Platform — system that creates a unified, persistent customer database accessible to other systems |
| MDM | Master Data Management — discipline for defining and managing critical data assets (customers, products, locations) to ensure single authoritative version |
| Reference Data | Data that defines valid values for other data fields (e.g., country codes, currency codes, status values) |
| Landing Zone | Initial staging area where raw data is deposited before processing; often synonymous with Bronze layer |
| Staging Area | Temporary storage area used during ETL for intermediate transformations before loading to target |
| Cold Storage | Low-cost, high-latency storage tier for infrequently accessed archival data |
| Hot Storage | High-cost, low-latency storage tier for frequently accessed, active data |
| Warm Storage | Middle-tier storage with moderate cost and access latency |
| Polyglot Persistence | Using multiple database technologies (relational, document, graph, key-value) best suited to each workload within one system |
| Event Store | Append-only log of all state-changing events in a system; basis for event sourcing pattern |
| Time-Series Database | Optimized for storing and querying timestamped data points (e.g., InfluxDB, TimescaleDB) |
| Graph Database | Stores data as nodes and edges to represent and traverse relationships (e.g., Neo4j, Amazon Neptune) |
| Document Store | Stores data as JSON/BSON documents with flexible schema (e.g., MongoDB, Couchbase) |
| Key-Value Store | Simplest NoSQL model; stores data as key-value pairs (e.g., Redis, DynamoDB) |
| Wide-Column Store | Stores data in rows with dynamic columns; suited for time-series and sparse data (e.g., Cassandra, HBase) |
| NewSQL | Databases providing ACID guarantees of traditional RDBMS with horizontal scalability of NoSQL (e.g., CockroachDB, Spanner) |
| In-Memory Database | Stores data primarily in RAM for ultra-low latency (e.g., Redis, SAP HANA, SingleStore (formerly MemSQL)) |
| Data Swamp | A poorly governed data lake where data quality, lineage, and discoverability have broken down |
| Unified Namespace | Single logical namespace that abstracts all data sources for query purposes |
| Reverse ETL | Process of syncing data from a data warehouse back into operational systems (CRM, marketing tools) |
| Zero-Copy Cloning | Creating a metadata-only copy of a dataset without duplicating physical storage (Snowflake feature) |
| Data Sharing | Capability to share live data across accounts or organizations without copying (Snowflake, Databricks Delta Sharing) |
| External Table | Table definition pointing to data stored outside the database (e.g., files in S3) |
| Federated Query | Querying data across multiple heterogeneous systems from a single SQL interface |
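
The Medallion Architecture entries above describe a layered flow from raw to curated data. Below is a minimal sketch of that Bronze/Silver/Gold progression using PySpark with Delta Lake; the Spark session configuration, storage paths, and column names are illustrative assumptions, not prescribed by any particular platform.

```python
from pyspark.sql import SparkSession, functions as F

# Assumes a Spark session with the Delta Lake extension available.
spark = SparkSession.builder.appName("medallion-sketch").getOrCreate()

# Bronze: land raw source data as-is (hypothetical landing path).
bronze = spark.read.json("s3://landing/orders/")
bronze.write.format("delta").mode("append").save("/lake/bronze/orders")

# Silver: clean, deduplicate, and conform the raw data.
silver = (
    spark.read.format("delta").load("/lake/bronze/orders")
    .dropDuplicates(["order_id"])
    .filter(F.col("order_amount") > 0)
    .withColumn("order_date", F.to_date("order_ts"))
)
silver.write.format("delta").mode("overwrite").save("/lake/silver/orders")

# Gold: business-level aggregate ready for BI consumption.
gold = silver.groupBy("order_date").agg(F.sum("order_amount").alias("daily_revenue"))
gold.write.format("delta").mode("overwrite").save("/lake/gold/daily_revenue")
```

Each layer is written as its own Delta table, so downstream consumers can pick the level of refinement they need.
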
## 2. Data Modeling
| Term | Full Form / Description |
|---|---|
| Kimball | Ralph Kimball's dimensional modeling methodology; centers on fact and dimension tables optimized for business user queries |
| Inmon | Bill Inmon's enterprise data warehouse approach; normalized 3NF central warehouse with subject-area data marts fed from it |
| Data Vault 2.0 | DV2.0 — Dan Linstedt's hybrid modeling approach with Hubs (business keys), Links (relationships), and Satellites (context); agile and auditable |
| Anchor Modeling | Ultra-normalized modeling approach using anchors, attributes, ties, and knots; handles change well but complex to query |
| Activity Schema | Modern analytics modeling pattern organizing all data around a single entity timeline |
| Dimensional Model | Fact-and-dimension schema design for analytical databases; optimized for query performance and business usability |
| Fact Table | Central table in a dimensional model storing numeric measures and foreign keys to dimensions (e.g., Sales Fact) |
| Dimension Table | Descriptive attributes table providing context for facts (e.g., Date, Customer, Product dimensions) |
| Star Schema | Dimensional model with a central fact table surrounded by denormalized dimension tables; simple and fast to query |
| Snowflake Schema | Normalized variant of star schema where dimension tables are further normalized into sub-dimensions |
| Galaxy Schema | Multiple fact tables sharing dimension tables; also called Fact Constellation |
| Bridge Table | Resolves many-to-many relationships between fact and dimension tables |
| Degenerate Dimension | Dimension attribute stored directly in the fact table rather than a separate dimension table (e.g., invoice number) |
| Junk Dimension | Combines low-cardinality miscellaneous flags/indicators into a single dimension to reduce fact table width |
| Role-Playing Dimension | Single dimension table used multiple times in a fact table for different purposes (e.g., Date as Order Date, Ship Date) |
| Outrigger | Secondary dimension table attached to a primary dimension table (not directly to fact); used sparingly |
| Conformed Dimension | Dimension shared across multiple fact tables or data marts with consistent meaning and values |
| Conformed Fact | Fact measure with consistent definition and granularity across data marts |
| Grain | The level of detail represented by a single row in a fact table; must be declared before design begins |
| Factless Fact Table | Fact table with no numeric measures; captures events or coverage relationships (e.g., student enrollment) |
| Accumulating Snapshot | Fact table pattern tracking the lifecycle of a process with multiple date stamps updated as milestones are reached |
| Periodic Snapshot | Fact table capturing state at regular intervals (daily, weekly, monthly) |
| Transaction Fact Table | Records individual business events or transactions at the lowest grain |
| SCD | Slowly Changing Dimension — technique for managing changes to dimension attributes over time |
| SCD Type 0 | Dimension attributes never change; historical value is retained forever |
| SCD Type 1 | Overwrite old value; no history kept; current state only |
| SCD Type 2 | Add a new row for each change; full history preserved with effective dates and current flag (see the sketch after this table) |
| SCD Type 3 | Add a new column for the previous value; limited history (typically tracks only one prior value) |
| SCD Type 4 | Separate history table: current values stay in the main dimension table while full change history is kept in a dedicated history table; a related variant splits rapidly changing attributes into a mini-dimension |
| SCD Type 6 | Hybrid combining Types 1, 2, and 3; adds current value column to Type 2 rows |
| Hub | Data Vault component storing a unique list of business keys with metadata (load date, source) |
| Link | Data Vault component capturing relationships between two or more Hubs |
| Satellite | Data Vault component storing descriptive context and history for a Hub or Link |
| PIT Table | Point-in-Time table in Data Vault — pre-joins satellites at specific snapshots for query performance |
| Bridge Table (DV) | Data Vault construct that pre-joins a chain of links and hubs for performance |
| Business Vault | Data Vault layer containing business rules and calculations applied to Raw Vault data |
| Raw Vault | Data Vault layer containing data as-received from sources, no business rules applied |
| 3NF | Third Normal Form — relational design in which every non-key attribute depends on the key, the whole key, and nothing but the key (no transitive dependencies); reduces redundancy |
| 1NF | First Normal Form — all column values are atomic (indivisible), no repeating groups |
| 2NF | Second Normal Form — 1NF plus all non-key attributes fully depend on the entire composite primary key |
| BCNF | Boyce-Codd Normal Form — stronger version of 3NF; every determinant is a candidate key |
| Denormalization | Intentionally adding redundancy to a normalized schema to improve read query performance |
| ERD | Entity-Relationship Diagram — visual representation of entities and their relationships in a data model |
| Conceptual Model | High-level model showing key entities and relationships; no technical detail; used for stakeholder communication |
| Logical Model | Detailed data model with entities, attributes, and relationships; technology-agnostic |
| Physical Model | Implementation-specific model with tables, columns, data types, indexes, and constraints |
| Surrogate Key | System-generated artificial primary key (integer or UUID) assigned to dimension rows |
| Natural Key | Business-assigned identifier that has meaning outside the database (e.g., customer ID, SSN) |
| Composite Key | Primary key made up of two or more columns |
| Foreign Key | Column(s) referencing the primary key of another table to enforce referential integrity |
| Business Key | Identifier used in the business domain to uniquely identify an entity; basis for Data Vault Hubs |
| Cardinality | The number of unique values in a column; also describes relationship types (1:1, 1:M, M:N) |
| Granularity | Level of detail in a dataset; fine grain = more rows, each representing a smaller unit of measurement |
| Normalization | Process of organizing data to reduce redundancy and improve data integrity |
| Data Type | Classification of a column's values (INTEGER, VARCHAR, DATE, BOOLEAN, etc.) |
| Null Handling | How missing/unknown values are represented and treated in queries and aggregations |
| Temporal Table | Table that automatically tracks row history with valid-time or transaction-time columns; ISO SQL:2011 standard |
| Bi-Temporal Modeling | Tracking both valid time (when something was true in reality) and transaction time (when it was recorded) |
| Polymorphic Association | Single table stores relationships to multiple entity types; common anti-pattern in relational modeling |
| Anti-Pattern | A modeling or design choice that seems reasonable but causes problems (e.g., EAV for structured data) |
| EAV | Entity-Attribute-Value — stores data as rows of key-value pairs; flexible but hard to query and validate |
| Wide Table | Denormalized table with many columns; common in analytics/columnar stores for query performance |
| Schema Evolution | The ability to change a data schema (add/remove columns) without breaking existing consumers |
| Semantic Key | A meaningful business key embedded in a surrogate; combines auditability and performance |
| Hash Key | MD5 or SHA-1 hash of business key fields used as surrogate in Data Vault for deterministic, parallel loading |
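
As referenced in the SCD Type 2 entry, the pattern expires the superseded current row and appends a new current row for each change. Below is a minimal pandas sketch of that logic, assuming a dimension table with `valid_from`, `valid_to`, and `is_current` housekeeping columns; the key and attribute names are placeholders.

```python
import pandas as pd

def apply_scd2(dim: pd.DataFrame, updates: pd.DataFrame, key: str,
               attrs: list, as_of: pd.Timestamp) -> pd.DataFrame:
    """Return a new dimension frame with Type 2 history applied."""
    current = dim[dim["is_current"]]
    cmp = updates.merge(current[[key] + attrs], on=key, how="left",
                        suffixes=("", "_cur"))

    # A row counts as changed if any tracked attribute differs from the current
    # version (brand-new keys compare against NaN and are therefore included).
    changed_mask = pd.Series(False, index=cmp.index)
    for a in attrs:
        changed_mask |= cmp[a].ne(cmp[f"{a}_cur"])
    changed = cmp.loc[changed_mask, [key] + attrs]

    # Expire the superseded rows: close valid_to and clear the current flag.
    dim = dim.copy()
    expire = dim["is_current"] & dim[key].isin(changed[key])
    dim.loc[expire, "valid_to"] = as_of
    dim.loc[expire, "is_current"] = False

    # Append the new current versions with open-ended validity.
    new_rows = changed.assign(valid_from=as_of, valid_to=pd.NaT, is_current=True)
    return pd.concat([dim, new_rows], ignore_index=True)
```

In a warehouse this is typically expressed as a MERGE statement plus an insert of the new current rows rather than dataframe logic.
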
## 3. Integration & Processing
| Term | Full Form / Description |
|---|---|
| ETL | Extract, Transform, Load — data integration pattern where transformation occurs before loading into target |
| ELT | Extract, Load, Transform — data is loaded raw into target system and transformed there using its compute power |
| CDC | Change Data Capture — technique for identifying and capturing data changes in source systems (insert, update, delete) |
| Log-Based CDC | CDC using database transaction logs (WAL, binlog) to capture changes without impacting source performance |
| Query-Based CDC | CDC using timestamps or watermarks in SQL queries to detect changed rows; higher source impact |
| Trigger-Based CDC | CDC using database triggers to capture changes; high overhead, generally avoided |
| Batch Processing | Processing data in discrete, scheduled chunks; high latency, high throughput |
| Micro-Batch | Near-real-time processing of small batches at short intervals (e.g., Spark Structured Streaming) |
| Stream Processing | Continuous processing of data as it arrives with very low latency (e.g., Apache Flink, Kafka Streams) |
| Real-Time Processing | Data processing with sub-second latency; used for alerts, fraud detection, live dashboards |
| Near-Real-Time | Processing with latency of seconds to minutes; acceptable for many operational analytics use cases |
| Data Pipeline | Series of processing steps that move and transform data from source to destination |
| Ingestion | Process of bringing data into a storage system from external sources |
| Message Queue | Asynchronous communication buffer between systems (e.g., RabbitMQ, Amazon SQS) |
| Event Streaming | Durable, ordered log of events accessible for replay by multiple consumers (e.g., Apache Kafka) |
| Kafka | Apache Kafka — distributed event streaming platform; uses topics, partitions, producers, and consumers |
| Pub/Sub | Publish-Subscribe messaging pattern where producers publish to topics and consumers subscribe independently |
| Topic | Named channel in a messaging system where events are published and consumed |
| Partition | Subdivision of a Kafka topic for parallelism and ordering guarantees |
| Consumer Group | Set of consumers that collectively read all partitions of a topic; enables parallel consumption |
| Exactly-Once Semantics | Processing guarantee that each message is processed exactly once, even in failure scenarios |
| At-Least-Once | Processing guarantee that a message will be processed at minimum once; duplicates possible |
| At-Most-Once | Processing guarantee that a message is processed no more than once; data loss possible |
| Idempotency | Property of an operation that can be applied multiple times without changing the result beyond the first application |
| Backpressure | Mechanism for a downstream system to signal an upstream system to slow data production |
| Watermark | In stream processing, a threshold indicating how late events can arrive and still be included in a window |
| Event Time | The time when an event actually occurred in the real world |
| Processing Time | The time when an event is processed by the stream processor |
| Tumbling Window | Fixed-size, non-overlapping time window for stream aggregations |
| Sliding Window | Overlapping time windows that advance by a step smaller than the window size |
| Session Window | Dynamic window that groups events within a period of activity, closing after a gap of inactivity |
| Late Arriving Data | Events that arrive after the expected processing window; requires special handling strategies |
| Upsert | Operation that inserts a new record or updates an existing one based on a key match (UPDATE + INSERT) |
| Merge | SQL/DML operation combining INSERT, UPDATE, and DELETE in a single statement based on match conditions |
| Full Refresh | Loading strategy that truncates and reloads an entire table; simple but expensive for large datasets |
| Incremental Load | Loading only new or changed records since the last extraction; requires reliable watermarking (see the sketch after this table) |
| Delta Load | Synonym for incremental load; loading only the "delta" (changes) since the last run |
| Data Replication | Copying data from one system to another to ensure availability, redundancy, or geographic distribution |
| Webhook | HTTP callback that pushes data to a URL when an event occurs; event-driven integration pattern |
| API Integration | Connecting systems via REST, SOAP, or GraphQL APIs to exchange data |
| REST | Representational State Transfer — stateless HTTP-based API design style; uses GET, POST, PUT, DELETE |
| GraphQL | Query language for APIs that allows clients to request exactly the data they need |
| gRPC | Google Remote Procedure Call — high-performance RPC framework using Protocol Buffers |
| Protocol Buffers | Protobuf — Google's binary serialization format; more efficient than JSON/XML |
| Avro | Apache Avro — compact binary data serialization format with schema evolution support; common in Kafka |
| Parquet | Apache Parquet — columnar binary storage format; efficient for analytical queries; default in many lake formats |
| ORC | Optimized Row Columnar — columnar format developed for Hive; includes built-in indexes and statistics |
| JSON | JavaScript Object Notation — lightweight, human-readable data interchange format |
| CSV | Comma-Separated Values — plain-text tabular format; ubiquitous but lacks schema enforcement |
| XML | Extensible Markup Language — hierarchical text-based format; verbose but self-describing |
| DAG | Directed Acyclic Graph — representation of pipeline dependencies; used by Airflow, dbt, and other orchestrators |
| Orchestration | Coordinating and scheduling the execution of pipeline tasks and dependencies |
| Airflow | Apache Airflow — open-source workflow orchestration platform using Python DAGs |
| Prefect | Modern Python-native orchestration platform with dynamic workflows |
| Dagster | Data-aware orchestration platform with built-in asset tracking and lineage |
| dbt | Data Build Tool — SQL-based transformation framework for the ELT pattern; version-controls SQL models |
| Fivetran | Managed connector service for automated data ingestion from SaaS and database sources |
| Airbyte | Open-source data integration platform with a large connector catalog |
| Singer | Open-source data integration specification using taps (sources) and targets (destinations) |
| Debezium | Open-source CDC platform that captures database changes via transaction logs |
| Flink | Apache Flink — distributed stream processing framework with exactly-once guarantees |
| Spark | Apache Spark — unified analytics engine for large-scale batch and streaming data processing |
| Spark Streaming | Legacy DStream-based micro-batch streaming API in Apache Spark |
| Structured Streaming | Spark's newer streaming API built on Spark SQL/DataFrames; micro-batch by default, with an experimental continuous mode |
| Beam | Apache Beam — unified programming model for batch and streaming pipelines; runs on multiple runners |
| Dataflow | Google Cloud Dataflow — managed Apache Beam service |
| Kinesis | Amazon Kinesis — managed real-time data streaming service |
| Event Hub | Azure Event Hubs — fully managed event ingestion service; Kafka-compatible |
| Pub/Sub (GCP) | Google Cloud Pub/Sub — managed messaging and ingestion service |
| SQS | Amazon Simple Queue Service — managed message queuing service |
| SNS | Amazon Simple Notification Service — managed pub/sub messaging for fan-out patterns |
| Data Contracts | Formal agreement between data producers and consumers defining schema, quality, and SLA expectations |
| Schema-on-Read | Schema is applied when data is read, not when it is written; enables flexible ingestion |
| Schema-on-Write | Schema is enforced when data is written; ensures data quality at the point of ingestion |
| Data Serialization | Converting data structures to a format suitable for storage or transmission |
| Compression | Reducing data size using algorithms (GZIP, Snappy, LZ4, ZSTD); critical for storage cost and I/O performance |
| Partitioning | Dividing data into logical segments (by date, region, etc.) to improve query performance and data management |
| Bucketing | Sub-partitioning data into fixed buckets by hash of a column; improves join and aggregation performance |
| Z-Ordering | Multi-dimensional clustering technique (Delta Lake) that co-locates related data to reduce query I/O |
| Clustering | Physically organizing data on disk by one or more columns to improve range query performance |
| Predicate Pushdown | Filtering applied at the storage layer before data reaches the compute layer; reduces the amount of data scanned |
| Data Skew | Uneven distribution of data across partitions causing some tasks to run much longer than others |
| Shuffle | Redistribution of data across partitions during operations like joins and aggregations in distributed systems |
| Broadcast Join | Join optimization where a small table is replicated to all nodes to avoid shuffling a large table |
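
The incremental-load and watermark entries above describe extracting only what changed since the last run and applying it safely. The sketch below shows that pattern with an idempotent upsert, using in-memory pandas frames as stand-ins for the source and target tables; the `order_id` and `updated_at` columns are illustrative.

```python
import pandas as pd

def incremental_load(source: pd.DataFrame, target: pd.DataFrame,
                     last_watermark: pd.Timestamp, key: str = "order_id"):
    """Watermark-driven incremental load with an idempotent upsert."""
    # Extract: only rows changed since the previous run (the watermark).
    changed = source[source["updated_at"] > last_watermark]
    if changed.empty:
        return target, last_watermark          # nothing new; keep the old watermark

    # Load (upsert): rows with an existing key replace the old version, new keys append.
    merged = (pd.concat([target, changed])
                .drop_duplicates(subset=[key], keep="last")
                .reset_index(drop=True))

    # Advance the watermark so the next run picks up only newer changes.
    return merged, changed["updated_at"].max()
```

Re-running the function with the same inputs leaves the target unchanged, which is the idempotency property described above.
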
## 4. Metadata & Data Governance
| Term | Full Form / Description |
|---|---|
| DAMA | Data Management Association — professional organization that publishes the DMBOK framework |
| DMBOK | Data Management Body of Knowledge — DAMA's comprehensive framework for data management disciplines |
| Data Governance | Framework of policies, processes, standards, and roles that ensure data is managed as a strategic asset |
| Data Steward | Person responsible for the quality and fitness of a specific data domain or dataset |
| Data Owner | Business executive accountable for the quality, security, and appropriate use of a data asset |
| Data Custodian | IT role responsible for technical management and storage of data assets |
| Data Catalog | Metadata repository providing searchable inventory of data assets with descriptions, lineage, and quality info |
| Business Glossary | Curated dictionary of business terms with agreed definitions; foundation for data governance |
| Data Dictionary | Technical documentation of datasets, tables, and columns including types, constraints, and descriptions |
| Data Lineage | End-to-end tracking of data origin, movement, transformations, and consumption |
| Impact Analysis | Using lineage to assess downstream effects of a proposed schema or pipeline change |
| Active Metadata | Metadata that drives automated decisions and actions (e.g., triggering quality checks, routing data) |
| Passive Metadata | Metadata used for documentation and discovery but not for automation |
| Technical Metadata | Information about data structure, format, storage, and access (table schemas, file sizes, partitions) |
| Business Metadata | Context describing business meaning, ownership, and usage policies |
| Operational Metadata | Information about data pipelines, job runs, data volumes, and processing history |
| Data Classification | Categorizing data by sensitivity level (Public, Internal, Confidential, Restricted) |
| PII | Personally Identifiable Information — any data that can identify a specific individual (name, SSN, email) |
| PHI | Protected Health Information — health data protected under HIPAA regulations |
| PCI DSS | Payment Card Industry Data Security Standard — security standard for handling cardholder data |
| GDPR | General Data Protection Regulation — EU regulation governing personal data collection and processing |
| CCPA | California Consumer Privacy Act — California law giving consumers rights over personal data |
| HIPAA | Health Insurance Portability and Accountability Act — US law protecting medical information |
| Data Residency | Requirement that data be stored and processed within specific geographic boundaries |
| Data Sovereignty | Legal concept that data is subject to the laws of the country where it is collected or stored |
| Right to be Forgotten | GDPR right allowing individuals to request deletion of their personal data |
| Data Minimization | Principle of collecting only the minimum data necessary for a stated purpose |
| Purpose Limitation | Principle that data collected for one purpose should not be used for a different purpose |
| Consent Management | Systems and processes for capturing, storing, and enforcing user consent for data processing |
| Data Retention Policy | Rules governing how long data is kept before archival or deletion |
| Data Lifecycle Management | DLM — governing data from creation through archival and deletion |
| Data Access Control | Policies and mechanisms controlling who can read, write, or modify specific data assets |
| RBAC | Role-Based Access Control — granting data access based on user roles rather than individual identities |
| ABAC | Attribute-Based Access Control — access decisions based on attributes of users, resources, and environment (contrasted with RBAC in the sketch after this table) |
| Column-Level Security | Restricting access to specific columns in a table based on user role or attribute |
| Row-Level Security | RLS — filtering rows returned to a user based on their identity or role |
| Dynamic Data Masking | Masking sensitive data in query results without changing the stored data |
| Tokenization | Replacing sensitive data values with non-sensitive tokens; original value stored in a secure vault |
| Encryption at Rest | Encrypting data while stored on disk; protects against physical media theft |
| Encryption in Transit | Encrypting data as it moves over networks (TLS/SSL); protects against interception |
| Key Management | Managing cryptographic keys for encryption; critical for key rotation and access control |
| Audit Trail | Immutable log of all data access and modification events for compliance and forensic purposes |
| Data Trust Score | Metric quantifying the reliability and quality of a dataset based on multiple quality dimensions |
| Data Policy | Formal rules governing data collection, use, storage, sharing, and disposal |
| Data Standard | Agreed specifications for data formats, definitions, coding, and quality rules |
| Interoperability | Ability of different systems to exchange and use data without special integration effort |
| Data Portability | Ability to transfer data from one system to another in a usable format |
| Metadata Management | Discipline for collecting, storing, maintaining, and using metadata to improve data usability |
| Apache Atlas | Open-source metadata management and governance platform for Hadoop ecosystems |
| Collibra | Enterprise data governance and catalog platform |
| Alation | AI-based data catalog and governance platform |
| Atlan | Modern collaborative data catalog with active metadata capabilities |
| DataHub | LinkedIn's open-source metadata platform for data discovery and lineage |
| OpenMetadata | Open-source metadata platform and data catalog |
| Unity Catalog | Databricks' unified governance layer for data and AI assets |
| Informatica IDMC | Informatica Intelligent Data Management Cloud — enterprise data management platform |
| Waterline Data | AI-driven data discovery and cataloging tool (acquired by Hitachi Vantara) |
| Data Contract | Formal schema and quality agreement between data producers and consumers |
| Schema Registry | Central repository for managing and versioning event schemas (Confluent, AWS Glue) |
| Data Mesh Governance | Federated computational governance in Data Mesh; policies enforced via automated platforms, not central teams |
| FAIR Principles | Findable, Accessible, Interoperable, Reusable — principles for scientific data management |
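
The RBAC and ABAC entries differ in what drives the access decision: the user's role versus attributes of the user, the resource, and the environment. The toy functions below contrast the two; the roles, grants, and policy rule are invented purely for illustration.

```python
# Role-based: the grant depends only on the caller's role.
ROLE_GRANTS = {"analyst": {"read"}, "steward": {"read", "update"}}

def rbac_allows(role: str, action: str) -> bool:
    return action in ROLE_GRANTS.get(role, set())

# Attribute-based: the decision combines user, resource, and action attributes.
def abac_allows(user: dict, resource: dict, action: str) -> bool:
    # Example policy: anyone may read non-restricted data from their own region.
    return (
        action == "read"
        and resource["classification"] != "Restricted"
        and user["region"] == resource["region"]
    )

print(rbac_allows("analyst", "read"))   # True
print(abac_allows({"region": "EU"},
                  {"classification": "Internal", "region": "EU"}, "read"))  # True
```

In a warehouse the same ideas surface as role grants (RBAC) and as row-level security or masking policies driven by user and data attributes (ABAC-style rules).
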
## 5. Semantic Layer & Ontology
| Term | Full Form / Description |
|---|---|
| Semantic Layer | Abstraction that translates business terms into technical data queries; sits between data and BI tools |
| Metric Layer | Centralized definition of business metrics (revenue, churn) ensuring consistent calculation everywhere |
| Headless BI | Metric definitions and business logic decoupled from any specific BI tool; accessible via API |
| Metrics Store | Repository of defined, versioned, and governed business metrics (e.g., dbt Metrics, Cube.js) |
| Ontology | Formal representation of knowledge including concepts, categories, and relationships in a domain |
| Knowledge Graph | Graph structure that represents real-world entities and semantic relationships between them |
| RDF | Resource Description Framework — W3C standard for representing information as subject-predicate-object triples |
| RDFS | RDF Schema — vocabulary extension for RDF providing class and property hierarchies |
| OWL | Web Ontology Language — W3C standard for creating rich ontologies on top of RDF |
| SPARQL | SPARQL Protocol and RDF Query Language — query language for RDF data stores |
| Triple | Atomic unit of RDF data: subject (entity) + predicate (relationship) + object (entity or value); see the sketch after this table |
| Named Graph | RDF graph with an associated URI; allows managing and querying subsets of a triple store |
| Linked Data | Practice of using RDF and HTTP URIs to publish and connect structured data on the web |
| SKOS | Simple Knowledge Organization System — W3C standard for expressing controlled vocabularies |
| Taxonomy | Hierarchical classification system for organizing concepts into parent-child relationships |
| Thesaurus | Vocabulary of preferred terms with synonyms, broader/narrower terms, and related terms |
| Controlled Vocabulary | Standardized set of terms used for consistent tagging and classification |
| URI | Uniform Resource Identifier — globally unique identifier for resources in linked data and the web |
| Property Graph | Graph model where nodes and edges can have properties; used by Neo4j, TigerGraph |
| Cypher | Declarative query language for property graphs; used by Neo4j |
| Gremlin | Graph traversal language for property graphs; part of Apache TinkerPop |
| Triple Store | Database optimized for storing and querying RDF triples (e.g., Stardog, GraphDB, Amazon Neptune) |
| Inference / Reasoning | Deriving new facts from existing knowledge using ontology rules (e.g., OWL reasoning) |
| Semantic Search | Search that understands the meaning and context of queries rather than just keyword matching |
| Entity Resolution | Process of identifying when records from different sources refer to the same real-world entity |
| Entity Extraction | NLP technique for identifying named entities (people, places, organizations) in unstructured text |
| Knowledge Representation | Formal encoding of domain knowledge for automated reasoning |
| dbt Semantic Layer | dbt's implementation of a metric layer allowing consistent metric definitions across BI tools |
| LookML | Looker's proprietary data modeling language for defining metrics, dimensions, and explores |
| AtScale | Semantic layer and universal data model platform |
| Cube.js | Open-source analytical API platform with a metric/semantic layer |
| Superset Semantic Layer | Apache Superset's virtual dataset layer for metric definitions |
| Open Metadata Standards | Efforts like OpenLineage, OpenMetadata, and dbt Contracts to standardize metadata exchange |
| Conceptual Schema | Technology-agnostic representation of business concepts and their relationships |
| Domain Model | Model representing concepts, relationships, and rules within a specific business domain |
| Fact vs. Dimension (semantic) | Semantic layer concepts mapping to numerical metrics (facts) vs. descriptive attributes (dimensions) |
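
The RDF triple and SPARQL entries above can be made concrete with a few lines of rdflib (assuming that Python package is available); the `http://example.org/` namespace and the facts asserted here are made up.

```python
from rdflib import Graph, Literal, Namespace, RDF

EX = Namespace("http://example.org/")
g = Graph()

# Each add() asserts one triple: subject, predicate, object.
g.add((EX.acme, RDF.type, EX.Customer))
g.add((EX.acme, EX.locatedIn, Literal("Berlin")))

# SPARQL query over the small graph: find customers and their locations.
results = g.query("""
    PREFIX ex: <http://example.org/>
    SELECT ?c ?city WHERE { ?c a ex:Customer ; ex:locatedIn ?city . }
""")
for row in results:
    print(row.c, row.city)
```
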
## 6. BI & Analytics
| Term | Full Form / Description |
|---|---|
| OLAP | Online Analytical Processing — multi-dimensional analysis of business data for decision support |
| ROLAP | Relational OLAP — OLAP implemented on relational databases using SQL; no pre-aggregated cubes |
| MOLAP | Multidimensional OLAP — data stored in pre-aggregated multidimensional arrays; fast queries, less flexible |
| HOLAP | Hybrid OLAP — combines ROLAP and MOLAP; aggregates in multidimensional store, details in relational |
| Cube | Multi-dimensional data structure with pre-calculated aggregates along dimensions and hierarchies |
| MDX | MultiDimensional Expressions — query language for OLAP cubes (Microsoft Analysis Services, Essbase) |
| DAX | Data Analysis Expressions — formula language used in Power BI, Power Pivot, and SSAS Tabular models |
| Power Query / M | Microsoft's data transformation language and engine embedded in Power BI and Excel |
| Measure | Numeric value calculated in a BI context (sum, average, count); can be implicit or explicit |
| Dimension (BI) | Categorical attribute used to slice and filter measures (Time, Geography, Product) |
| Hierarchy | Ordered levels within a dimension enabling drill-down (Year > Quarter > Month > Day) |
| KPI | Key Performance Indicator — quantifiable metric reflecting critical business objectives |
| OKR | Objectives and Key Results — goal-setting framework pairing aspirational goals with measurable outcomes |
| Drill-Down | Navigating from a summary level to more detailed data (e.g., Year → Quarter → Month) |
| Drill-Up / Roll-Up | Aggregating detail data to a higher summary level |
| Drill-Through | Accessing underlying transaction-level detail records from a summarized BI view |
| Slice | Filtering a cube on one dimension to a specific value |
| Dice | Filtering a cube on multiple dimensions simultaneously |
| Pivot | Rotating a data table to exchange rows and columns for different analytical perspectives |
| Cross-Tab | Cross-tabulation — matrix display with row and column totals; pivot table equivalent |
| Scorecard | Dashboard presenting KPIs against targets, typically using RAG (Red/Amber/Green) status indicators |
| Dashboard | Visual display of key metrics and data visualizations for at-a-glance monitoring |
| Report | Structured presentation of data with filtering and formatting; less interactive than a dashboard |
| Ad Hoc Query | Unscheduled, user-defined query created on-demand to answer a specific business question |
| LOD | Level of Detail — Tableau expression type (FIXED, INCLUDE, EXCLUDE) for controlling aggregation scope |
| Table Calculation | Tableau calculations performed on the result set after aggregation, not on underlying data |
| Calculated Field | User-defined computed column or measure within a BI tool |
| Aggregation | Summarizing multiple rows into a single value (SUM, COUNT, AVG, MIN, MAX) |
| Running Total | Cumulative sum of a measure over an ordered dimension (e.g., year-to-date revenue); see the sketch after this table |
| YTD | Year-to-Date — cumulative value from the start of the fiscal/calendar year to current date |
| MTD | Month-to-Date — cumulative value from the start of the current month |
| QTD | Quarter-to-Date — cumulative value from the start of the current quarter |
| Period-over-Period | Comparing a metric in one period to the same metric in a prior period (YoY, MoM, WoW) |
| Time Intelligence | BI functions for date-based calculations (SAMEPERIODLASTYEAR, DATEADD in DAX) |
| Parameterized Report | Report with user-defined input parameters that filter or modify the output |
| Paginated Report | Fixed-layout, print-optimized report; suitable for invoices, statements, pixel-perfect output |
| Self-Service BI | BI tools enabling business users to create their own reports and analyses without IT assistance |
| Augmented Analytics | AI/ML-powered analytics that automates insight discovery, data preparation, and explanation |
| Natural Language Query | NLQ — querying data using plain language questions; enabled by NLP in BI tools |
| Embedded Analytics | BI capabilities integrated directly into business applications rather than standalone tools |
| Pixel-Perfect Reporting | Reports with precise layout control for printing; contrasted with interactive dashboards |
| Semantic Model | In Power BI, the dataset layer that defines measures, hierarchies, relationships, and row-level security |
| Tabular Model | Microsoft SSAS model type storing data in-memory in columnar format for fast DAX queries |
| Multidimensional Model | Microsoft SSAS model using MDX cubes with pre-aggregated measures |
| Composite Model | Power BI model combining DirectQuery and import sources in the same dataset |
| DirectQuery | Power BI mode that sends live queries to the source system instead of importing data (Tableau's equivalent is a live connection) |
| Import Mode | Power BI mode that loads data into an in-memory columnar engine (VertiPaq) |
| VertiPaq | In-memory columnar storage engine powering Power BI's import mode |
| Tableau | Popular BI and data visualization platform known for drag-and-drop exploration |
| Looker | Google's data platform for BI with LookML semantic modeling |
| Power BI | Microsoft's BI suite including Desktop, Service, and Mobile components |
| MicroStrategy | Enterprise BI platform known for large-scale deployments and HyperIntelligence |
| Qlik | BI platform with associative data model engine (QlikView, Qlik Sense) |
| Superset | Apache Superset — open-source BI and data visualization platform |
| Metabase | Open-source BI tool designed for simplicity and self-service |
| Redash | Open-source data visualization and querying tool |
| Grafana | Open-source observability and visualization platform; strong for time-series and operational metrics |
| A/B Testing | Controlled experiment comparing two variants (A and B) to determine which performs better |
| Cohort Analysis | Analyzing behavior of groups of users who share a common characteristic at a point in time |
| Funnel Analysis | Tracking sequential steps users take toward a conversion goal |
| Retention Analysis | Measuring how many users return over time after initial engagement |
| Attribution Modeling | Assigning credit for a conversion to different marketing touchpoints |
| Segmentation | Dividing users or data into groups based on shared attributes for targeted analysis |
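
The running-total, YTD, and period-over-period entries describe standard time-intelligence calculations. The pandas sketch below reproduces them outside any BI tool; the monthly revenue series is fabricated.

```python
import pandas as pd

df = pd.DataFrame({
    "month": pd.date_range("2023-01-01", periods=24, freq="MS"),
    "revenue": range(100, 124),
})

# Running total that resets each year (YTD revenue).
df["ytd_revenue"] = df.groupby(df["month"].dt.year)["revenue"].cumsum()

# Period-over-period: compare against the same month one year earlier (YoY).
df["revenue_prior_year"] = df["revenue"].shift(12)
df["yoy_growth"] = df["revenue"] / df["revenue_prior_year"] - 1

print(df.tail(3))
```

DAX time-intelligence functions such as SAMEPERIODLASTYEAR express the same logic inside Power BI.
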
## 7. Cloud & Infrastructure
| Term | Full Form / Description |
|---|---|
| Columnar Storage | Data stored by column rather than row; dramatically improves analytical query performance and compression (see the sketch after this table) |
| Row-Based Storage | Traditional RDBMS storage; efficient for OLTP (full-row access) but poor for analytical queries that scan a few columns across many rows |
| Compression Ratio | Measure of how much data is reduced by compression; columnar formats achieve much higher ratios than row-based |
| Data Skipping | Using metadata (min/max statistics, bloom filters) to skip irrelevant data files during query execution |
| Bloom Filter | Probabilistic data structure used to test whether an element is in a set; used in Parquet and Iceberg |
| Zone Maps | Column-level min/max statistics stored in file metadata for data skipping optimization |
| Open Table Format | Specification defining how data lake files are organized and queried with ACID support |
| Delta Lake | Linux Foundation open table format providing ACID transactions and schema enforcement on data lakes |
| Apache Iceberg | Open table format for huge analytic datasets with schema evolution and time travel |
| Apache Hudi | Hadoop Upserts Deletes and Incrementals — open table format with CDC and incremental processing |
| ACID | Atomicity, Consistency, Isolation, Durability — properties guaranteeing reliable database transactions |
| Atomicity | Transaction property: all operations succeed or none do; no partial updates |
| Consistency (ACID) | Transaction property: database moves from one valid state to another |
| Isolation | Transaction property: concurrent transactions don't interfere with each other |
| Durability | Transaction property: committed transactions survive system failures |
| MVCC | Multi-Version Concurrency Control — allows concurrent reads and writes by maintaining multiple data versions |
| Time Travel | Ability to query data as it existed at a past point in time; supported by Delta Lake, Iceberg, Snowflake |
| Data Versioning | Tracking changes to datasets over time, enabling rollback and historical queries |
| Snapshot Isolation | Transaction isolation level where reads see a consistent snapshot of data at the transaction start time |
| Write-Ahead Log | WAL — transaction log written before data changes; enables crash recovery and CDC |
| Optimistic Concurrency | Allows multiple transactions to proceed without locking; checks for conflicts at commit time |
| Pessimistic Concurrency | Locks data resources during a transaction to prevent conflicts; reduces throughput |
| Manifest File | Metadata file in Iceberg/Delta that tracks which data files are part of a table snapshot |
| Transaction Log | Ordered record of all changes to a Delta Lake table; basis for ACID properties and time travel |
| Compaction | Process of merging many small files into fewer larger files to improve read performance |
| Small File Problem | Performance issue where too many small files cause excessive metadata overhead and slow queries |
| Auto-Optimize | Databricks feature that automatically compacts small files |
| Liquid Clustering | Databricks' dynamic, incremental replacement for static partition-based clustering |
| Copy-on-Write | Table format update strategy: rewrites entire data files on every update; better for read-heavy workloads |
| Merge-on-Read | Table format update strategy: writes delta files on update; merges at read time; better for write-heavy workloads |
| Serverless | Computing model where infrastructure management is abstracted away; auto-scales and billed per use |
| Elastic Scaling | Automatically adding or removing compute resources based on workload demand |
| Separation of Storage and Compute | Architecture where data storage and query compute scale independently; foundational to Snowflake, BigQuery |
| Virtual Warehouse | Snowflake's compute cluster abstraction; each warehouse is sized, scaled, and suspended independently of storage and of other warehouses |
| Slot | BigQuery's unit of computational capacity; allocated from a shared pool under on-demand pricing or purchased via capacity reservations |
| Redshift | Amazon Redshift — managed MPP data warehouse on AWS |
| BigQuery | Google BigQuery — serverless, multi-cloud data warehouse |
| Snowflake | Cloud data platform with separated storage and compute; supports multi-cloud deployment |
| Synapse Analytics | Azure Synapse Analytics — integrated analytics service combining data warehousing and big data |
| Databricks | Unified analytics platform built on Apache Spark; pioneered the Lakehouse concept |
| S3 | Amazon Simple Storage Service — object storage; de facto standard for data lake storage |
| ADLS | Azure Data Lake Storage — Microsoft's scalable object storage for analytics workloads |
| GCS | Google Cloud Storage — Google's object storage service |
| Object Storage | Storage model using objects with metadata and unique IDs; highly scalable, no hierarchy |
| Block Storage | Raw storage volumes (EBS, Azure Disks); low latency, used for databases |
| File Storage | Hierarchical filesystem-based storage (EFS, Azure Files, NFS) |
| Data Transfer Cost | Cloud charges for moving data between regions, availability zones, or out of cloud |
| Egress Cost | Charges for data transferred out of a cloud provider's network |
| Reserved Capacity | Pre-purchasing cloud compute/storage at discounted rates vs. on-demand pricing |
| Spot/Preemptible Instance | Unused cloud capacity sold at steep discount; can be interrupted; used for fault-tolerant batch jobs |
| VPC | Virtual Private Cloud — isolated network environment within a cloud provider |
| Private Link | Direct private network connection between services without traversing the public internet |
| IAM | Identity and Access Management — cloud service for managing authentication and authorization |
| Service Account | Non-human identity used by applications and services for authentication to cloud APIs |
| Encryption Key Management | KMS — managed service for creating and controlling encryption keys (AWS KMS, Azure Key Vault, GCP KMS) |
| Data Plane | The infrastructure layer that processes and stores actual data |
| Control Plane | The infrastructure layer managing metadata, configuration, and orchestration |
| Multi-Cloud | Strategy of using services from multiple cloud providers to avoid lock-in or optimize cost/capability |
| Hybrid Cloud | Combining on-premises infrastructure with public cloud services |
| Data Gravity | Tendency for data to attract applications and services; moving large datasets is expensive |
| Vendor Lock-In | Dependency on a single vendor's proprietary technology that makes migration costly |
| Open Standards | Non-proprietary specifications enabling interoperability (Parquet, Iceberg, OpenTelemetry) |
| Cost Allocation Tags | Cloud resource tags used for billing attribution by team, project, or environment |
| FinOps | Financial Operations — discipline for managing and optimizing cloud spend through collaboration between engineering, finance, and business teams |
| Infrastructure as Code | IaC — managing infrastructure through version-controlled configuration files (Terraform, Pulumi) |
| Terraform | HashiCorp's open-source IaC tool for provisioning cloud infrastructure |
| Docker | Container platform enabling consistent application packaging and deployment |
| Kubernetes | K8s — container orchestration system for automating deployment, scaling, and management |
| Helm | Package manager for Kubernetes applications |
| CI/CD | Continuous Integration/Continuous Delivery — automated build, test, and deployment pipelines |
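
Columnar storage, column pruning, and predicate pushdown can be demonstrated with Apache Arrow and Parquet. This sketch assumes the pyarrow package and writes a throwaway local file; the columns and values are illustrative.

```python
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Write a tiny columnar file; Parquet stores per-column chunks with min/max stats.
table = pa.table({"region": ["EU", "US", "EU"], "amount": [10, 20, 30]})
pq.write_table(table, "sales.parquet")

# Read it back through the dataset API, pushing the filter down and pruning columns.
dataset = ds.dataset("sales.parquet", format="parquet")
eu_only = dataset.to_table(
    filter=ds.field("region") == "EU",   # predicate pushdown
    columns=["amount"],                  # column pruning
)
print(eu_only.to_pandas())
```

The filter is evaluated against Parquet row-group statistics, so row groups whose min/max values rule out a match can be skipped entirely (the data-skipping and zone-map entries above).
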
## 8. Data Quality & Observability
| Term | Full Form / Description |
|---|---|
| Data Quality | Degree to which data is fit for its intended use; measured across multiple dimensions |
| DQ Dimensions | Standard aspects of data quality: completeness, accuracy, consistency, timeliness, validity, uniqueness (see the sketch after this table) |
| Completeness | DQ dimension: percentage of required fields that contain values; no unexpected nulls |
| Accuracy | DQ dimension: data correctly reflects the real-world entity or event it describes |
| Consistency (DQ) | DQ dimension: data values are consistent across systems and over time |
| Timeliness | DQ dimension: data is available when needed and reflects current reality |
| Validity | DQ dimension: data conforms to defined formats, ranges, and rules |
| Uniqueness | DQ dimension: no unintended duplicate records exist |
| Integrity | DQ dimension: referential and relational integrity is maintained across datasets |
| Conformity | DQ dimension: data adheres to specified data standards and formats |
| Data Anomaly | Unexpected deviation from normal data patterns; may indicate quality issues or real events |
| Anomaly Detection | Automated identification of unusual patterns in data (statistical, ML-based approaches) |
| Data Drift | Gradual change in data distribution over time that degrades model or report accuracy |
| Schema Drift | Unexpected changes to a data source's schema that break downstream pipelines |
| Data Observability | End-to-end visibility into data health including freshness, volume, schema, distribution, and lineage |
| Data Reliability | Consistent availability of high-quality, trustworthy data to consumers |
| Freshness | How recently data was updated; critical SLA for operational and real-time analytics |
| Volume Anomaly | Unexpected spike or drop in row counts, signaling ingestion or source issues |
| Distribution Shift | Change in statistical distribution of column values over time |
| Null Rate | Percentage of null values in a column; tracked as a quality and freshness indicator |
| Duplicate Rate | Percentage of duplicate records in a dataset |
| Data SLA | Service Level Agreement for data products defining freshness, completeness, and availability targets |
| Great Expectations | Open-source Python library for defining, documenting, and validating data quality expectations |
| dbt Tests | dbt's built-in and extensible data testing framework (not_null, unique, accepted_values, relationships) |
| Soda | Data quality platform for defining and running data checks across SQL and file sources |
| Monte Carlo | Data observability platform using ML to detect and alert on data quality issues |
| Bigeye | Data observability and monitoring platform with automated anomaly detection |
| Acceldata | Data observability and pipeline intelligence platform |
| MoD | Metrics on Data — tracking operational metrics (row count, null %, freshness) on datasets over time |
| Data Testing | Automated validation of data against defined expectations or business rules |
| Unit Test (data) | Testing individual transformation logic with known inputs and expected outputs |
| Integration Test (data) | End-to-end test validating that the complete pipeline produces correct results |
| Data Reconciliation | Comparing data between source and target systems to verify completeness and accuracy of transfers |
| Checksum | Hash value computed over data to verify integrity during transfer or storage |
| Row Count Validation | Comparing record counts between source and target as a basic completeness check |
| Referential Integrity | Constraint ensuring foreign key values exist in the referenced primary key table |
| Business Rule Validation | Testing data against domain-specific rules (e.g., order amount must be positive) |
| Statistical Process Control | SPC — applying statistical methods to monitor and control data quality over time |
| Control Chart | Visualization plotting quality metrics over time with upper/lower control limits |
| Data Profiling | Automated analysis of data to understand structure, completeness, distribution, and relationships |
| Column Statistics | Min, max, mean, median, standard deviation, null count computed per column |
| Cardinality Check | Validating the number of distinct values in a column against expectations |
| Pattern Matching | Validating that values conform to expected formats using regex (e.g., email, phone) |
| Range Check | Validating that numeric or date values fall within expected bounds |
| Cross-Field Validation | Validating relationships between multiple columns (e.g., end_date >= start_date) |
| Golden Record | The authoritative, trusted version of a master data entity after deduplication |
| Data Deduplication | Process of identifying and removing duplicate records from a dataset |
| Fuzzy Matching | Identifying similar (not exact) records using string distance algorithms |
| Record Linkage | Linking records across systems that refer to the same entity without a common identifier |
| Probabilistic Matching | Matching records using statistical likelihood scores across multiple attributes |
| Data Quality Score | Composite metric summarizing overall quality of a dataset across multiple dimensions |
| Data Health Dashboard | Centralized view of data quality metrics and SLA compliance across data assets |
| Alerting | Automated notifications when data quality metrics breach defined thresholds |
| Root Cause Analysis | Systematic process for identifying the underlying cause of a data quality issue |
| Incident Management | Process for tracking, escalating, and resolving data quality incidents |
| Data Debugging | Tracing data issues through pipelines using lineage and observability tools |
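
The data quality dimensions listed above translate directly into checks. Below is a minimal pandas sketch of completeness, uniqueness, range, and cross-field validations; the column names and thresholds are illustrative.

```python
import pandas as pd

def quality_report(df: pd.DataFrame) -> dict:
    return {
        "null_rate_email": df["email"].isna().mean(),                   # completeness
        "duplicate_rate": df.duplicated(subset=["order_id"]).mean(),    # uniqueness
        "amount_in_range": bool(df["amount"].between(0, 1_000_000).all()),      # range check
        "dates_consistent": bool((df["end_date"] >= df["start_date"]).all()),   # cross-field validation
    }

orders = pd.DataFrame({
    "order_id": [1, 2, 2],
    "email": ["a@x.com", None, "c@x.com"],
    "amount": [10.0, 20.0, -5.0],
    "start_date": pd.to_datetime(["2024-01-01"] * 3),
    "end_date": pd.to_datetime(["2024-01-02"] * 3),
})
print(quality_report(orders))
```

Frameworks such as Great Expectations, Soda, or dbt tests package the same kinds of checks with scheduling, documentation, and alerting.
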
## 9. Emerging Concepts
| Term | Full Form / Description |
|---|---|
| Data Product | A self-contained, discoverable, addressable, trustworthy, and interoperable data asset served by a domain team |
| Domain Ownership | Data Mesh principle: teams closest to the data own its quality, availability, and access |
| Data as a Product | Data Mesh principle: applying product thinking (SLAs, discoverability, documentation) to data assets |
| Self-Serve Data Platform | Data Mesh principle: infrastructure platform enabling domain teams to build and serve data products independently |
| Federated Computational Governance | Data Mesh principle: global policies enforced through automated platforms rather than central data teams |
| DataOps | Agile methodology applying DevOps principles to data engineering; emphasizes automation, collaboration, and quality |
| MLOps | Machine Learning Operations — practices for deploying, monitoring, and maintaining ML models in production |
| LLMOps | Practices for deploying and managing Large Language Model applications in production |
| ModelOps | Operationalizing all types of AI/ML models including statistical, ML, and deep learning models |
| Feature Store | Centralized repository for storing, sharing, and serving ML features for training and inference |
| Feature Engineering | Transforming raw data into features (input variables) that improve ML model performance |
| Feature Serving | Providing low-latency access to ML features for real-time model inference |
| Online Store | Feature store layer for low-latency feature retrieval for real-time inference (e.g., Redis) |
| Offline Store | Feature store layer for high-throughput feature retrieval for batch training (e.g., S3, BigQuery) |
| Training-Serving Skew | Discrepancy between features used in model training vs. what's served in production; major ML risk |
| Model Registry | Versioned repository for storing, tracking, and managing ML model artifacts |
| Experiment Tracking | Recording hyperparameters, metrics, and artifacts from ML training runs (MLflow, W&B) |
| MLflow | Open-source platform for managing the end-to-end ML lifecycle |
| Kubeflow | Kubernetes-native platform for deploying and managing ML workflows |
| Vector Database | Database optimized for storing and querying high-dimensional embedding vectors (Pinecone, Weaviate, Qdrant) |
| Embedding | Dense numerical vector representation of data (text, images, audio) capturing semantic meaning |
| RAG | Retrieval-Augmented Generation — LLM pattern that retrieves relevant context from a knowledge base before generating responses (see the sketch after this table) |
| ANN | Approximate Nearest Neighbor — algorithms for finding similar vectors efficiently (HNSW, IVF) |
| HNSW | Hierarchical Navigable Small World — graph-based algorithm for fast ANN search in vector databases |
| Cosine Similarity | Metric measuring the angle between two vectors; common similarity measure for embeddings |
| Fine-Tuning | Adapting a pre-trained LLM to a specific task by training on domain-specific data |
| Prompt Engineering | Designing effective prompts to get desired outputs from LLMs without changing model weights |
| LLM | Large Language Model — deep learning model trained on massive text datasets for NLP tasks (GPT-4, Claude) |
| Foundation Model | Large pre-trained model serving as a base that can be fine-tuned for specific tasks |
| Open Lakehouse | Lakehouse built on open standards (Iceberg, Delta, Parquet) without vendor lock-in |
| Streaming Lakehouse | Real-time streaming ingestion and processing directly on lakehouse table formats |
| Knowledge Fabric | Evolution of data fabric incorporating semantic knowledge graphs and ontologies |
| Semantic Data Fabric | Data fabric with rich semantic metadata enabling context-aware data discovery and access |
| Augmented Data Management | Using AI/ML to automate metadata management, data quality, and governance tasks |
| Generative AI for Data | Using LLMs for automated SQL generation, data documentation, anomaly explanation, and data exploration |
| Text-to-SQL | LLM capability to convert natural language questions into executable SQL queries |
| Data Contract (modern) | Machine-readable schema + quality + SLA agreement between producers and consumers; versioned and enforced |
| Open Data Contract Standard | ODCS — open specification for data contracts enabling interoperability between governance tools |
| Soda Core | Open-source data quality CLI tool supporting data contracts and checks |
| Streaming SQL | Writing streaming data pipelines using familiar SQL syntax (Flink SQL, ksqlDB, Apache Calcite) |
| ksqlDB | SQL streaming engine built on top of Apache Kafka |
| Materialize | Streaming database maintaining incrementally-updated materialized views from streaming sources |
| RisingWave | Cloud-native streaming database with PostgreSQL-compatible SQL |
| Unified Data Platform | Single platform combining data engineering, warehousing, analytics, and ML (Databricks, Snowflake) |
| Iceberg REST Catalog | Open REST API specification for interacting with Iceberg catalogs; enables multi-engine access |
| Apache Polaris | Open-source Iceberg catalog supporting the Iceberg REST Catalog spec |
| Apache Gravitino | Open-source unified metadata lake for managing metadata across heterogeneous data sources |
| Nessie | Open-source catalog with Git-like version control for Iceberg and Delta tables |
| Table Format Wars | Industry competition between Delta Lake, Apache Iceberg, and Apache Hudi for open table format dominance |
| UniForm | Databricks feature allowing Delta Lake tables to be read as Iceberg or Hudi tables |
| Delta Sharing | Open protocol for securely sharing live data across organizations without copying |
| Apache Arrow | In-memory columnar data format enabling zero-copy reads across different runtimes |
| Apache Arrow Flight | RPC framework for high-speed data transfer using Arrow format |
| ADBC | Arrow Database Connectivity — Arrow-native replacement for JDBC/ODBC for analytical workloads |
| Ibis | Python dataframe library with multiple backends (DuckDB, BigQuery, Spark) using the same API |
| DuckDB | Embeddable in-process OLAP database; fast analytics on local files without a server |
| MotherDuck | Serverless cloud DuckDB with collaboration features |
| Data Clean Room | Secure environment where multiple parties can analyze combined data without exposing raw data |
| Privacy-Enhancing Technologies | PETs — techniques (differential privacy, homomorphic encryption, secure MPC) for analyzing sensitive data safely |
| Differential Privacy | Mathematical framework adding calibrated noise to query results to protect individual privacy |
| Federated Learning | ML training across distributed data without centralizing the data; preserves privacy |
| Synthetic Data | Artificially generated data statistically similar to real data; used for testing and privacy compliance |
| Data Tokenomics | Emerging concept of data as an economic asset with pricing, ownership, and exchange mechanisms |
| DataOps Manifesto | Published principles for applying DevOps practices to data management |
| Continuous Integration (Data) | Automatically testing data pipeline code changes and data quality before merging to production |
| Continuous Delivery (Data) | Automatically deploying validated data pipeline changes to production environments |
| Infrastructure as Code (Data) | Managing data infrastructure, pipelines, and schemas through version-controlled code |
| Data Version Control | DVC — open-source tool for versioning datasets and ML models alongside code |
| dbt Cloud | Managed cloud service for running dbt transformations with scheduling and CI/CD |
| Git-based Workflows | Managing data pipeline code with Git for version control, branching, and code review |
| Blue-Green Deployment (Data) | Maintaining two identical environments to enable zero-downtime pipeline deployments |
| Canary Deployment (Data) | Gradually rolling out pipeline changes to a subset of users/data before full rollout |
| Schema Migration | Versioned, automated process for evolving database schemas in a controlled way |
| Data Incident | Event where data quality, availability, or security degrades below acceptable thresholds |
| SRE for Data | Applying Site Reliability Engineering principles to data systems for improved reliability |
| Data Engineering | Discipline focused on designing, building, and maintaining data infrastructure and pipelines |
| Analytics Engineering | Discipline at the intersection of data engineering and analysis; owns the transformation layer (dbt) |
| Data Platform Engineering | Building self-serve internal data infrastructure and tooling for data teams |
| Data as Code | Treating data assets (schemas, pipelines, quality rules) with software engineering rigor |
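
The RAG, embedding, and cosine-similarity entries fit together as a retrieval step in front of an LLM. The toy sketch below uses a seeded random vector as a stand-in for a real embedding model, so the ranking itself is meaningless; it only illustrates the shape of the pattern.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder embedding: a real system would call an embedding model here."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    return rng.normal(size=8)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Retrieval: rank candidate documents by similarity to the question vector.
docs = ["Bronze layer holds raw data", "SCD Type 2 keeps full history"]
doc_vecs = [embed(d) for d in docs]
question = "How is dimension history kept?"
q_vec = embed(question)
best = max(range(len(docs)), key=lambda i: cosine(q_vec, doc_vecs[i]))

# Augmentation: the retrieved context is prepended to the prompt sent to the LLM.
prompt = f"Context: {docs[best]}\n\nQuestion: {question}"
print(prompt)
```

In a real pipeline the embeddings come from an embedding model and live in a vector database, with ANN indexes such as HNSW doing the ranking at scale.
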