
Data Ingestion & Pipelines

Reference for SAs discussing data movement strategies with customers. Covers ingestion patterns, ETL vs. ELT, CDC, and the tooling landscape, with the emphasis on helping customers understand what they have and how it maps to a modern stack.


Ingestion Pattern Overview

flowchart LR
    subgraph Sources
        DB[(Databases)]
        API[APIs / SaaS]
        FILES[Files / FTP]
        STREAM[Event Streams]
        IOT[IoT / Sensors]
    end
    subgraph Patterns
        BATCH[Batch]
        MICRO[Micro-Batch]
        STREAMING[Streaming]
        CDC[CDC]
    end
    subgraph Landing
        LAKE[Data Lake / Lakehouse\nBronze Layer]
        WAREHOUSE[Data Warehouse\nStaging]
        QUEUE[Message Queue\nKafka / Kinesis]
    end
    DB --> BATCH & CDC
    API & FILES --> BATCH
    STREAM & IOT --> STREAMING & MICRO
    BATCH & CDC --> LAKE & WAREHOUSE
    STREAMING & MICRO --> QUEUE --> LAKE

Batch Ingestion

What It Is

Data is collected and moved in discrete chunks on a schedule — nightly, hourly, or triggered by an upstream event. The workhorse of traditional enterprise data pipelines.

When It Works

  • Source systems don't support real-time extraction (legacy ERP, mainframes)
  • Downstream consumers need daily or slower freshness (e.g., monthly finance reports)
  • Volume is too large to process continuously but manageable in windows

Common Tools

| Tool | Notes |
| --- | --- |
| Azure Data Factory | Managed, GUI-driven, native Azure integration |
| AWS Glue | Serverless Spark-based ETL on AWS |
| Apache Airflow | Open-source orchestrator, highly flexible |
| dbt | SQL-based transformation layer, not a full ETL tool |
| Informatica | Enterprise-grade, often found in regulated industries |
| Talend | Open-core, common in mid-market |
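
To make the pattern concrete, here is a minimal sketch of a windowed batch extract in Python, with sqlite3 standing in for the source system; the orders table, updated_at column, and output path are hypothetical.

```python
import csv
import sqlite3
from datetime import date, timedelta

def extract_daily_batch(db_path: str, out_dir: str) -> str:
    """Extract yesterday's rows from a source table into a CSV landing file."""
    window_start = date.today() - timedelta(days=1)  # yesterday 00:00
    window_end = date.today()                        # today 00:00 (exclusive)

    conn = sqlite3.connect(db_path)
    rows = conn.execute(
        "SELECT * FROM orders WHERE updated_at >= ? AND updated_at < ?",
        (window_start.isoformat(), window_end.isoformat()),
    ).fetchall()
    conn.close()

    # Land the chunk as a dated file: one file per batch window
    out_path = f"{out_dir}/orders_{window_start.isoformat()}.csv"
    with open(out_path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return out_path
```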

SA Talking Points

  • Batch is not "legacy" — it is still the right choice for many workloads
  • The customer question is: "What is the acceptable data freshness for this use case?" — if daily is fine, batch is fine
  • Batch orchestration (Airflow, ADF) is often where migration complexity hides: complex DAGs, retry logic, SLA alerting

ETL vs. ELT

Traditional ETL (Extract → Transform → Load)

Data is extracted from the source, transformed in a middle processing layer (the ETL server), then loaded into the destination in a clean, ready-to-use form.

Characteristics:

  • Transformation happens before the data reaches the destination
  • Heavy reliance on middleware (Informatica, SSIS, DataStage)
  • Source data is often not retained in raw form
  • Compute lives in the ETL server, not the destination

Modern ELT (Extract → Load → Transform)

Data is extracted and loaded raw into the destination first (data lake or warehouse), then transformed using the destination's own compute.

Characteristics:

  • Raw data is preserved, so reprocessing is always possible
  • Transformation uses SQL or Spark inside the target platform
  • Enabled by cheap cloud storage and elastic compute (BigQuery, Databricks, Snowflake)
  • dbt is the dominant SQL transformation layer in ELT stacks
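
A minimal ELT sketch, with sqlite3 standing in for the warehouse (and assuming a SQLite build with the JSON1 functions): raw payloads are landed untouched, then transformed with SQL running inside the target, in the spirit of a dbt model. Table and field names are hypothetical.

```python
import json
import sqlite3

conn = sqlite3.connect("warehouse.db")

# Extract + Load: land the raw payloads exactly as received, no transform in flight
conn.execute("CREATE TABLE IF NOT EXISTS raw_events (payload TEXT)")
for record in [{"user": "a", "amount": 10}, {"user": "b", "amount": 25}]:
    conn.execute("INSERT INTO raw_events VALUES (?)", (json.dumps(record),))

# Transform: plain SQL inside the target, rerunnable whenever business rules change
conn.execute("DROP TABLE IF EXISTS user_totals")
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT json_extract(payload, '$.user')        AS user_id,
           SUM(json_extract(payload, '$.amount')) AS total_amount
    FROM raw_events
    GROUP BY user_id
""")
conn.commit()
```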

Comparison

| | ETL | ELT |
| --- | --- | --- |
| Raw data retained | Rarely | Always |
| Compute location | Middleware server | Target platform |
| Flexibility | Lower: transform design upfront | Higher: re-transform as needed |
| Typical tools | Informatica, SSIS, DataStage | dbt, Spark, BigQuery, Databricks |
| Best for | Legacy systems, regulated data masking | Cloud-native, lakehouse, agile teams |

SA Talking Points

  • Most modernization conversations are ETL → ELT migrations
  • "What happens when business rules change?" — in ETL, you re-engineer the pipeline; in ELT, you update a SQL model
  • Legacy ETL tools (Informatica, DataStage) are often the largest migration risk — logic buried in proprietary GUIs

Change Data Capture (CDC)

What It Is

A technique that captures only the rows that have changed in a source database (inserts, updates, deletes) rather than reloading the entire table. Enables near-real-time replication with minimal source system load.

How It Works (Database Log-Based CDC)

sequenceDiagram
    participant App as Application
    participant DB as Source DB
    participant LOG as Transaction Log\n(binlog / WAL / redo log)
    participant CDC as CDC Tool\n(Debezium / Fivetran)
    participant SINK as Target\n(Kafka / Lake / Warehouse)

    App->>DB: INSERT / UPDATE / DELETE
    DB->>LOG: Writes change record
    CDC->>LOG: Tails log continuously
    CDC->>SINK: Publishes change event
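
The events the CDC tool publishes carry the row's before- and after-images plus an operation code. Below is a sketch of replaying one Debezium-style event against a target; the op codes (c / u / d) follow Debezium's convention, but the simplified payload shape and apply logic are illustrative.

```python
# A Debezium-style change event (simplified): op is c=create, u=update, d=delete
event = {
    "op": "u",
    "before": {"id": 42, "status": "pending"},
    "after": {"id": 42, "status": "shipped"},
    "ts_ms": 1700000000000,
}

def apply_change(target: dict, event: dict) -> None:
    """Replay one change event against an in-memory 'target table' keyed by id."""
    op = event["op"]
    if op in ("c", "u"):          # insert or update: upsert the after-image
        row = event["after"]
        target[row["id"]] = row
    elif op == "d":               # delete: remove by the before-image key
        target.pop(event["before"]["id"], None)

table: dict[int, dict] = {42: {"id": 42, "status": "pending"}}
apply_change(table, event)
print(table)  # {42: {'id': 42, 'status': 'shipped'}}
```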

CDC Approaches

| Approach | Mechanism | Pros | Cons |
| --- | --- | --- | --- |
| Log-based | Reads DB transaction log | Low source impact, captures deletes | Requires DB-level access |
| Trigger-based | DB triggers write to audit table | No special permissions | High source DB overhead |
| Timestamp-based | Polls for rows updated after last run | Simple to implement | Misses hard deletes |
| Full table diff | Compares snapshot to previous | Catches everything | Expensive at scale |
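
Timestamp-based CDC is simple enough to sketch directly, and the sketch also shows the hard-delete gap from the table above: a deleted row never appears in the poll, so the target never hears about it. Table and column names are hypothetical.

```python
import sqlite3

def poll_changes(conn: sqlite3.Connection, last_watermark: str):
    """Fetch rows modified since the last run and advance the watermark."""
    rows = conn.execute(
        "SELECT id, status, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_watermark,),
    ).fetchall()
    # New watermark = max updated_at seen. Rows deleted from the source are
    # simply absent here, which is why timestamp polling misses hard deletes.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark
```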

Common CDC Tools

| Tool | Notes |
| --- | --- |
| Debezium | Open-source, Kafka-native, wide DB support |
| Fivetran / Airbyte | Managed connectors, easy setup, less flexibility |
| AWS DMS | Managed CDC for AWS targets |
| Oracle GoldenGate | Enterprise, complex, expensive; common in regulated industries |
| Qlik Replicate (Attunity) | Often found in SAP / Oracle environments |

SA Talking Points

  • CDC is how customers get from "nightly batch" to "data available within minutes"
  • The hard part is not the technology — it is schema evolution: what happens when the source table changes?
  • CDC is also how you replicate SAP, Oracle, and mainframe data without impacting the source system

Streaming Ingestion

What It Is

Data is produced, captured, and processed continuously as events occur — no scheduled batch window. Events flow through a message broker (Kafka, Kinesis) and are consumed by stream processors.

Architecture

flowchart LR
    PRODUCER[Producers\nApps / Sensors / CDC] -->|events| BROKER[Message Broker\nKafka / Kinesis / Pub-Sub]
    BROKER -->|consume| PROCESSOR[Stream Processor\nSpark Streaming / Flink / Kafka Streams]
    PROCESSOR -->|write| SINK[Sinks]
    SINK --> LAKE[Lakehouse\nDelta / Iceberg]
    SINK --> SEARCH[Search / Elasticsearch]
    SINK --> CACHE[Cache / Redis]
    SINK --> BI[Real-Time Dashboard]
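
As a sketch of the broker-to-lakehouse leg above, here is a minimal PySpark Structured Streaming job, assuming a Kafka topic and a Delta-enabled Spark session; the broker address, topic name, and paths are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("events-to-lakehouse").getOrCreate()

# Consume from the broker: one row per Kafka record (key/value arrive as binary)
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(value AS STRING) AS payload", "timestamp")
)

# Continuous write to the bronze layer; the checkpoint is what makes the
# Delta sink exactly-once across restarts
query = (
    events.writeStream.format("delta")
    .option("checkpointLocation", "/lake/_checkpoints/events")
    .start("/lake/bronze/events")
)
query.awaitTermination()
```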

When Streaming Is the Right Answer

  • Fraud detection — decisions must be made in milliseconds
  • IoT / operational monitoring — sensor data needs immediate alerting
  • Customer-facing features — real-time personalization, live inventory
  • Operational analytics — operational teams need current-hour data, not yesterday's

When Streaming Is Overkill

  • Finance / compliance reporting — daily is sufficient, streaming adds cost and complexity
  • Large historical backfills — batch is more efficient
  • Sources that don't change frequently (reference data, product catalogs)

SA Talking Points

  • Streaming is expensive to build and operate — push back on "we need real-time" without a clear use case
  • Ask: "What decision changes if the data is 5 minutes old vs. 5 hours old?" — if the answer is "nothing," batch is sufficient
  • Kafka is often already in the environment (event bus for microservices) — the question is whether to tap it for analytics

Pipeline Orchestration

What It Is

The scheduling, dependency management, monitoring, and retry logic that coordinates all pipeline steps — the "control plane" of a data platform.

Key Players

| Tool | Model | Best For |
| --- | --- | --- |
| Apache Airflow | Python DAGs, self-hosted or managed | Complex dependencies, multi-system workflows |
| Databricks Workflows | Native to Databricks, Jobs API | Databricks-centric pipelines |
| Azure Data Factory | GUI + ARM, managed | Azure-native, less Python-savvy teams |
| AWS Step Functions | Serverless, event-driven | AWS-native, event-triggered workflows |
| Prefect / Dagster | Modern Python orchestrators | Data engineering teams, strong observability |
| dbt Cloud | SQL transformation scheduling | Analytics engineering, BI-focused teams |
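
As a sketch of where the dependency, retry, and alerting logic actually lives, here is a minimal Airflow DAG; the task bodies, IDs, and schedule are hypothetical placeholders.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator

default_args = {
    "retries": 2,                      # retry logic lives here...
    "retry_delay": timedelta(minutes=5),
    "email_on_failure": True,          # ...and so does failure alerting
}

with DAG(
    dag_id="nightly_orders",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",                 # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
    default_args=default_args,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=lambda: print("extract"))
    transform = PythonOperator(task_id="transform", python_callable=lambda: print("transform"))
    load = PythonOperator(task_id="load", python_callable=lambda: print("load"))

    extract >> transform >> load       # explicit dependency chain
```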

SA Talking Points

  • Orchestration is often where the real migration complexity lives — Airflow DAGs can have hundreds of tasks with subtle dependencies
  • "What happens when a pipeline fails at 2am?" — if the answer is unclear, observability and alerting is a gap
  • Databricks Workflows + Delta Live Tables reduces orchestration overhead for lakehouse-native pipelines

SA Rule of Thumb: Most enterprises need batch + CDC to cover 90% of use cases. Streaming is a deliberate choice for specific latency-sensitive use cases — not a default.