Data Ingestion & Pipelines¶

Reference for SAs discussing data movement strategies with customers. Covers ingestion patterns, ETL vs. ELT, CDC, and the tooling landscape — with the emphasis on helping customers understand what they have and what it maps to in a modern stack.

Ingestion Pattern Overview¶

flowchart LR
    subgraph Sources
        DB[(Databases)]
        API[APIs / SaaS]
        FILES[Files / FTP]
        STREAM[Event Streams]
        IOT[IoT / Sensors]
    end
    subgraph Patterns
        BATCH[Batch]
        MICRO[Micro-Batch]
        STREAMING[Streaming]
        CDC[CDC]
    end
    subgraph Landing
        LAKE[Data Lake / Lakehouse\nBronze Layer]
        WAREHOUSE[Data Warehouse\nStaging]
        QUEUE[Message Queue\nKafka / Kinesis]
    end
    DB --> BATCH & CDC
    API & FILES --> BATCH
    STREAM & IOT --> STREAMING & MICRO
    BATCH & CDC --> LAKE & WAREHOUSE
    STREAMING & MICRO --> QUEUE --> LAKE

Batch Ingestion¶

What It Is¶

Data is collected and moved in discrete chunks on a schedule — nightly, hourly, or triggered by an upstream event. The workhorse of traditional enterprise data pipelines.

When It Works¶

Source systems don't support real-time extraction (legacy ERP, mainframes)
Downstream consumers only need fresh data daily (monthly finance reports)
Volume is too large to process continuously but manageable in windows

Common Tools¶

Tool	Notes
Azure Data Factory	Managed, GUI-driven, native Azure integration
AWS Glue	Serverless Spark-based ETL on AWS
Apache Airflow	Open-source orchestrator, highly flexible
dbt	SQL-based transformation layer, not a full ETL tool
Informatica	Enterprise-grade, often found in regulated industries
Talend	Open-core, common in mid-market

SA Talking Points¶

Batch is not "legacy" — it is still the right choice for many workloads
The customer question is: "What is the acceptable data freshness for this use case?" — if daily is fine, batch is fine
Batch orchestration (Airflow, ADF) is often where migration complexity hides: complex DAGs, retry logic, SLA alerting

ETL vs. ELT¶

Traditional ETL (Extract → Transform → Load)¶

Data is extracted from the source, transformed in a middle processing layer (the ETL server), then loaded into the destination in a clean, ready-to-use form.

Characteristics: - Transformation happens before the data reaches the destination - Heavy reliance on middleware (Informatica, SSIS, DataStage) - Source data is often not retained in raw form - Compute lives in the ETL server, not the destination

Modern ELT (Extract → Load → Transform)¶

Data is extracted and loaded raw into the destination first (data lake or warehouse), then transformed using the destination's own compute.

Characteristics: - Raw data is preserved — reprocessing is always possible - Transformation uses SQL or Spark inside the target platform - Enabled by cheap cloud storage and elastic compute (BigQuery, Databricks, Snowflake) - dbt is the dominant SQL transformation layer in ELT stacks

Comparison¶

	ETL	ELT
Raw data retained	Rarely	Always
Compute location	Middleware server	Target platform
Flexibility	Lower — transform design upfront	Higher — re-transform as needed
Typical tools	Informatica, SSIS, DataStage	dbt, Spark, BigQuery, Databricks
Best for	Legacy systems, regulated data masking	Cloud-native, lakehouse, agile teams

SA Talking Points¶

Most modernization conversations are ETL → ELT migrations
"What happens when business rules change?" — in ETL, you re-engineer the pipeline; in ELT, you update a SQL model
Legacy ETL tools (Informatica, DataStage) are often the largest migration risk — logic buried in proprietary GUIs

Change Data Capture (CDC)¶

What It Is¶

A technique that captures only the rows that have changed in a source database (inserts, updates, deletes) rather than reloading the entire table. Enables near-real-time replication with minimal source system load.

How It Works (Database Log-Based CDC)¶

sequenceDiagram
    participant App as Application
    participant DB as Source DB
    participant LOG as Transaction Log\n(binlog / WAL / redo log)
    participant CDC as CDC Tool\n(Debezium / Fivetran)
    participant SINK as Target\n(Kafka / Lake / Warehouse)

    App->>DB: INSERT / UPDATE / DELETE
    DB->>LOG: Writes change record
    CDC->>LOG: Tails log continuously
    CDC->>SINK: Publishes change event

CDC Approaches¶

Approach	Mechanism	Pros	Cons
Log-based	Reads DB transaction log	Low source impact, captures deletes	Requires DB-level access
Trigger-based	DB triggers write to audit table	No special permissions	High source DB overhead
Timestamp-based	Polls for rows updated after last run	Simple to implement	Misses hard deletes
Full table diff	Compares snapshot to previous	Catches everything	Expensive at scale

Common CDC Tools¶

Tool	Notes
Debezium	Open-source, Kafka-native, wide DB support
Fivetran / Airbyte	Managed connectors, easy setup, less flexibility
AWS DMS	Managed CDC for AWS targets
Oracle GoldenGate	Enterprise, complex, expensive — common in regulated industries
Qlik Replicate (Attunity)	Often found in SAP / Oracle environments

SA Talking Points¶

CDC is how customers get from "nightly batch" to "data available within minutes"
The hard part is not the technology — it is schema evolution: what happens when the source table changes?
CDC is also how you replicate SAP, Oracle, and mainframe data without impacting the source system

Streaming Ingestion¶

What It Is¶

Data is produced, captured, and processed continuously as events occur — no scheduled batch window. Events flow through a message broker (Kafka, Kinesis) and are consumed by stream processors.

Architecture¶

flowchart LR
    PRODUCER[Producers\nApps / Sensors / CDC] -->|events| BROKER[Message Broker\nKafka / Kinesis / Pub-Sub]
    BROKER -->|consume| PROCESSOR[Stream Processor\nSpark Streaming / Flink / Kafka Streams]
    PROCESSOR -->|write| SINK[Sinks]
    SINK --> LAKE[Lakehouse\nDelta / Iceberg]
    SINK --> SEARCH[Search / Elasticsearch]
    SINK --> CACHE[Cache / Redis]
    SINK --> BI[Real-Time Dashboard]

When Streaming Is the Right Answer¶

Fraud detection — decisions must be made in milliseconds
IoT / operational monitoring — sensor data needs immediate alerting
Customer-facing features — real-time personalization, live inventory
Operational analytics — operational teams need current-hour data, not yesterday's

When Streaming Is Overkill¶

Finance / compliance reporting — daily is sufficient, streaming adds cost and complexity
Large historical backfills — batch is more efficient
Sources that don't change frequently (reference data, product catalogs)

SA Talking Points¶

Streaming is expensive to build and operate — push back on "we need real-time" without a clear use case
Ask: "What decision changes if the data is 5 minutes old vs. 5 hours old?" — if the answer is "nothing," batch is sufficient
Kafka is often already in the environment (event bus for microservices) — the question is whether to tap it for analytics

Pipeline Orchestration¶

What It Is¶

The scheduling, dependency management, monitoring, and retry logic that coordinates all pipeline steps — the "control plane" of a data platform.

Key Players¶

Tool	Model	Best For
Apache Airflow	Python DAGs, self-hosted or managed	Complex dependencies, multi-system workflows
Databricks Workflows	Native to Databricks, Jobs API	Databricks-centric pipelines
Azure Data Factory	GUI + ARM, managed	Azure-native, less Python-savvy teams
AWS Step Functions	Serverless, event-driven	AWS-native, event-triggered workflows
Prefect / Dagster	Modern Python orchestrators	Data engineering teams, strong observability
dbt Cloud	SQL transformation scheduling	Analytics engineering, BI-focused teams

SA Talking Points¶

Orchestration is often where the real migration complexity lives — Airflow DAGs can have hundreds of tasks with subtle dependencies
"What happens when a pipeline fails at 2am?" — if the answer is unclear, observability and alerting is a gap
Databricks Workflows + Delta Live Tables reduces orchestration overhead for lakehouse-native pipelines

SA Rule of Thumb: Most enterprises need batch + CDC to cover 90% of use cases. Streaming is a deliberate choice for specific latency-sensitive use cases — not a default.