
8. Stream and Batch Processing with Dataflow


Section at a Glance

What you'll learn:
  • The fundamental "Unified Model" that distinguishes Dataflow from traditional MapReduce or Spark architectures.
  • How to implement windowing, watermarks, and triggers to handle late-arriving data in streaming pipelines.
  • The architectural differences between the Dataflow Runner and the Apache Beam SDK.
  • Strategies for optimizing pipeline performance and managing cost in both batch and streaming modes.
  • Integration patterns for building end-to-end production-grade ETL/ELT pipelines on GCP.

Key terms: Apache Beam · PCollection · PTransform · Watermark · Windowing · Triggers · Streaming Engine · Side Inputs

TL;DR: Dataflow is a fully managed, serverless service for executing unified stream and batch data processing pipelines using the Apache Beam SDK, eliminating the operational overhead of managing clusters while handling both real-time and historical data with a single code base.


Overview

In the modern enterprise, data arrives at two distinct velocities: the "historical" (batch) and the "immediate" (stream). Historically, companies had to maintain two separate codebases and two separate engineering teams—one for batch processing (e.g., Hadoop/MapReduce) and one for stream processing (e.g., Apache Flink or Storm). This "Lambda Architecture" created massive technical debt, as logic had to be duplicated and synchronized across different systems.

Dataflow solves this "dual-pipeline" problem by providing a Unified Model. Built on the Apache Beam SDK, Dataflow allows a single piece of logic to process a static file in Cloud Storage (Batch) and a continuous stream from Pub/Sub (Streaming) with identical semantics.

For a business, this means reduced engineering overhead, faster time-to-market for new features, and significantly higher data consistency. Instead of managing infrastructure, your engineers focus on the transformation logic, while Dataflow handles the autoscaling, provisioning, and resource management.


Core Concepts

To master Dataflow, you must move beyond thinking about "rows in a table" and start thinking about "unbounded collections of data."

  • The Apache Beam SDK: Dataflow is the execution engine (the Runner), while Apache Beam is the programming model. You write your code in Beam (Python, Java, or Go), and Dataflow executes it.
  • PCollection: The core abstraction. A PCollection represents a distributed data set.
    • In Batch, a PCollection is bounded (it has a clear beginning and end).
    • In Streaming, a PCollection is unbounded (it is an infinite stream of data).
  • PTransform: The operations performed on your data (e.g., ParDo, GroupByKey, Combine).
  • Windowing: The process of dividing a PCollection into finite chunks based on time or other criteria.
    • Fixed Windows: Time intervals of equal length (e.g., every 5 minutes).
    • Sliding Windows: Overlapping intervals (e.g., a 10-minute window that starts every 5 minutes).
    • Session Windows: Windows defined by periods of activity followed by gaps of inactivity.
  • Watermarks: 📌 Must Know: The watermark is Dataflow's notion of progress in event time. It is the system's way of saying, "I am reasonably sure that no more data with an event timestamp earlier than $X$ will arrive."
    • Customer Query: "What happens if data arrives after the watermark has passed?"
    • SA Answer: "That is considered 'late data.' You must define Triggers to decide whether to drop that data, update your previous results, or trigger a special corrective pane."
  • Triggers: The mechanism that determines when a window is actually "emitted" or materialized as an output (see the windowing and trigger sketch after this list).
    • ⚠️ Warning: Using overly frequent triggers in a streaming pipeline can lead to "state explosion," where the system spends more resources managing window metadata than processing actual data, significantly driving up costs.
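
A minimal sketch of windowing, watermark-driven triggering, and allowed lateness in Beam Python; the elements and durations are illustrative, and in production the Create step would be an unbounded source such as Pub/Sub:

import apache_beam as beam
from apache_beam.transforms import window
from apache_beam.transforms.trigger import AccumulationMode, AfterProcessingTime, AfterWatermark

with beam.Pipeline() as p:
    windowed_counts = (
        p
        # Hypothetical keyed events; a real streaming job would read these from Pub/Sub.
        | 'CreateEvents' >> beam.Create([('user_a', 1), ('user_b', 1), ('user_a', 1)])
        | 'AddTimestamps' >> beam.Map(lambda kv: window.TimestampedValue(kv, 0))
        | 'FixedWindows' >> beam.WindowInto(
            window.FixedWindows(5 * 60),                           # 5-minute fixed windows
            trigger=AfterWatermark(late=AfterProcessingTime(60)),  # re-fire ~60s after late data arrives
            allowed_lateness=2 * 60 * 60,                          # keep window state for 2 hours of late data
            accumulation_mode=AccumulationMode.ACCUMULATING)       # late panes refine, not replace, earlier results
        | 'CountPerKey' >> beam.combiners.Count.PerKey()
        | 'Print' >> beam.Map(print)
    )

With the bounded Create source above, the trigger simply fires once when the watermark passes the end of each window; the late-firing and lateness settings only come into play for unbounded inputs.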

Architecture / How It Works

Dataflow abstracts the underlying Compute Engine instances, but understanding the orchestration is vital for troubleshooting.

graph LR
    subgraph "Data Sources"
        A[Cloud Storage - Batch] --> D
        B[Pub/Sub - Stream] --> D
    end

    subgraph "Dataflow Service"
        D[Dataflow Runner] --> E[Worker Instances - Compute Engine]
        E --> F[Dataflow Shuffle Service]
        E --> G[Streaming Engine]
    end

    subgraph "Data Sinks"
        E --> H[BigQuery]
        E --> I[Cloud Bigtable]
    end
  1. Dataflow Runner: The orchestration layer that parses your Beam pipeline and converts it into a directed acyclic graph (DAG) of execution steps.
  2. Worker Instances: The underlying Compute Engine VMs that perform the actual PTransforms on your data.
  3. Dataflow Shuffle Service: A service-side implementation of the "Shuffle" operation (re-distributing data across workers), which removes the need for workers to store intermediate data on local disks.
  4. Streaming Engine: An offload mechanism that moves window state and many shuffle operations from the worker VMs to a specialized backend, allowing for much smoother autoscaling and smaller, more efficient workers (a configuration sketch follows this list).
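
A minimal configuration sketch showing how these service-side components are switched on from Python pipeline options, assuming hypothetical project and bucket names; exact flag names can vary across Beam SDK versions, so verify them against your SDK:

from apache_beam.options.pipeline_options import PipelineOptions

# Hypothetical project, bucket, and sizing values.
options = PipelineOptions(
    runner='DataflowRunner',
    project='my-gcp-project',
    region='us-central1',
    temp_location='gs://my-bucket/temp',
    streaming=True,                 # unbounded sources (e.g., Pub/Sub) run in streaming mode
    enable_streaming_engine=True,   # offload state and streaming shuffle to the service backend
    machine_type='n1-standard-2',   # right-size workers; autoscaling handles throughput spikes
    max_num_workers=10,             # cap autoscaling to bound cost
)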

Comparison: When to Use What

Choosing the right tool is a frequent architecture review topic.

| Option | Best For | Trade-offs | Approx. Cost Signal |
|---|---|---|---|
| Dataflow | Complex, unified stream/batch processing; heavy transformations. | Higher cost per unit of compute compared to simple SQL. | Moderate/High (Managed/Serverless) |
| BigQuery SQL | Simple, high-volume aggregations and transformations (ELT). | Limited to SQL; cannot handle complex custom logic or complex windowing. | Low (Pay per query/slot) |
| Dataproc | Migrating existing Spark/Hadoop workloads without rewriting code. | High operational overhead (you manage clusters and scaling). | Variable (Cluster uptime) |

How to choose: If your logic can be expressed in standard SQL and the data is already in BigQuery, use BigQuery. If you have complex, non-SQL logic or need to ingest from Pub/Sub and transform it in real-time, choose Dataflow.


Cost Cheat Sheet

| Scenario | Recommended Option | Key Cost Driver | Watch Out For |
|---|---|---|---|
| Simple Batch ETL | Dataflow (Standard) | vCPU/Memory usage | Over-provisioning worker types. |
| High-Volume Streaming | Dataflow + Streaming Engine | Throughput (bytes processed) | 💰 Cost Note: Not using Streaming Engine; this forces workers to handle shuffle, ballooning disk/CPU costs. |
| Large-scale Shuffle | Dataflow Shuffle Service | Volume of data shuffled | High-cardinality keys causing "Hot Keys." |
| Periodic Burst Jobs | Dataflow Flex Templates | Pipeline startup time/duration | Keeping workers alive longer than necessary. |

💰 Cost Note: The single biggest cost mistake in Dataflow is failing to enable the Streaming Engine. Without it, your workers must manage all the state and shuffle data locally, leading to massive disk I/O and much higher CPU usage as your stream scales.


Service & Tool Integrations

Dataflow rarely exists in a vacuum. It is the "glue" of the GCP data ecosystem.

  1. The Ingestion Pattern: Pub/Sub → Dataflow → BigQuery. This is the gold standard for real-time analytics (a minimal sketch follows this list).
  2. The Data Lake Pattern: Cloud Storage → Dataflow → BigQuery. Used for cleaning and structured-loading raw CSV/JSON/Avro files.
  3. The Feature Store Pattern: Pub/Sub → Dataflow → Cloud Bigtable. Essential for low-latency serving of machine learning features.
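
A minimal sketch of pattern 1, with hypothetical topic, dataset, and schema names; ReadFromPubSub yields raw message bytes, so each element is decoded and parsed before the streaming write to BigQuery:

import json
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run():
    # Project, region, and runner flags are typically supplied on the command line;
    # streaming=True is required because Pub/Sub is an unbounded source.
    options = PipelineOptions(streaming=True)

    with beam.Pipeline(options=options) as p:
        (
            p
            | 'ReadFromPubSub' >> beam.io.ReadFromPubSub(topic='projects/my-gcp-project/topics/clickstream')
            | 'ParseJson' >> beam.Map(lambda msg: json.loads(msg.decode('utf-8')))
            | 'WriteToBQ' >> beam.io.WriteToBigQuery(
                'my-gcp-project:analytics.click_events',
                schema='user_id:STRING, url:STRING, event_ts:TIMESTAMP',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )

if __name__ == '__main__':
    run()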

Security Considerations

Dataflow integrates deeply with Google Cloud's security perimeter.

| Control | Default State | How to Enable / Strengthen |
|---|---|---|
| Encryption at Rest | Enabled (Google-managed) | Use Customer-Managed Encryption Keys (CMEK) via Cloud KMS for higher compliance. |
| Encryption in Transit | Enabled | Ensure all data sources (like Pub/Sub) are configured for TLS. |
| Network Isolation | Public internet (default) | Deploy Dataflow workers in a VPC with no external IP addresses and use Private Google Access. |
| Access Control | IAM-based | Use fine-grained IAM roles (e.g., roles/dataflow.worker) rather than primitive owner/editor roles. |

Performance & Cost

Tuning for Performance:
  • Avoid Hot Keys: If one key (e.g., a single user_id in a massive stream) has 100x more data than others, one worker will do all the work while others sit idle. Use Combine.globally() or sub-keys to redistribute work (see the sketch after this list).
  • Right-size Workers: Don't use n1-standard-32 for a pipeline that only does simple filtering. Use smaller machines and let the Autoscaling feature do its job.
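
A minimal sketch of the pre-aggregation approach for skewed keys, using a combiner so that partial sums are lifted onto each worker before the shuffle; with_hot_key_fanout is one way to spread a dominant key across intermediate workers (the data and fanout value are illustrative):

import apache_beam as beam

with beam.Pipeline() as p:
    per_key_totals = (
        p
        # Hypothetical skewed input: one user_id dominates the collection.
        | 'CreateEvents' >> beam.Create([('user_hot', 1)] * 1000 + [('user_cold', 1)])
        # Combiner lifting pre-aggregates locally before the shuffle, and the fanout
        # splits the hot key into 16 intermediate partial sums that are merged afterwards.
        | 'SumPerKey' >> beam.CombinePerKey(sum).with_hot_key_fanout(16)
        | 'Print' >> beam.Map(print)
    )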

Cost Scenario Example: Imagine a pipeline processing 1GB of data per hour.
  • Scenario A (No Streaming Engine): You use n1-standard-1 workers. To handle the shuffle, you need 100GB of persistent disk per worker. Total cost: High due to disk and inefficient CPU usage.
  • Scenario B (With Streaming Engine): You use the same n1-standard-1 workers but offload shuffle. Total cost: Significantly lower, as you only pay for the minimal compute and the processed bytes, with much smaller, cheaper disks.


Hands-On: Key Operations

The following Python snippet demonstrates a basic Beam pipeline that reads from a source and writes to BigQuery.

Step 1: Define the pipeline logic. This script reads a text file, extracts a word, and prepares it for a BigQuery table.

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

def run_pipeline():
    # Define your pipeline options (e.g., project, region, runner, temp_location)
    options = PipelineOptions(
        runner='DataflowRunner',
        project='my-gcp-project',
        region='us-central1',
        temp_location='gs://my-bucket/temp'
    )

    with beam.Pipeline(options=options) as p:
        (
            p
            | 'ReadFromGCS' >> beam.io.ReadFromText('gs://my-bucket/input.txt')
            | 'FilterEmptyLines' >> beam.Filter(lambda line: line.strip())
            | 'ExtractWord' >> beam.Map(lambda line: {'word': line.split()[0], 'count': 1})
            | 'WriteToBQ' >> beam.io.WriteToBigQuery(
                'my-gcp-project:my_dataset.my_table',
                schema='word:STRING, count:INTEGER',
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
            )
        )

if __name__ == '__main__':
    run_pipeline()

💡 Tip: When testing locally, use the DirectRunner instead of DataflowRunner. It allows you to debug your logic on your laptop without incurring GCP costs or waiting for cluster provisioning.
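
Building on that tip, here is a minimal local-testing sketch using Beam's testing utilities (TestPipeline uses the DirectRunner by default); the function name and sample inputs are illustrative:

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to

def test_extract_word():
    # Runs entirely on the local DirectRunner, so no GCP resources are consumed.
    with TestPipeline() as p:
        output = (
            p
            | beam.Create(['hello world', 'beam rocks'])
            | beam.Map(lambda line: {'word': line.split()[0], 'count': 1})
        )
        assert_that(output, equal_to([
            {'word': 'hello', 'count': 1},
            {'word': 'beam', 'count': 1},
        ]))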


Customer Conversation Angles

Q: We already have BigQuery. Why do we need Dataflow?
A: BigQuery is incredible for SQL-based transformations on data already in the warehouse, but Dataflow is for the "processing" part of ETL—handling unstructured data, performing complex windowing, or transforming data before it ever hits BigQuery to save on storage and ingest costs.

Q: Is Dataflow more expensive than Dataproc?
A: It depends on your operational capacity. Dataproc might have a lower "per-compute" cost, but you must pay for the engineering hours spent managing, patching, and scaling the clusters. Dataflow is "hands-off," which often results in a lower Total Cost of Ownership (TCO).

Q: How do we handle data that arrives 2 hours late?
A: We use Dataflow's Watermarking and Triggering capabilities. We can define a "late data" policy that allows the pipeline to update existing BigQuery rows whenever late data arrives, ensuring eventual accuracy.

Q: Can Dataflow handle both my real-time clickstream and my nightly CSV uploads?
A: Yes, that is the primary advantage. You can use a single Beam pipeline and simply switch the source from a Pub/Sub topic to a GCS file path.

Q: We are worried about "Hot Keys" slowing down our pipeline. How does Dataflow help?
A: Dataflow's autoscaling and the use of the Shuffle Service help mitigate this, but we would also implement architectural patterns like "combiner lifting" to aggregate data before the heavy shuffle occurs.


Common FAQs and Misconceptions

Q: Does Dataflow scale infinitely?
A: While highly scalable, it is subject to the limits of the underlying Google Cloud infrastructure and your quota for Compute Engine resources. ⚠️ Warning: Never assume "infinite" scaling; always monitor your system_lag metric.

Q: If I use the Streaming Engine, do I still pay for workers?
A: Yes, you still pay for the Compute Engine VMs (the workers) that execute your code. The Streaming Engine just offloads the "heavy lifting" of state management.

Q: Can I run Dataflow using only SQL?
A: Not directly. You use the Apache Beam SDK (Python/Java) or Beam SQL. You still need a "Runner" to execute the logic.

Q: Does Dataflow support Python for all features?
A: Most features are supported, but some advanced Java-centric Beam features might have a delay in Python support. Always check the compatibility matrix for complex transforms.

Q: Is Dataflow a replacement for Pub/Sub?
A: No. Pub/Sub is the transport layer (the messenger); Dataflow is the processing layer (the brain). You almost always use them together.


Exam & Certification Focus

  • Unified Model (Domain: Data Processing): Understand how the same code handles both Bounded (Batch) and Unbounded (Stream) datasets. 📌 High Frequency.
  • Windowing & Watermarks (Domain: Data Processing): Be able to identify which windowing type (Fixed, Sliding, Session) is appropriate for a given business requirement. 📌 High Frequency.
  • Choosing between Dataflow and Dataproc (Domain: Architecture): Know when to use Serverless (Dataflow) vs. Managed Cluster (Dataproc).
  • Dataflow Shuffle & Streaming Engine (Domain: Optimization): Understand how these services impact performance and cost-efficiency.

Quick Recap

  • Dataflow provides a Unified Model for both batch and streaming data using Apache Beam.
  • Watermarks track progress in event time; Triggers handle late-arriving data.
  • Streaming Engine is a critical component for cost-effective, scalable streaming pipelines.
  • Use BigQuery for simple SQL; use Dataflow for complex, multi-step transformations and real-time ingestion.
  • Always design for Hot Keys to prevent bottlenecks in distributed processing.

Further Reading

  • [Apache Beam Documentation] — The authoritative guide to the programming model, transforms, and SDKs.
  • [Google Cloud Dataflow Service Overview] — High-level documentation on features, capabilities, and service limits.
  • [Dataflow Best Practices] — Essential reading for performance tuning and cost optimization.
  • [Cloud Pub/Sub Documentation] — Understanding the primary source for Dataflow streaming pipelines.
  • [Google Cloud Architecture Framework] — For understanding how Dataflow fits into a secure, reliable, and cost-optimized ecosystem.