Skip to content

Real-Time & Streaming Data

Reference for SAs discussing streaming architectures with customers. Covers the core concepts, major platforms, architectural patterns, and — critically — when streaming is and isn't the right answer.


The Streaming Landscape

flowchart LR
    subgraph Producers
        APP[Applications]
        IOT[IoT / Sensors]
        CDC[CDC / DB Changes]
        CLICK[Clickstream]
    end
    subgraph Brokers["Message Brokers / Event Streaming Platforms"]
        KAFKA[Apache Kafka]
        KINESIS[AWS Kinesis]
        PUBSUB[Google Pub/Sub]
        EVENTHUB[Azure Event Hubs]
        SERVICEBUS[Azure Service Bus]
    end
    subgraph Processors["Stream Processors"]
        SPARK[Spark Structured Streaming]
        FLINK[Apache Flink]
        KSTREAMS[Kafka Streams / ksqlDB]
        DATAFLOW[Google Dataflow]
    end
    subgraph Sinks
        LAKE[Lakehouse\nDelta / Iceberg]
        SEARCH[Search\nElasticsearch]
        CACHE[Cache / Redis]
        REALTIME[Real-Time Dashboard]
        ALERT[Alerting System]
    end
    Producers --> Brokers --> Processors --> Sinks

Core Concepts

Events vs. Messages

  • Event: An immutable record that something happened — "Order #123 was placed at 14:32:07"
  • Message: A command or request to do something — "Process payment for Order #123"

Both flow through brokers, but the semantics differ. Analytics workloads deal with events; microservices deal with messages.

Topics and Partitions (Kafka Model)

  • Topic: A named stream of events (e.g., orders, page_views, sensor_readings)
  • Partition: A topic is divided into partitions for parallel consumption. Events within a partition are ordered; across partitions they are not.
  • Consumer Group: Multiple consumers reading the same topic in parallel, each partition assigned to one consumer
Topic: orders
├── Partition 0: [event1, event4, event7, ...]
├── Partition 1: [event2, event5, event8, ...]
└── Partition 2: [event3, event6, event9, ...]

Consumer Group A (BI pipeline): reads all partitions
Consumer Group B (Fraud service): reads all partitions independently

Delivery Guarantees

Guarantee Meaning Use Case
At-most-once May lose events, never duplicates Non-critical metrics
At-least-once Never loses events, may duplicate Most analytics workloads
Exactly-once No loss, no duplicates Financial transactions

Windowing (for Aggregations)

Window Type Definition Example
Tumbling Fixed, non-overlapping time windows Count orders every 5 minutes
Sliding Fixed duration, moves continuously Rolling 1-hour average, updated every 1 minute
Session Dynamic, gap-based (ends after inactivity) User session duration

Message Brokers

Apache Kafka

The dominant open-source event streaming platform. Designed for high-throughput, durable, replicated event logs.

Key characteristics: - Events are retained on disk (configurable retention — days, weeks, or infinite) - Pull-based consumption — consumers control their read position (offset) - Partitioned for horizontal scalability - Replication across brokers for fault tolerance

Managed offerings: Confluent Cloud, AWS MSK, Azure Event Hubs (Kafka-compatible), Aiven

When to use Kafka: - High-throughput event streams (millions of events/second) - Events need to be replayed — multiple consumer groups reading the same topic independently - You need durable event log semantics (not fire-and-forget)

AWS Kinesis Data Streams

AWS-native, fully managed event streaming. Kafka-compatible at a conceptual level but different API.

Key characteristics: - Shard-based (equivalent to partitions) — throughput scales by adding shards - Retention up to 365 days (default 24 hours) - Tight integration with AWS Lambda, Firehose, and Glue

When to use Kinesis: AWS-native architectures, teams that don't want to manage Kafka

Azure Event Hubs

Azure's managed event streaming service. Offers a Kafka-compatible endpoint — existing Kafka producers/consumers work without code changes.

When to use Event Hubs: Azure-native architectures, teams already using Kafka that want a managed alternative

Google Pub/Sub

GCP's managed messaging service. Designed for global fan-out at massive scale.

Key difference from Kafka: Pub/Sub is push-based (or pull) and does not retain events in the same ordered, offset-based way. Less suitable for replay use cases.

When to use Pub/Sub: GCP-native, event fan-out to many subscribers, integration with Dataflow


Stream Processing Engines

Spark Structured Streaming

Extends Apache Spark's DataFrame API to streaming. Processes data as micro-batches (or in continuous mode) and integrates natively with Delta Lake.

Best for: - Teams already using Spark/Databricks for batch — same code patterns - Writing to Delta Lake (native integration, ACID guarantees) - Unified batch + streaming pipelines

Latency: Seconds to sub-second (micro-batch); not suitable for millisecond requirements

True streaming engine (event-by-event processing). The gold standard for low-latency, stateful stream processing.

Best for: - Millisecond latency requirements (fraud detection, trading systems) - Complex stateful operations (joins across streams, sessionization) - High-throughput exactly-once processing

Managed offerings: AWS Kinesis Data Analytics (Flink), Azure HDInsight (Flink), Confluent Cloud (Flink)

Kafka Streams / ksqlDB

Stream processing inside the Kafka ecosystem. No separate cluster needed — processing runs in the Kafka client.

Best for: - Simple transformations, filtering, enrichment within Kafka - Teams that don't want to manage a separate processing cluster - SQL-based stream processing (ksqlDB)

Comparison

Spark Streaming Apache Flink Kafka Streams
Latency Seconds (micro-batch) Milliseconds (true streaming) Milliseconds
State management Good Excellent Good
Exactly-once Yes (with Delta) Yes Yes
Learning curve Low (if using Spark already) High Medium
Best for Lakehouse pipelines Low-latency, complex state Kafka-native transforms

Lambda vs. Kappa Architecture

Lambda Architecture (Batch + Streaming)

Maintains two parallel pipelines: a batch layer for accurate historical processing and a speed layer for real-time approximation. Results are merged at query time.

Sources → Batch Layer (Spark batch)   → Batch Views  ↘
       → Speed Layer (Spark Streaming) → Realtime Views → Serving Layer → User

Problem: Two codebases to maintain, two sets of bugs, results can diverge.

Kappa Architecture (Streaming Only)

Processes everything as a stream. Historical reprocessing is done by replaying the event log from the broker.

Sources → Streaming Layer (Flink / Spark Streaming) → Serving Layer → User

Problem: Requires durable event storage (Kafka with long retention), reprocessing can be slow for years of history.

Modern Lakehouse Approach (Unified)

Spark Structured Streaming + Delta Lake effectively collapses batch and streaming into one pipeline — the same code runs in batch or streaming mode. This is the practical answer for most Databricks customers.


Common Streaming Architecture Patterns

Pattern 1: Real-Time Ingestion to Lakehouse

Most common pattern — streaming data lands in Delta Lake bronze, then processes through silver/gold on a micro-batch cadence.

Kafka → Spark Structured Streaming → Delta Lake Bronze → DLT Pipeline → Silver / Gold

Latency: 30 seconds to 5 minutes end-to-end (sufficient for most BI use cases)

Pattern 2: Real-Time Alerts & Dashboards

Events processed in-stream and pushed to alerting or real-time visualization tools.

Kafka → Flink → Alert Engine (PagerDuty / Slack)
              → Real-Time Dashboard (Grafana / Kibana)

Latency: Milliseconds to seconds

Pattern 3: Stream Enrichment (Join with Reference Data)

Incoming events enriched with reference data (e.g., joining clickstream events with customer profile data).

Clickstream (Kafka) ──────────────────→ Flink Join → Enriched Stream
Customer Profiles (DB / Delta Table) ──┘

When to Use Streaming (and When Not To)

Use Streaming When:

Use Case Why Streaming
Fraud detection Decision must be made before transaction completes
IoT / operational monitoring Alert on anomaly as it happens
Real-time personalization Recommend based on current session behavior
Financial market data Prices change by the millisecond
Operational dashboards Operations teams need current-hour visibility

Don't Use Streaming When:

Scenario Better Approach
Daily/weekly reports Batch — cheaper and simpler
Historical analysis Batch — process at rest
Reference data updates Hourly batch or polling
Low-volume, low-frequency sources Batch — streaming overhead not justified

SA Talking Point — The "Real-Time Tax"

Streaming architectures are significantly more expensive and complex to operate than batch: - More moving parts (brokers, processors, state stores) - More failure modes (consumer lag, exactly-once failures, schema mismatches) - Higher operational burden (on-call for broker outages, consumer group drift)

Ask: "What business decision changes if you see this data 15 minutes later?" If the answer is "nothing significant," batch or micro-batch is the right answer.


SA Rule of Thumb: Most "we need real-time" requests are actually "we need fresher data" — hourly micro-batch satisfies 80% of them. Reserve true streaming for latency-sensitive, revenue-impacting use cases.