Real-Time & Streaming Data¶
Reference for SAs discussing streaming architectures with customers. Covers the core concepts, major platforms, architectural patterns, and — critically — when streaming is and isn't the right answer.
The Streaming Landscape¶
flowchart LR
subgraph Producers
APP[Applications]
IOT[IoT / Sensors]
CDC[CDC / DB Changes]
CLICK[Clickstream]
end
subgraph Brokers["Message Brokers / Event Streaming Platforms"]
KAFKA[Apache Kafka]
KINESIS[AWS Kinesis]
PUBSUB[Google Pub/Sub]
EVENTHUB[Azure Event Hubs]
SERVICEBUS[Azure Service Bus]
end
subgraph Processors["Stream Processors"]
SPARK[Spark Structured Streaming]
FLINK[Apache Flink]
KSTREAMS[Kafka Streams / ksqlDB]
DATAFLOW[Google Dataflow]
end
subgraph Sinks
LAKE[Lakehouse\nDelta / Iceberg]
SEARCH[Search\nElasticsearch]
CACHE[Cache / Redis]
REALTIME[Real-Time Dashboard]
ALERT[Alerting System]
end
Producers --> Brokers --> Processors --> Sinks
Core Concepts¶
Events vs. Messages¶
- Event: An immutable record that something happened — "Order #123 was placed at 14:32:07"
- Message: A command or request to do something — "Process payment for Order #123"
Both flow through brokers, but the semantics differ. Analytics workloads deal with events; microservices deal with messages.
Topics and Partitions (Kafka Model)¶
- Topic: A named stream of events (e.g.,
orders,page_views,sensor_readings) - Partition: A topic is divided into partitions for parallel consumption. Events within a partition are ordered; across partitions they are not.
- Consumer Group: Multiple consumers reading the same topic in parallel, each partition assigned to one consumer
Topic: orders
├── Partition 0: [event1, event4, event7, ...]
├── Partition 1: [event2, event5, event8, ...]
└── Partition 2: [event3, event6, event9, ...]
Consumer Group A (BI pipeline): reads all partitions
Consumer Group B (Fraud service): reads all partitions independently
Delivery Guarantees¶
| Guarantee | Meaning | Use Case |
|---|---|---|
| At-most-once | May lose events, never duplicates | Non-critical metrics |
| At-least-once | Never loses events, may duplicate | Most analytics workloads |
| Exactly-once | No loss, no duplicates | Financial transactions |
Windowing (for Aggregations)¶
| Window Type | Definition | Example |
|---|---|---|
| Tumbling | Fixed, non-overlapping time windows | Count orders every 5 minutes |
| Sliding | Fixed duration, moves continuously | Rolling 1-hour average, updated every 1 minute |
| Session | Dynamic, gap-based (ends after inactivity) | User session duration |
Message Brokers¶
Apache Kafka¶
The dominant open-source event streaming platform. Designed for high-throughput, durable, replicated event logs.
Key characteristics: - Events are retained on disk (configurable retention — days, weeks, or infinite) - Pull-based consumption — consumers control their read position (offset) - Partitioned for horizontal scalability - Replication across brokers for fault tolerance
Managed offerings: Confluent Cloud, AWS MSK, Azure Event Hubs (Kafka-compatible), Aiven
When to use Kafka: - High-throughput event streams (millions of events/second) - Events need to be replayed — multiple consumer groups reading the same topic independently - You need durable event log semantics (not fire-and-forget)
AWS Kinesis Data Streams¶
AWS-native, fully managed event streaming. Kafka-compatible at a conceptual level but different API.
Key characteristics: - Shard-based (equivalent to partitions) — throughput scales by adding shards - Retention up to 365 days (default 24 hours) - Tight integration with AWS Lambda, Firehose, and Glue
When to use Kinesis: AWS-native architectures, teams that don't want to manage Kafka
Azure Event Hubs¶
Azure's managed event streaming service. Offers a Kafka-compatible endpoint — existing Kafka producers/consumers work without code changes.
When to use Event Hubs: Azure-native architectures, teams already using Kafka that want a managed alternative
Google Pub/Sub¶
GCP's managed messaging service. Designed for global fan-out at massive scale.
Key difference from Kafka: Pub/Sub is push-based (or pull) and does not retain events in the same ordered, offset-based way. Less suitable for replay use cases.
When to use Pub/Sub: GCP-native, event fan-out to many subscribers, integration with Dataflow
Stream Processing Engines¶
Spark Structured Streaming¶
Extends Apache Spark's DataFrame API to streaming. Processes data as micro-batches (or in continuous mode) and integrates natively with Delta Lake.
Best for: - Teams already using Spark/Databricks for batch — same code patterns - Writing to Delta Lake (native integration, ACID guarantees) - Unified batch + streaming pipelines
Latency: Seconds to sub-second (micro-batch); not suitable for millisecond requirements
Apache Flink¶
True streaming engine (event-by-event processing). The gold standard for low-latency, stateful stream processing.
Best for: - Millisecond latency requirements (fraud detection, trading systems) - Complex stateful operations (joins across streams, sessionization) - High-throughput exactly-once processing
Managed offerings: AWS Kinesis Data Analytics (Flink), Azure HDInsight (Flink), Confluent Cloud (Flink)
Kafka Streams / ksqlDB¶
Stream processing inside the Kafka ecosystem. No separate cluster needed — processing runs in the Kafka client.
Best for: - Simple transformations, filtering, enrichment within Kafka - Teams that don't want to manage a separate processing cluster - SQL-based stream processing (ksqlDB)
Comparison¶
| Spark Streaming | Apache Flink | Kafka Streams | |
|---|---|---|---|
| Latency | Seconds (micro-batch) | Milliseconds (true streaming) | Milliseconds |
| State management | Good | Excellent | Good |
| Exactly-once | Yes (with Delta) | Yes | Yes |
| Learning curve | Low (if using Spark already) | High | Medium |
| Best for | Lakehouse pipelines | Low-latency, complex state | Kafka-native transforms |
Lambda vs. Kappa Architecture¶
Lambda Architecture (Batch + Streaming)¶
Maintains two parallel pipelines: a batch layer for accurate historical processing and a speed layer for real-time approximation. Results are merged at query time.
Sources → Batch Layer (Spark batch) → Batch Views ↘
→ Speed Layer (Spark Streaming) → Realtime Views → Serving Layer → User
Problem: Two codebases to maintain, two sets of bugs, results can diverge.
Kappa Architecture (Streaming Only)¶
Processes everything as a stream. Historical reprocessing is done by replaying the event log from the broker.
Problem: Requires durable event storage (Kafka with long retention), reprocessing can be slow for years of history.
Modern Lakehouse Approach (Unified)¶
Spark Structured Streaming + Delta Lake effectively collapses batch and streaming into one pipeline — the same code runs in batch or streaming mode. This is the practical answer for most Databricks customers.
Common Streaming Architecture Patterns¶
Pattern 1: Real-Time Ingestion to Lakehouse¶
Most common pattern — streaming data lands in Delta Lake bronze, then processes through silver/gold on a micro-batch cadence.
Latency: 30 seconds to 5 minutes end-to-end (sufficient for most BI use cases)
Pattern 2: Real-Time Alerts & Dashboards¶
Events processed in-stream and pushed to alerting or real-time visualization tools.
Latency: Milliseconds to seconds
Pattern 3: Stream Enrichment (Join with Reference Data)¶
Incoming events enriched with reference data (e.g., joining clickstream events with customer profile data).
Clickstream (Kafka) ──────────────────→ Flink Join → Enriched Stream
Customer Profiles (DB / Delta Table) ──┘
When to Use Streaming (and When Not To)¶
Use Streaming When:¶
| Use Case | Why Streaming |
|---|---|
| Fraud detection | Decision must be made before transaction completes |
| IoT / operational monitoring | Alert on anomaly as it happens |
| Real-time personalization | Recommend based on current session behavior |
| Financial market data | Prices change by the millisecond |
| Operational dashboards | Operations teams need current-hour visibility |
Don't Use Streaming When:¶
| Scenario | Better Approach |
|---|---|
| Daily/weekly reports | Batch — cheaper and simpler |
| Historical analysis | Batch — process at rest |
| Reference data updates | Hourly batch or polling |
| Low-volume, low-frequency sources | Batch — streaming overhead not justified |
SA Talking Point — The "Real-Time Tax"¶
Streaming architectures are significantly more expensive and complex to operate than batch: - More moving parts (brokers, processors, state stores) - More failure modes (consumer lag, exactly-once failures, schema mismatches) - Higher operational burden (on-call for broker outages, consumer group drift)
Ask: "What business decision changes if you see this data 15 minutes later?" If the answer is "nothing significant," batch or micro-batch is the right answer.
SA Rule of Thumb: Most "we need real-time" requests are actually "we need fresher data" — hourly micro-batch satisfies 80% of them. Reserve true streaming for latency-sensitive, revenue-impacting use cases.