Section 9: Streaming Data with Kinesis and MSK¶

Overview¶

In the world of modern data engineering, the "batch window" is dying. We no longer live in a world where waiting six hours for an ETL job to finish is acceptable. Business stakeholders want real-time dashboards, fraud detection in milliseconds, and instant telemetry updates. This is the domain of Stream Processing.

Streaming data services solve the fundamental problem of decoupling producers from consumers. Without a streaming buffer, if your high-frequency IoT sensor (producer) sends data directly to your database (consumer), a sudden spike in traffic will crash your database. A streaming service like Amazon Kinesis Data Streams (KDS) or Amazon Managed Streaming for Apache Kafka (MSK) acts as a shock absorber. It persists the incoming data for a period of time, allowing consumers to process the data at their own pace, even if they temporarily fall behind.

When choosing between Kinesis and MSK, you are making a fundamental architectural decision. Kinesis is the "AWS-native" choice: it is serverless, highly integrated, and requires much less operational overhead. It is perfect for standard AWS workloads. MSK, on the other as, is the managed version of Apache Kafka. You choose MSK when you have an existing Kafka ecosystem, require specific Kafka-native plugins, or need the massive, highly-customizable throughput that Kafka provides. As a Data Engineer, your job isn't just to "use" these services, but to know which one provides the right balance of operational ease and raw performance for your specific scale.

Core Concepts¶

1. Amazon Kinesis Data Streams (KDS)¶

Shards: The fundamental unit of capacity. One shard provides a fixed unit of throughput: 1 MB/s ingress and 2 MB/s egress. If you need 5 MB/s, you need at least 5 shards.
Partition Key: A string used to distribute data across shards. Warning: Do not use a low-cardinality key (like RegionID). This leads to "Hot Shards," where one shard is overwhelmed while others sit idle. Use high-cardinality keys like UUID or DeviceID.
Retention Period: By default, data is kept for 24 hours. You can extend this up to 365 days (note: this significantly increases costs).
On-Demand vs. Provisioned Mode:
- Provisioned: You manage shards. You pay for the shards you provisioned, regardless of use.
- On-Demand: AWS manages the scaling. It’s great for unpredictable workloads but more expensive per GB than provisioned mode.

2. Amazon Kinesis Data Firehose (KDF)¶

Near-Real-Time: Unlike KDS, KDF is a "delivery" service. It doesn't allow you to "read" the data manually; it pushes it to a destination.
Buffering: KDF uses two buffering hints: Buffer Size (e.g., 5MB) and Buffer Interval (e.m., 60 seconds). The delivery happens when either limit is hit.
Transformation: KDF can trigger an AWS Lambda function to transform raw JSON into Parquet/ORC before the data lands in S3.

3. Amazon MSK (Managed Streaming for Kafka)¶

Brokers: The servers that store the data. Unlike Kinesis, you don't manage "shards," you manage "instances" and "partitions."
Topics & Partitions: Data is organized into Topics. A Topic is split into Partitions, which allow for parallelism.
Zookeeper/KRaft: The coordination mechanism for the Kafka cluster.
Managed Nature: AWS handles the patching, setup, and hardware, but you are still responsible for choosing instance types and managing cluster scaling.

Architecture / How It Works¶

The following diagram illustrates the two primary patterns you will see on the exam: the Serverless Streaming Pipeline (Kinesis) and the Enterprise Kafka Pipeline (MSK).

graph LR
    subgraph "Data Sources"
        A[IoT Sensors] --> KDS
        B[App Logs] --> KDS
        C[Microservices] --> MSK
    end

    subtrograph "Streaming Layer"
        KDS[Kinesis Data Streams] -- "Triggers" --> KDF[Kinesis Data Firehose]
        MSK[Amazon MSK] -- "MSK Connect" --> S3
    end

    subgraph "Processing & Storage"
        KDF -- "Lambda Transform" --> S3[(Amazon S3)]
        KDF --> Redshift[(Amazon Redshift)]
        S3 --> Athena[Amazon Athena]
    end

AWS Service Integrations¶

Data Ingress (Producers)¶

Kinesis Agent: A lightweight Java application installed on EC2/On-prem servers to ship logs to Kinesis.
AWS DMS (Database Migration Service): Can capture changes (CDC) from RDS/Oracle and stream them into Kinesis or MSK.
AWS IoT Core: Directly integrates with Kinesis to route sensor messages.

Data Egress (Consumers)¶

Kinesis Data Analytics (Flink): Performs complex SQL or Flink applications on the live stream.
AWS Lambda: The primary way to trigger real-time compute based on a new record in KDS.
Amazon S3/Redshift: Primarily via Kinesis Data Firehose for "Zero-ETL" patterns.

IAM & Trust Relationships¶

Kinesis Firehose to S3: The Firehose Service Role must have s3:PutObject and s3:GetBucketLocation permissions.
Kinesis Data Streams to Lambda: The Lambda execution role must have kinesis:GetRecords, kinesis:GetShardIterator, and kinesis:DescribeStream.

able ## Security

Identity and Access Management (IAM)¶

Resource-based Policies: Used primarily with MSK (to allow cross-account access) and Kinesis (to restrict which VPCs can access the stream).
Least Privilege: Never use kinesis:*. Always scope to kinesis:PutRecord for producers and kinesis:GetRecords for consumers.

Encryption¶

At Rest:
- Kinesis: Supports SSE-KMS. Use a Customer Managed Key (CMK) if you need to rotate keys or audit usage via CloudTrail.
- MSK: Uses AWS KMS to encrypt EBS volumes and Kafka logs.
In Transit:
- TLS/SSL: Mandatory for MSK production environments.
- VPC Endpoints (PrivateLink): Use Interface VPC Endpoints for Kinesis to ensure data never traverses the public internet. This is a high-priority security requirement in the exam.

Audit & Compliance¶

CloudTrail: Every API call (CreateStream, DeleteStream, UpdateShardCount) is logged here.
VPC Flow Logs: Crucial for auditing network-level access to MSK brokers.

Performance Tuning¶

The "Hot Shard" Problem¶

If you see ReadProvisionedThroughputExceeded on a specific shard, you have a Hot Shard. * The Fix: Check your Partition Key. If you are using CustomerID, and one customer has 100x more events than others, that shard will choke. Change your key to something more granular, like CustomerID + Timestamp.

Kinesis Data Firehose Tuning¶

Buffer Size vs. Cost: Larger buffers mean fewer, larger files in S3 (better for Athena/Glue performance), but higher latency.
Lambda Transformation: If your Lambda transformation takes too long, KDF might time out. Keep transformations lightweight.

MSK Scaling¶

Horizontal Scaling: Add more brokers to the cluster. This requires a rebalance of partitions.
Vertical Scaling: Change the instance type (e.g., moving from kafka.m5.large to kafka.m5.xlarge). This usually involves a rolling update of the cluster.

Important Metrics to Monitor¶

Metric Name (Namespace: `Kinesis`)	What it Measures	Alarm Threshold	Action to Take
`ReadProvisionedThroughputExceeded`	Consumers are being throttled.	$> 0$	Increase shards or optimize consumer logic.
`WriteProvisionedThroughputExceeded`	Producers are being throttled.	$> 0$	Increase shards or check for hot keys.
`IncomingBytes`	Total data volume entering the stream.	Check for unexpected spikes in traffic.
`IteratorAgeMilliseconds`	How far behind the consumer is from the tip of the stream.	$> 60,000$ (1 min)	Scale up consumers or check for processing bottlenecks.
`DeliveryToS3.Success` (Namespace: `Firehose`)	Percentage of successful deliveries to S3.	$< 100\%$	Check IAM permissions or S3 bucket policies.
`Kafka.ConsumerLag` (Namespace: `MSK`)	The gap between the latest offset and the consumer offset.	Growing trend	Scale out consumer group members.

Hands-On: Key Operations¶

1. Creating a Kinesis Data Stream (AWS CLI)¶

# Create a stream named 'ProductionLogs'
# We use 'on-demand' to avoid managing shards for this specific workload
aws kinesis create-stream \
    --stream-name ProductionLogs \
    --stream-mode-capacity on-demand

2. Producing Data to Kinesis (Python/Boto3)¶

import boto3
import json

client = boto3.client('kinesis')

# Data payload
data = {'user_id': 'user_123', 'event': 'login', 'status': 'success'}
payload = json.dumps(data)

# The 'PartitionKey' is CRITICAL. 
# Using 'user_123' ensures all logs for this user go to the same shard.
response = client.put_record(
    StreamName='ProductionLogs',
    Data=payload,
    PartitionKey='user_123' 
)

print(f"Successfully sent record. SequenceNumber: {response['SequenceNumber']}")

3. Consuming Data from Kinesis (Python/Boto3)¶

import boto3
import time

client = boto3.client('kinesis')

# 1. Get the Shard Iterator (The 'pointer' in the stream)
shard_id = 'shardId-000000000000' # In reality, you'd fetch this via describe_stream
iterator = client.get_shard_iterator(
    StreamName='ProductionLogs',
    ShardId=shard_id,
    ShardIteratorType='LATEST'
)['ShardIterator']

# 2. Continuous Loop to poll for records
while True:
    response = client.get_records(ShardIterator=iterator, Limit=10)
    for record in response['Records']:
        print(f"New Record Found: {record['Data'].decode('utf-8')}")

    # Update the iterator to the next position
    iterator = response['NextShardIterator']
    time.sleep(1) # Don't hammer the API

Common FAQs and Misconceptions¶

Q: I need to process data in real-time with SQL. Should I use Kinesis Data Firehose? A: No. Firehose is for delivery (batching data into S3/Redshift). For real-time SQL processing, use Kinesis Data Analytics (Flink).

Q: Does Kinesis Data Streams provide built-in storage for long-term archiving? A: No. Kinesis is a transient buffer. While you can extend retention to 365 days, you should use Kinesis Data Firehose to archive data to Amazon S3 for long-term, low-cost storage.

Q: Can I use the same Partition Key for all my data in Kinesis? A: You can, but you should not. This creates a "Hot Shard" where a single shard handles all the traffic, effectively nullging the benefit of having a multi-shard stream.

Q: Is MSK serverless? A: MSK is "managed," meaning AWS handles the heavy lifting, but it is not "serverless" in the same way Kinesis On-Demand is. You still interact with broker instances and cluster configurations.

Q: Does Kinesis Data Firehose support schema enforcement? A: Not natively, but you can use an AWS Lambda function within the Firehose transformation step to validate or transform the schema before it reaches the destination.

Q: If my Kinesis consumer fails, is the data lost? A: No. As long as the data is within the retention period (default 24h), a new consumer can start reading from a previous checkpoint or the beginning of the stream.

Q: What is the main difference between Kinesis and MSK for an engineer? A: Kinesis is an AWS-native, API-driven service (simpler). MSK is a Kafka-compatible service (more flexible, ecosystem-rich, but more complex).

Q: Can Kinesis Data Firehose write directly to Amazon Redshift? A: Yes, but it actually writes to S3 first and then issues a COPY command to Redshift. This is the standard, high-performance pattern.

Exam Focus Areas¶

Ingestion & Transformation (Domain 1):
- Choosing between KDS (low latency/custom) vs. KDF (delivery/near-real-time).
- Using Lambda for stream transformation in KDF.
- Implementing CDC (Change Data Capture) using DMS and Kinesis.
Store & Manage (Domain 2):
- Partitioning strategies (avoiding Hot Shards).
- Kinesis retention period management.
Operate & Support (Domain 3):
- Monitoring IteratorAge and ProvisionedThroughputExceeded.
- Scaling Kinesis shards (resharding).
- Securing streams using VPC Endpoints and KMS.

Quick Recap¶

Kinesis Data Streams is for real-time, custom-built streaming applications.
Kinesis Data Firehose is for near-real-time delivery to S3, Redshift, or OpenSearch.
Partition Keys are the most critical configuration for preventing performance bottlenecks (Hot Shards).
MSK is the choice for Kafka-native workloads and massive scale.
Scaling Kinesis involves managing shards; scaling MSK involves managing broker instances.
Security requires IAM for access and KMS/TLS for data protection.

Blog & Reference Implementations¶

AWS Big Data Blog: Search for "Kinesis Data Firehose" to learn about advanced transformation patterns.
AWS re:Invent 2023 - Building Real-time Pipelines: Deep dive into Kinesis-to-S3 architectures.
AWS Workshop Studio: "Amazon MSK Workshop" – hands-on cluster setup and producer/consumer labs.
AWS Well-Architected Framework: Review the "Reliability Pillar" for designing resilient streaming architectures.
aws-samples (GitHub): Search for amazon-kinesis-samples to see production-ready Python and Java producers.