Data · Advanced · 3 to 5 hours

Design a Real-Time Streaming Pipeline

Architect a pipeline using Kafka/Spark Streaming for processing clickstream events.

The Scenario

An e-commerce company wants real-time analytics on user behaviour. Currently, clickstream data is batch-loaded every 4 hours. Marketing wants to see what users are doing right now so they can trigger personalised push notifications within 60 seconds of a key event.

The Brief

Design a real-time streaming pipeline. Choose the message broker (Kafka, Kinesis, Pub/Sub), the processing framework (Spark Streaming, Flink, or a simpler consumer), and the output sink (real-time dashboard, notification trigger, or both).

Deliverables

  • An architecture diagram showing producers, broker, consumers, and output sinks
  • Your technology choices, with a defence of each (why Kafka over SQS, why Flink over Spark, etc.)
  • How you handle late-arriving events, duplicate events, and consumer failures
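The last deliverable is where most designs fall over. One common approach is to keep a watermark (the maximum event time seen so far), accept events that lag it by no more than an allowed-lateness bound, and deduplicate on a unique event ID. A minimal sketch of that state machine, assuming each clickstream event carries an `event_id` and an epoch-seconds `event_time` (names invented here, not part of the brief):

```python
from dataclasses import dataclass, field

@dataclass
class DedupWindow:
    """Late-tolerant, idempotent consumer state (illustrative sketch only)."""
    allowed_lateness: float = 30.0          # seconds an event may lag the watermark
    watermark: float = 0.0                  # max event_time observed so far
    seen: set = field(default_factory=set)  # event_ids already processed

    def process(self, event_id: str, event_time: float) -> str:
        if event_id in self.seen:
            return "duplicate"       # e.g. redelivery after a consumer crash
        if event_time < self.watermark - self.allowed_lateness:
            return "dropped-late"    # beyond allowed lateness; route to a DLQ instead
        self.seen.add(event_id)      # record for idempotence
        self.watermark = max(self.watermark, event_time)
        return "accepted"
```

In a real pipeline the `seen` set would need a TTL or a bounded store (it grows without limit here), and Flink/Spark give you watermarks and allowed lateness as built-in primitives rather than hand-rolled state.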

Submission Guidance

This is a senior data engineering task. Focus on exactly-once vs at-least-once semantics and how your architecture handles each.
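A useful framing for that discussion: most brokers give you at-least-once delivery cheaply, and "effectively exactly-once" is then achieved by committing offsets only after the write and making the sink idempotent, so redeliveries are absorbed. A toy sketch of that process-then-commit loop, where `messages` stands in for a Kafka partition and `sink` for a keyed datastore (both hypothetical stand-ins):

```python
def run_consumer(messages, sink, committed_offset=0):
    """At-least-once sketch: write first, commit after.

    A crash between the write and the commit causes redelivery on restart;
    the idempotent keyed upsert makes the replay a no-op at the sink.
    """
    for offset, event_id, payload in messages:
        if offset < committed_offset:
            continue                   # already committed: skip on replay
        sink[event_id] = payload       # idempotent upsert keyed by event_id
        committed_offset = offset + 1  # commit AFTER the write (at-least-once)
    return committed_offset
```

Replaying the same batch from offset 0 leaves the sink unchanged, which is the property your notification trigger needs so a user is never pinged twice for one event.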

Submit Your Work

Your submission is graded against the rubric on the right. If you pass, you get a public Badge URL you can share on LinkedIn. There is no draft save, so work offline first and paste your finished response here.


By submitting, you agree your submission text, name, and evaluation will appear on a public Badge URL.