Data · Advanced · 3 to 5 hours

Design a Real-Time Streaming Pipeline

Architect a pipeline using Kafka/Spark Streaming for processing clickstream events.

The Scenario

An e-commerce company wants real-time analytics on user behaviour. Currently, clickstream data is batch-loaded every 4 hours. Marketing wants to see what users are doing right now so they can trigger personalised push notifications within 60 seconds of a key event.

The Brief

Design a real-time streaming pipeline. Choose the message broker (Kafka, Kinesis, Pub/Sub), the processing framework (Spark Streaming, Flink, or a simpler consumer), and the output sink (real-time dashboard, notification trigger, or both).

Deliverables

  • An architecture diagram showing producers, broker, consumers, and output sinks
  • Your technology choices, with a defence of each (why Kafka over SQS, why Flink over Spark, etc.)
  • How you handle late-arriving events, duplicate events, and consumer failures
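The last deliverable is where most designs fall over. One common approach is to keep a watermark (the maximum event time seen so far), accept events that lag it by no more than an allowed-lateness bound, and deduplicate on a unique event ID. A minimal sketch of that state machine, assuming each clickstream event carries an `event_id` and an epoch-seconds `event_time` (names invented here, not part of the brief):

```python
from dataclasses import dataclass, field

@dataclass
class DedupWindow:
    """Late-tolerant, idempotent consumer state (illustrative sketch only)."""
    allowed_lateness: float = 30.0          # seconds an event may lag the watermark
    watermark: float = 0.0                  # max event_time observed so far
    seen: set = field(default_factory=set)  # event_ids already processed

    def process(self, event_id: str, event_time: float) -> str:
        if event_id in self.seen:
            return "duplicate"       # e.g. redelivery after a consumer crash
        if event_time < self.watermark - self.allowed_lateness:
            return "dropped-late"    # beyond allowed lateness; route to a DLQ instead
        self.seen.add(event_id)      # record for idempotence
        self.watermark = max(self.watermark, event_time)
        return "accepted"
```

In a real pipeline the `seen` set would need a TTL or a bounded store (it grows without limit here), and Flink/Spark give you watermarks and allowed lateness as built-in primitives rather than hand-rolled state.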

Submission Guidance

This is a senior data engineering task. Focus on exactly-once vs at-least-once semantics and how your architecture handles each.
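A useful framing for that discussion: most brokers give you at-least-once delivery cheaply, and "effectively exactly-once" is then achieved by committing offsets only after the write and making the sink idempotent, so redeliveries are absorbed. A toy sketch of that process-then-commit loop, where `messages` stands in for a Kafka partition and `sink` for a keyed datastore (both hypothetical stand-ins):

```python
def run_consumer(messages, sink, committed_offset=0):
    """At-least-once sketch: write first, commit after.

    A crash between the write and the commit causes redelivery on restart;
    the idempotent keyed upsert makes the replay a no-op at the sink.
    """
    for offset, event_id, payload in messages:
        if offset < committed_offset:
            continue                   # already committed: skip on replay
        sink[event_id] = payload       # idempotent upsert keyed by event_id
        committed_offset = offset + 1  # commit AFTER the write (at-least-once)
    return committed_offset
```

Replaying the same batch from offset 0 leaves the sink unchanged, which is the property your notification trigger needs so a user is never pinged twice for one event.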

Submit Your Work

Your submission is graded against the rubric on the right. If you pass, you get a public Badge URL you can share on LinkedIn. There is no draft save, so work offline first and paste your finished response here.


By submitting, you agree your submission text, name, and evaluation will appear on a public Badge URL.