Data · Advanced · 3 to 5 hours

Deduplication at Scale

Design a fuzzy-matching deduplication pipeline for a CRM with 500K records.

The Scenario

A sales team has been entering customer data manually for 5 years. The CRM now has 500,000 contact records, but an estimated 15% are duplicates caused by typos, abbreviations ("Pty Ltd" vs "Proprietary Limited"), merged company names, and inconsistent phone formats.
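Before any matching, variants like these have to be standardised. As a minimal sketch (the suffix map and field names are assumptions, not a complete ruleset), company suffixes can be folded to one canonical form and phone numbers stripped to digits:

```python
import re

# Hypothetical standardisation step: fold common company-suffix
# variants to one canonical form, and keep only digits in phones.
SUFFIXES = {
    "proprietary limited": "pty ltd",
    "pty. ltd.": "pty ltd",
    "limited": "ltd",
}

def normalise_company(name: str) -> str:
    # Lowercase, collapse whitespace, then canonicalise the suffix.
    s = re.sub(r"\s+", " ", name.strip().lower())
    for variant, canonical in SUFFIXES.items():
        if s.endswith(variant):
            s = s[: -len(variant)] + canonical
            break
    return s

def normalise_phone(phone: str) -> str:
    return re.sub(r"\D", "", phone)  # digits only

print(normalise_company("Acme Proprietary Limited"))  # acme pty ltd
print(normalise_phone("(02) 9555-0123"))              # 0295550123
```

A real pipeline would drive this from a much larger suffix dictionary and apply the same treatment to addresses before any pairs are compared.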

The Brief

Design a deduplication pipeline. You must choose blocking/indexing strategies to avoid comparing every pair (O(n²)), select similarity metrics (Levenshtein, Jaro-Winkler, phonetic), and define merge rules for when two records are "probably the same".

Deliverables

  • A pipeline diagram showing the stages: standardisation → blocking → comparison → classification → merging
  • Your blocking strategy, and why it reduces the roughly 125 billion possible pairwise comparisons to a manageable number
  • The similarity metrics you would use for name, email, phone, and address fields, with thresholds
  • A merge policy: which record becomes the "golden record" and how conflicting field values are resolved
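To make the threshold deliverable concrete, here is one way a name-similarity score with decision bands could look. Levenshtein distance is used because it is easy to show in full; the thresholds are placeholder assumptions that would need tuning on labelled pairs:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    # Normalised to [0, 1]; 1.0 means identical strings.
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# Illustrative decision bands (assumptions, to be tuned):
#   >= 0.90        auto-match
#   0.75 to 0.90   send to manual review
#   <  0.75        non-match
print(round(name_similarity("jonathan", "jonathon"), 3))  # 0.875
```

In practice each field type would get its own metric (e.g. exact match after normalisation for email, Jaro-Winkler for names, digit-string comparison for phones) and the per-field scores would feed the classification stage.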

Submission Guidance

This is a systems-thinking task. You do not need to write production code, but your pipeline must be specific enough that a developer could implement it.
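As a gauge of that level of specificity, a merge policy can be stated in a dozen lines. This sketch assumes a "most recently updated record wins, empty fields backfilled from the loser" policy; the field names and the policy itself are illustrative assumptions:

```python
from datetime import date

def merge(a: dict, b: dict) -> dict:
    # Hypothetical golden-record rule: the more recently updated
    # record wins overall; empty fields fall back to the other record.
    winner, loser = (a, b) if a["updated"] >= b["updated"] else (b, a)
    golden = dict(winner)
    for field, value in loser.items():
        if not golden.get(field):
            golden[field] = value
    return golden

rec1 = {"name": "Acme Pty Ltd", "phone": "",           "updated": date(2024, 3, 1)}
rec2 = {"name": "Acme",         "phone": "0295550123", "updated": date(2023, 7, 9)}
print(merge(rec1, rec2))
# name from rec1 (newer), phone backfilled from rec2
```

A submission at this level of detail, with the field-by-field conflict rules spelled out, is what "specific enough to implement" means here.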

Submit Your Work

Your submission is graded against the rubric. If you pass, you receive a public Badge URL you can share on LinkedIn. There is no draft save, so draft your response offline first and paste the finished version here.




By submitting, you agree your submission text, name, and evaluation will appear on a public Badge URL.