The Scenario
A sales team has been entering customer data manually for 5 years. The CRM now has 500,000 contact records, but an estimated 15% are duplicates caused by typos, abbreviations ("Pty Ltd" vs "Proprietary Limited"), merged company names, and inconsistent phone formats.
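To make the duplicate causes above concrete, here is a minimal standardisation sketch. The suffix map, the function names, and the default country code "61" (Australia, guessed from the "Pty Ltd" example) are all assumptions for illustration, not part of the brief.

```python
import re

# Hypothetical suffix map; a real one would cover many more legal forms.
# Longer forms come first so "proprietary limited" is tried before "limited".
SUFFIXES = {
    "proprietary limited": "pty ltd",
    "pty. ltd.": "pty ltd",
    "limited": "ltd",
}


def standardise_company(name: str) -> str:
    """Lowercase, collapse whitespace, and map legal suffixes to one form."""
    cleaned = re.sub(r"\s+", " ", name.strip().lower())
    for long_form, short_form in SUFFIXES.items():
        if cleaned.endswith(long_form):
            cleaned = cleaned[: -len(long_form)] + short_form
            break
    return cleaned.replace(".", "")


def standardise_phone(raw: str, default_country: str = "61") -> str:
    """Reduce a phone number to '+<country><number>' (country code assumed)."""
    digits = re.sub(r"\D", "", raw)
    if not digits:
        return ""
    if digits.startswith("00"):
        # International dialling prefix: drop it, the country code follows.
        digits = digits[2:]
    elif digits.startswith("0"):
        # National format: swap the trunk zero for the assumed country code.
        digits = default_country + digits[1:]
    return "+" + digits
```

With these rules, "Acme Pty. Ltd." and "Acme Proprietary Limited" both become "acme pty ltd", and "(02) 9555 0100" and "+61 2 9555 0100" both become "+61295550100", so later stages compare like with like.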
The Brief
Design a deduplication pipeline. You must choose blocking or indexing strategies so that you do not compare every pair of records (which grows as O(n²)), select similarity measures (Levenshtein distance, Jaro-Winkler similarity, phonetic encodings), and define merge rules for when two records are "probably the same".
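The effect of blocking on the number of comparisons can be sketched as follows. The blocking key used here (first three letters of the name plus the postcode) and the record fields `id`, `name`, and `postcode` are hypothetical choices for illustration; a real strategy would likely combine several keys so that a typo in one field does not hide a duplicate.

```python
from collections import defaultdict
from itertools import combinations


def total_pairs(n: int) -> int:
    """Number of unordered pairs among n records: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def blocking_key(record: dict) -> str:
    # Hypothetical key: first three letters of the name plus the postcode.
    return record["name"].strip().lower()[:3] + "|" + record["postcode"]


def candidate_pairs(records):
    """Yield only the id pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for rec in records:
        blocks[blocking_key(rec)].append(rec["id"])
    for ids in blocks.values():
        yield from combinations(ids, 2)


# Made-up sample records, for illustration only.
records = [
    {"id": 1, "name": "Acme Pty Ltd", "postcode": "2000"},
    {"id": 2, "name": "Acme Proprietary Limited", "postcode": "2000"},
    {"id": 3, "name": "Acme Holdings", "postcode": "3000"},
    {"id": 4, "name": "Zenith Ltd", "postcode": "2000"},
]

print(total_pairs(len(records)))       # all pairs without blocking
print(list(candidate_pairs(records)))  # pairs that survive blocking
print(total_pairs(500_000))            # pairs for the full CRM
```

On the four sample records, blocking keeps one candidate pair out of six. For 500,000 records the unblocked count is 124,999,750,000, which is the "125 billion" figure the deliverables refer to.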
Deliverables
- A pipeline diagram showing the stages: standardisation → blocking → comparison → classification → merging
- Your blocking strategy and why it reduces comparisons from roughly 125 billion (every possible pair among 500,000 records) to a manageable number
- The similarity metrics you would use for name, email, phone, and address fields, with thresholds
- A merge policy: which record becomes the "golden record" and how conflicting field values are resolved
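As an indication of the level of detail that the comparison, classification, and merging deliverables call for, here is a minimal sketch. The thresholds (0.9 and 0.75), the "most recently updated record wins" rule, and the field names are assumptions chosen for illustration; your own answer should justify its own choices. Levenshtein distance is implemented directly so the example needs no third-party library.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalise edit distance to a 0..1 score (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))


def classify(score: float, match_at: float = 0.9, review_at: float = 0.75) -> str:
    # Hypothetical thresholds: auto-merge, send to a human, or leave alone.
    if score >= match_at:
        return "match"
    if score >= review_at:
        return "review"
    return "non-match"


def merge(a: dict, b: dict) -> dict:
    """Assumed policy: the newer record is golden; gaps are filled from the older."""
    newer, older = (a, b) if a["updated"] >= b["updated"] else (b, a)
    golden = dict(newer)
    for field, value in older.items():
        if not golden.get(field):
            golden[field] = value
    return golden


# Made-up records; ISO dates compare correctly as plain strings.
rec_a = {"name": "Acme Pty Ltd", "phone": "", "updated": "2023-01-10"}
rec_b = {"name": "Acme Pty Ltd.", "phone": "+61295550100", "updated": "2021-05-02"}
print(classify(similarity(rec_a["name"], rec_b["name"])))
print(merge(rec_a, rec_b))
```

Here the newer record supplies the name and the older one fills the missing phone number. A fuller policy would also say what happens when both records hold different non-empty values for the same field.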
Submission Guidance
This is a systems-thinking task. You do not need to write production code, but your pipeline must be specific enough that a developer could implement it.
Submit Your Work
Your submission is graded against the rubric on the right. If you pass, you get a public Badge URL you can share on LinkedIn. There is no draft save, so work offline first and paste your finished response here.