Data · Advanced · 3 to 5 hours

Deduplication at Scale

Design a fuzzy-matching deduplication pipeline for a CRM with 500K records.

The Scenario

A sales team has been entering customer data manually for 5 years. The CRM now has 500,000 contact records, but an estimated 15% are duplicates caused by typos, abbreviations ("Pty Ltd" vs "Proprietary Limited"), merged company names, and inconsistent phone formats.
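Before any matching, variants like these have to be standardised. As a minimal sketch (the suffix map and field names are assumptions, not a complete ruleset), company suffixes can be folded to one canonical form and phone numbers stripped to digits:

```python
import re

# Hypothetical standardisation step: fold common company-suffix
# variants to one canonical form, and keep only digits in phones.
SUFFIXES = {
    "proprietary limited": "pty ltd",
    "pty. ltd.": "pty ltd",
    "limited": "ltd",
}

def normalise_company(name: str) -> str:
    # Lowercase, collapse whitespace, then canonicalise the suffix.
    s = re.sub(r"\s+", " ", name.strip().lower())
    for variant, canonical in SUFFIXES.items():
        if s.endswith(variant):
            s = s[: -len(variant)] + canonical
            break
    return s

def normalise_phone(phone: str) -> str:
    return re.sub(r"\D", "", phone)  # digits only

print(normalise_company("Acme Proprietary Limited"))  # acme pty ltd
print(normalise_phone("(02) 9555-0123"))              # 0295550123
```

A real pipeline would drive this from a much larger suffix dictionary and apply the same treatment to addresses before any pairs are compared.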

The Brief

Design a deduplication pipeline. You must choose blocking/indexing strategies to avoid comparing every pair (O(n²)), select similarity metrics (Levenshtein, Jaro-Winkler, phonetic), and define merge rules for when two records are "probably the same".

Deliverables

  • A pipeline diagram showing the stages: standardisation → blocking → comparison → classification → merging
  • Your blocking strategy, and why it reduces the roughly 125 billion possible pairwise comparisons to a manageable number
  • The similarity metrics you would use for name, email, phone, and address fields, with thresholds
  • A merge policy: which record becomes the "golden record" and how conflicting field values are resolved
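To make the threshold deliverable concrete, here is one way a name-similarity score with decision bands could look. Levenshtein distance is used because it is easy to show in full; the thresholds are placeholder assumptions that would need tuning on labelled pairs:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def name_similarity(a: str, b: str) -> float:
    # Normalised to [0, 1]; 1.0 means identical strings.
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

# Illustrative decision bands (assumptions, to be tuned):
#   >= 0.90        auto-match
#   0.75 to 0.90   send to manual review
#   <  0.75        non-match
print(round(name_similarity("jonathan", "jonathon"), 3))  # 0.875
```

In practice each field type would get its own metric (e.g. exact match after normalisation for email, Jaro-Winkler for names, digit-string comparison for phones) and the per-field scores would feed the classification stage.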

Submission Guidance

This is a systems-thinking task. You do not need to write production code, but your pipeline must be specific enough that a developer could implement it.
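As a gauge of that level of specificity, a merge policy can be stated in a dozen lines. This sketch assumes a "most recently updated record wins, empty fields backfilled from the loser" policy; the field names and the policy itself are illustrative assumptions:

```python
from datetime import date

def merge(a: dict, b: dict) -> dict:
    # Hypothetical golden-record rule: the more recently updated
    # record wins overall; empty fields fall back to the other record.
    winner, loser = (a, b) if a["updated"] >= b["updated"] else (b, a)
    golden = dict(winner)
    for field, value in loser.items():
        if not golden.get(field):
            golden[field] = value
    return golden

rec1 = {"name": "Acme Pty Ltd", "phone": "",           "updated": date(2024, 3, 1)}
rec2 = {"name": "Acme",         "phone": "0295550123", "updated": date(2023, 7, 9)}
print(merge(rec1, rec2))
# name from rec1 (newer), phone backfilled from rec2
```

A submission at this level of detail, with the field-by-field conflict rules spelled out, is what "specific enough to implement" means here.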

Submit Your Work

Your submission is graded against the rubric. If you pass, you receive a public Badge URL you can share on LinkedIn. There is no draft save, so draft your response offline first and paste the finished version here.




By submitting, you agree your submission text, name, and evaluation will appear on a public Badge URL.