Data · Intermediate · 2 to 3 hours

Handling Missing Data in a Healthcare Dataset

Decide which imputation strategy to use for each column and defend every choice.

The Scenario

A hospital exported 18 months of patient admission records. Values are missing in several columns, including Age (3%), Blood Pressure (12%), Diagnosis Code (8%), and Admission Date (0.5%). The columns are unlikely to share a single imputation strategy, because the data in each is probably missing for a different reason.

The Brief

You do not have the real file. Describe the imputation strategy you would use for each column, justify why, and explain the risks of getting it wrong in a healthcare context.

Deliverables

  • A table listing each column, its missing rate, your chosen strategy (mean, mode, forward-fill, drop, flag, or model-based), and a one-sentence justification
  • An explanation of the difference between MCAR, MAR, and MNAR and which category each column likely falls into
  • One risk specific to healthcare data where a bad imputation could lead to a dangerous clinical decision
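As a rough illustration of the "flag" option in the first deliverable, the sketch below fills gaps in a numeric column with the median and keeps a parallel indicator of which values were invented. The function name `impute_with_flag` and the sample readings are hypothetical, and the median is used only to show the mechanics; it is not a recommendation for any particular column, and it can mislead badly if the values are MNAR.

```python
from statistics import median


def impute_with_flag(values):
    """Fill None entries with the median of the observed values.

    Returns (filled, was_missing), where was_missing[i] is True if
    values[i] was originally None. Hypothetical helper for illustration.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    was_missing = [v is None for v in values]
    return filled, was_missing


# Invented systolic readings; None marks a missing measurement.
systolic = [118, None, 142, 130, None, 125]
filled, flags = impute_with_flag(systolic)
print(filled)  # [118, 127.5, 142, 130, 127.5, 125]
print(flags)   # [False, True, False, False, True, False]
```

Keeping the flag alongside the filled value lets a downstream reader or model tell a measured reading from an imputed one, which matters when the fact that a value is missing may itself carry clinical information.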

Submission Guidance

Do not default to "drop all rows with nulls". In healthcare, every row is a patient. Show that you understand the cost of discarding data versus the cost of inventing it.
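To put a number on the cost of discarding data, the sketch below estimates how many rows would survive dropping every row with a null, using the missing rates from the scenario. It assumes missingness is independent across columns, which is an assumption made here for the arithmetic and is unlikely to hold exactly in real admission records; the function name `complete_case_fraction` is hypothetical.

```python
from math import prod

# Missing rates taken from the scenario description.
missing_rates = {
    "Age": 0.03,
    "Blood Pressure": 0.12,
    "Diagnosis Code": 0.08,
    "Admission Date": 0.005,
}


def complete_case_fraction(rates):
    """Fraction of rows with no missing value in any column,
    ASSUMING missingness is independent across columns."""
    return prod(1.0 - r for r in rates.values())


retained = complete_case_fraction(missing_rates)
print(f"{retained:.1%} of rows retained, {1 - retained:.1%} dropped")
# 78.1% of rows retained, 21.9% dropped
```

Without the independence assumption, the share of rows dropped lies somewhere between 12% (every gap falls in a row that is also missing Blood Pressure) and 23.5% (no row has more than one gap). Either way, dropping all rows with nulls discards a substantial share of patients.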

Submit Your Work

Your submission is graded against the rubric on the right. If you pass, you get a public Badge URL you can share on LinkedIn. There is no draft save, so work offline first and paste your finished response here.
