Handling Missing Data in a Healthcare Dataset — Data Cleaning & Insights Intermediate Task | Graduates Hub

The Scenario

A hospital exported 18 months of patient admission records. Values are missing across several columns, including Age (3%), Blood Pressure (12%), Diagnosis Code (8%), and Admission Date (0.5%). Each column calls for a different imputation strategy because its values are missing for different reasons.

The Brief

You do not have the real file. Describe the imputation strategy you would use for each column type, justify why, and explain the risks of getting it wrong in a healthcare context.
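Since the real file is not available, a strategy can still be illustrated on made-up values. The sketch below shows one possible combination, median imputation plus a "was missing" flag, so the invented values stay distinguishable from the recorded ones. The function name `impute_with_flag` and the sample ages are hypothetical, and the median is only one reasonable choice for a numeric column like Age.

```python
from statistics import median

def impute_with_flag(values):
    """Fill None entries with the median of the observed values.

    Returns (filled_values, was_missing_flags) so downstream users can
    tell recorded values apart from imputed ones.
    """
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a column with no observed values")
    fill = median(observed)
    filled = [fill if v is None else v for v in values]
    flags = [v is None for v in values]
    return filled, flags

# Hypothetical Age values; None marks a missing entry.
ages = [34, None, 71, 58, None, 45]
filled_ages, age_was_missing = impute_with_flag(ages)
print(filled_ages)       # [34, 51.5, 71, 58, 51.5, 45]
print(age_was_missing)   # [False, True, False, False, True, False]
```

Keeping the flag column means an analyst or clinician can always exclude or re-examine imputed values later, which is harder once they are silently merged into the data.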

Deliverables

A table listing each column, its missing rate, your chosen strategy (mean, mode, forward-fill, drop, flag, or model-based), and a one-sentence justification
An explanation of the difference between MCAR (missing completely at random), MAR (missing at random), and MNAR (missing not at random), and which category each column likely falls into
One risk specific to healthcare data where a bad imputation could lead to a dangerous clinical decision
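To see why the missingness mechanism matters, here is a small simulation on synthetic readings. It contrasts MCAR with one assumed MNAR mechanism in which high blood pressure readings are more likely to be absent. The distribution parameters, the 140 threshold, and the drop probabilities are illustrative assumptions, not properties of the hospital's data. MAR is omitted because it needs a second observed variable to condition on.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical systolic blood pressure readings (mmHg); not real patient data.
true_bp = [rng.gauss(125, 15) for _ in range(20000)]

def observed_mean(values, prob_missing):
    """Mean of the values that remain when each value v is dropped
    with probability prob_missing(v)."""
    kept = [v for v in values if rng.random() >= prob_missing(v)]
    return mean(kept)

# MCAR: every reading has the same 12% chance of being missing.
mcar_mean = observed_mean(true_bp, lambda v: 0.12)

# MNAR (assumed mechanism): readings above 140 go missing half the time.
mnar_mean = observed_mean(true_bp, lambda v: 0.5 if v > 140 else 0.0)

true_mean = mean(true_bp)
print(f"true {true_mean:.1f}  MCAR {mcar_mean:.1f}  MNAR {mnar_mean:.1f}")
```

Under MCAR the observed mean stays close to the true mean, so mean imputation mostly costs variance. Under this MNAR mechanism the observed mean is pulled downward, so filling gaps with it would understate blood pressure for exactly the patients most likely to be hypertensive.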

Submission Guidance

Do not default to "drop all rows with nulls". In healthcare, every row is a patient. Show that you understand the cost of discarding data versus the cost of inventing it.
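The cost of discarding data can be estimated from the missing rates in the scenario. The sketch below assumes the gaps occur independently across the four columns, which the real data may not satisfy, so treat the result as a rough estimate and not a measurement.

```python
# Missing rates quoted in the scenario.
missing_rate = {
    "Age": 0.03,
    "Blood Pressure": 0.12,
    "Diagnosis Code": 0.08,
    "Admission Date": 0.005,
}

# Probability that a row is complete, ASSUMING gaps are independent across columns.
complete_fraction = 1.0
for rate in missing_rate.values():
    complete_fraction *= 1.0 - rate

dropped_fraction = 1.0 - complete_fraction
print(f"rows lost to listwise deletion: {dropped_fraction:.1%}")  # 21.9%
```

Without the independence assumption, the loss is somewhere between 12% (every gap falls in a row that also lacks Blood Pressure) and 23.5% (no row has more than one gap). Either way, dropping rows to deal with a 12% gap in one column removes a substantial share of patients.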

Submit Your Work

Your submission is graded against the rubric on the right. If you pass, you get a public Badge URL you can share on LinkedIn. Drafts are not saved, so work offline first and paste your finished response here.