Server Performance Investigation Case Study — Server & OS Administration Advanced Task

The Scenario

A SA e-commerce business runs its checkout API on a Linux server that has started slowing down every weekday at 17:00 to 18:00. Checkout response times spike from 200ms to 4 seconds. The infrastructure team asks for a structured investigation, root cause, and remediation plan.

The Brief

Produce a complete performance investigation case study. You may model the scenario or use real data from a personal lab. The case study must read like a senior engineer's post-incident report.

Deliverables

A symptom and impact section quantifying the issue: latency percentiles before and during the incident window, error rates, and revenue or user impact
A diagnostic walkthrough showing the methodology: which metrics were checked first (CPU, memory, IO, network, application), which tools were used (top, iostat, vmstat, perf, application APM), and what each ruled in or out
A root-cause analysis identifying the underlying cause (e.g., database connection pool exhaustion, disk IO saturation from a backup job, memory pressure from a leaking process) with supporting evidence
A remediation plan covering: immediate fix, medium-term hardening, and the monitoring or alerting added so this issue is detected within minutes if it recurs

Submission Guidance

Senior engineers narrate their investigation: "I checked CPU first, ruled it out, then looked at IO." The case study must show that thinking, not just the answer. Show the dead ends as well as the breakthrough.

Submit Your Work

Your submission is graded against the rubric on the right. If you pass, you get a public Badge URL you can share on LinkedIn. There is no draft save, so work offline first and paste your finished response here.