Design an A/B Testing Framework — AI Evaluation & Benchmarking Advanced Task | Graduates Hub

The Scenario

Prompt A is short and zero-shot. Prompt B is long and few-shot. Both are live in your app, with 50% of traffic routed to each. You need to design a framework for deciding which one is actually better for the business.
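One way to picture the 50/50 split is deterministic, hash-based assignment, so that a returning user always sees the same prompt. The sketch below is illustrative only: the function name `assign_variant`, the experiment label `prompt_ab_v1`, and the choice of SHA-256 are assumptions, not details of the app in the scenario.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "prompt_ab_v1") -> str:
    """Deterministically map a user to 'A' or 'B' with a 50/50 split.

    The experiment label (hypothetical) is mixed into the hash so that a
    different experiment reshuffles users instead of reusing the same buckets.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "A" if bucket < 50 else "B"
```

Assigning by user rather than by request keeps each user's experience consistent and makes per-user metrics attributable to a single prompt.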

The Brief

Write an evaluation strategy document. How will you measure success beyond just "the text looks nice"?

Deliverables

The core business metric you will track (e.g., thumbs up/down, copy-paste events, latency)
The automated evaluation metrics (e.g., JSON parse failure rate)
A strategy for handling latency vs. quality trade-offs (for example, Prompt B produces better output but takes 3 seconds longer to generate)
The threshold for declaring a "winner"
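As a rough illustration of how the last two deliverables could fit together, the sketch below applies a two-proportion z-test to a success rate (such as thumbs-up rate) and then a latency guardrail. Everything specific here is an assumption chosen for the example: the `VariantStats` fields, the 0.05 significance level, the 2-point minimum lift, and the 3-second latency budget. Your own document should justify its own values.

```python
import math
from dataclasses import dataclass


@dataclass
class VariantStats:
    successes: int        # e.g. thumbs-up count (assumed metric)
    trials: int           # number of rated responses
    p95_latency_s: float  # 95th-percentile latency in seconds


def two_proportion_z(a: VariantStats, b: VariantStats) -> tuple[float, float]:
    """Return (z, two-sided p-value) for the difference in success rates."""
    p_a = a.successes / a.trials
    p_b = b.successes / b.trials
    pooled = (a.successes + b.successes) / (a.trials + b.trials)
    se = math.sqrt(pooled * (1 - pooled) * (1 / a.trials + 1 / b.trials))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p_value


def decide(a: VariantStats, b: VariantStats, alpha: float = 0.05,
           min_lift: float = 0.02, latency_budget_s: float = 3.0) -> str:
    """Name a winner only if the lift is significant, large enough,
    and the winning prompt stays within the latency budget."""
    _, p_value = two_proportion_z(a, b)
    lift = b.successes / b.trials - a.successes / a.trials
    if p_value >= alpha or abs(lift) < min_lift:
        return "no winner yet"
    winner, loser, name = (b, a, "B") if lift > 0 else (a, b, "A")
    if winner.p95_latency_s - loser.p95_latency_s > latency_budget_s:
        return "no winner: latency guardrail exceeded"
    return name
```

Treating latency as a guardrail rather than folding it into one blended score keeps the decision rule easy to explain; a weighted score is an equally defensible alternative if you can justify the weights.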

Submission Guidance

Prompt engineering in production is an engineering discipline, not creative writing. Latency, token cost, and API failure rates matter just as much as prose quality.
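To make those engineering metrics concrete, here is a minimal sketch that summarizes per-variant logs into an API failure rate, a JSON parse failure rate, a p95 latency, and a mean token count. The `CallLog` schema and its field names are hypothetical; adapt them to whatever your logging pipeline actually records.

```python
import json
import math
from dataclasses import dataclass


@dataclass
class CallLog:  # hypothetical log record, one per model call
    variant: str
    raw_output: str
    latency_s: float
    total_tokens: int
    api_error: bool


def is_valid_json(text: str) -> bool:
    """True if the model output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def summarize(logs: list[CallLog], variant: str) -> dict[str, float]:
    """Aggregate automated metrics for one variant."""
    rows = [r for r in logs if r.variant == variant]
    completed = [r for r in rows if not r.api_error]
    if not rows or not completed:
        raise ValueError(f"not enough data for variant {variant!r}")
    latencies = sorted(r.latency_s for r in completed)
    # Nearest-rank 95th percentile over completed calls.
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    parse_failures = sum(not is_valid_json(r.raw_output) for r in completed)
    return {
        "api_failure_rate": (len(rows) - len(completed)) / len(rows),
        "json_parse_failure_rate": parse_failures / len(completed),
        "p95_latency_s": latencies[p95_index],
        "mean_tokens": sum(r.total_tokens for r in completed) / len(completed),
    }
```

Parse failures are counted only over calls that completed, so an API outage does not inflate the parse failure rate; the two failure modes are reported separately.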

Submit Your Work

Your submission is graded against the rubric on the right. If you pass, you get a public Badge URL you can share on LinkedIn. There is no draft save, so work offline first and paste your finished response here.