http://cdn.openai.com/pdf/a794887b-5a77-4207-bb62-e52c900463f1/penda_paper.pdf
🚨 New preprint from OpenAI + Penda Health, a large network of primary care clinics in Nairobi, Kenya.
Unlike most LLM research that lives in theory or simulation (e.g., testing on NEJM Challenge Cases or benchmark questions), this was tested live across nearly 40,000 real patient visits.
They deployed AI Consult, an LLM-based tool that reviews clinician notes and flags potential issues (see screenshot for an example), using traffic-light colors to indicate the level of concern.
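For intuition, here's a minimal sketch of what a traffic-light note-review call could look like. The prompt, model choice, and JSON schema here are my assumptions for illustration only; the paper's actual AI Consult pipeline is more involved.

```python
# Illustrative sketch only: the prompt, model, and schema are assumptions,
# not Penda Health's actual AI Consult implementation.
import json
from openai import OpenAI

client = OpenAI()

def review_visit_note(note: str) -> dict:
    """Ask an LLM to flag potential issues in a clinician's note,
    returning a traffic-light severity (green/yellow/red) plus flags."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed model choice
        response_format={"type": "json_object"},
        messages=[
            {
                "role": "system",
                "content": (
                    "You review primary-care visit notes for potential "
                    "history-taking, diagnostic, and treatment issues. "
                    'Respond with JSON: {"severity": "green"|"yellow"|"red", '
                    '"flags": [list of short issue descriptions]}.'
                ),
            },
            {"role": "user", "content": note},
        ],
    )
    return json.loads(response.choices[0].message.content)

# Example: a deliberately incomplete note should trigger a flag.
result = review_visit_note(
    "5yo, fever x3 days, started amoxicillin. No malaria test documented."
)
print(result["severity"], result["flags"])
```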
Half of the clinicians were randomly given access, half were not.
Key Results:
- 🩺 32% reduction in history-taking errors
- 🔍 16% reduction in diagnostic errors
- 💊 14% reduction in treatment errors
- ✅ 100% of clinicians who responded to the survey said it was helpful (though only 67% completed it)
- ⚠️ No safety harms identified
The “left in red” rate (the share of visits where the final AI Consult assessment was red) dropped in the AI group to ~20% from an initial 35-40% (comparable to the non-AI group at the start), while the non-AI group's rate held steady at ~40%. That divergence indicates clinicians were actually acting on the most severe alerts.
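To make the metric concrete, here's a tiny sketch of how a “left in red” rate could be computed from visit logs. The record structure and field names are hypothetical:

```python
# Hypothetical visit records; "final_call" is the last AI Consult
# severity logged for the visit (field names are assumptions).
visits = [
    {"visit_id": 1, "arm": "ai", "final_call": "green"},
    {"visit_id": 2, "arm": "ai", "final_call": "red"},
    {"visit_id": 3, "arm": "non_ai", "final_call": "red"},
]

def left_in_red_rate(visits: list[dict], arm: str) -> float:
    """Share of an arm's visits whose final AI Consult call was red."""
    arm_visits = [v for v in visits if v["arm"] == arm]
    reds = sum(1 for v in arm_visits if v["final_call"] == "red")
    return reds / len(arm_visits)

print(f"AI arm: {left_in_red_rate(visits, 'ai'):.0%}")        # 50% in this toy data
print(f"Non-AI arm: {left_in_red_rate(visits, 'non_ai'):.0%}")  # 100% in this toy data
```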
This is one of the clearest real-world wins for LLMs in healthcare to date. Yes, OpenAI funded and helped analyze the study, so take it with a grain of salt, but the results are promising and cool!