Healthcare AI · July 1, 2026 · Masano Olivia · 12 min read

Human Oversight Is Necessary but Not a Safety Strategy for Clinical AI

Evidence from primary care in Kenya shows clinical AI can cut errors while still passing harmful recommendations through human review. Oversight is necessary, but until it is designed and measured, it is not a safety strategy.

The most consequential lesson from the past year of clinical AI is not that the models improved. It is that better models did not settle the safety question. The safety question lives in the interaction between the model and the clinician, and most teams are not designing that interaction with the seriousness it requires. When a health system says its AI is safe because a clinician reviews every output, it is describing a mechanism and calling it a guarantee. Those are not the same thing.

We can now say this with evidence rather than intuition, and the strongest evidence comes from African primary care rather than from academic centers in wealthy countries. That evidence points to an uncomfortable conclusion. Human oversight is necessary, but on its own it is not a safety strategy. It is a component whose reliability depends entirely on how it is designed and whether anyone bothers to measure it.

What the Kenyan deployments actually show

Between January and April 2025, Penda Health and OpenAI studied an electronic medical record copilot called AI Consult across roughly 40,000 patient visits in Nairobi. The tool ran quietly in the background and surfaced a colored cue only when a clinical decision drifted from local protocols. Clinicians who used it recorded a 16 percent relative reduction in diagnostic errors and a 13 percent relative reduction in treatment errors compared with clinicians who did not, with history-taking errors falling by roughly a third (Penda Health & OpenAI, 2025). This is among the most credible real-world results in the field. It was conducted at scale, in routine care, under national ethical review, in exactly the kind of high-volume, resource-constrained setting where most clinical AI is discussed in the abstract and rarely tested.

A second evaluation of the same program, published in early 2026, examined the safety of the model's outputs rather than the net change in error rates. A panel of physicians reviewed 1,469 records from an earlier phase of the deployment. Hallucinations were uncommon at 3.4 percent and mostly trivial, such as a misexpanded acronym. Guidance matched local clinical guidelines in 99 percent of cases. On any benchmark-style reading, the model performed well. And yet reviewers still identified actively harmful recommendations in 7.8 percent of encounters, and 67 of those harmful recommendations reached the final documentation. In 62 percent of encounters, clinicians did not modify their documentation at all after receiving the model's feedback (Kimani et al., 2026).

Read those two studies together and the picture is neither triumphant nor alarming. It is precise. The same underlying model, in the same clinics, can reduce errors overall while still generating harmful recommendations that pass through the clinician and land in the record. The net benefit is real. The failure mode is also real, and it is the failure mode that matters most, because it is the one that oversight was supposed to catch.

We should be careful about what this evidence supports. The controlled comparison and the safety audit used different methods and covered different phases of the same tool, so they are complementary rather than a clean before-and-after. Both studies involved the companies that built and funded the work, and neither measured patient outcomes such as morbidity or recovery. A randomized trial with the global health organization PATH is underway to address that gap. We would grade the evidence for net benefit as moderate and improving, and the evidence for the persistence of harmful outputs under human review as moderate and consistent with a much larger literature. That literature is where the argument becomes general.

Why accuracy and oversight still produce harm

The tendency at the center of this problem has a name. Automation bias is the disposition to accept a machine's output even when an unaided human would have decided correctly. It is not a lapse confined to inattentive or junior staff. It is a robust property of how people work alongside automated systems, and it has been measured for decades across radiology, pathology, and prescribing.

The numbers are strikingly consistent. In a foundational study of a computerized diagnostic aid, clinicians overrode their own correct decisions in favor of wrong advice in about 6 percent of cases. In another, a decision-support tool raised the share of correct answers substantially, from 29 to 50 percent, while simultaneously flipping 7 percent of already-correct answers to incorrect ones (Goddard, Roudsari, & Wyatt, 2012). A more recent controlled study in computational pathology found a 7 percent automation bias rate, where correct assessments were reversed by erroneous AI advice, and found that time pressure made the reversals more severe (Rosbach et al., 2024). In UK general practice, clinicians changed prescriptions in response to decision-support advice in roughly a fifth of cases, and switched from a correct to an incorrect prescription in about 5 percent of all cases after receiving wrong prompts.

Two features of this evidence deserve emphasis. First, the systems that produced these errors were, on average, beneficial. Automation bias is not the opposite of a useful tool. It is the tax a useful tool imposes when its interaction design ignores human cognition. Second, the effect concentrates precisely where clinical AI is most needed. Time pressure, high patient volume, wide scope of practice, and thin diagnostic support are the working conditions of primary care in most of the world, and they are the conditions under which people defer most readily to a confident machine. The Penda finding that harmful recommendations reached the final documentation, in a workflow where clinicians left their notes unmodified in most encounters, is not an anomaly. It is what the human-factors literature predicts.

This reframes what regulators and health systems usually mean by "human in the loop." The phrase implies that a clinician standing between the model and the patient converts an imperfect system into a safe one. The evidence says the loop is porous by default. A reviewer under load, prompted by a fluent recommendation, will pass a meaningful fraction of errors through. Calling that arrangement a safety strategy does not make it one.

The interaction is the intervention

If the model is rarely the bottleneck, the design of the interaction is the intervention. This is visible in the Penda work itself, and it is the part most likely to be copied badly.

Several choices appear to matter. The tool ran passively and interrupted only at genuine decision points, rather than requiring clinicians to ask for help or burying them in advisories. It used a tiered signal (green for no concern, yellow for advisory, red for review), which rations attention and reserves interruption for cases that warrant it. It embedded local epidemiology and national guidelines in its prompts, so its recommendations were calibrated to Kenyan disease prevalence rather than to the distribution of an American training set. Each of these is a decision about cognition and workflow, not about model weights.
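
To make the tiered signal concrete, here is a minimal sketch of how a concern score might map to the three cues. It is an illustration only, not a description of AI Consult: the numeric score, the threshold values, the names `Tier`, `triage`, and `interrupts`, and the rule that only red interrupts are all assumptions made for this example.

```python
from enum import Enum


class Tier(Enum):
    GREEN = "no concern"
    YELLOW = "advisory"
    RED = "review"


def triage(concern: float, advisory_at: float = 0.3, review_at: float = 0.7) -> Tier:
    """Map a hypothetical concern score in [0, 1] to a tier.

    The thresholds are placeholders. In a real deployment they would be
    tuned per clinic and revisited over time, not fixed in code.
    """
    if not 0.0 <= concern <= 1.0:
        raise ValueError("concern must be between 0 and 1")
    if concern >= review_at:
        return Tier.RED
    if concern >= advisory_at:
        return Tier.YELLOW
    return Tier.GREEN


def interrupts(tier: Tier) -> bool:
    """Assumption for this sketch: only the highest tier breaks the clinician's flow."""
    return tier is Tier.RED


if __name__ == "__main__":
    for score in (0.1, 0.5, 0.9):
        tier = triage(score)
        print(score, tier.name, "interrupts" if interrupts(tier) else "passive")
```

The point of the sketch is that the thresholds, not the model, decide how much of a clinician's attention the tool spends.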

The counterexample is alert fatigue, the most predictable way to destroy the value of a clinical AI. A system that flags too often trains clinicians to dismiss it, and a dismissed alert protects no one. There is a genuine tension here that clean demonstrations tend to hide. A system tuned to interrupt rarely will miss some errors. A system tuned to interrupt often will be ignored. The right operating point is not a property of the model. It is an empirical question about a specific clinic, its case mix, its staffing, and its tolerance for interruption, and it drifts over time as clinicians adapt. This is why a strong model paired with a naive interface can underperform a weaker model paired with a disciplined one.
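
The trade between interrupting too rarely and too often can be examined with a simple threshold sweep over audited encounters. The sketch below assumes each encounter has a concern score from the tool and a later audit label saying whether a real error was present. The function name `sweep_thresholds` and the data layout are hypothetical. It captures only one side of the trade: it shows how many errors a threshold would miss and how often it would alert, but not how often clinicians dismiss alerts, which has to be observed in the clinic itself.

```python
from typing import List, Sequence, Tuple


def sweep_thresholds(
    scores: Sequence[float],
    is_error: Sequence[bool],
    thresholds: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (threshold, alert_rate, miss_rate) for each candidate threshold.

    alert_rate: share of all encounters that would trigger an alert.
    miss_rate:  share of audited errors that would NOT trigger an alert.
    """
    if len(scores) != len(is_error) or not scores:
        raise ValueError("scores and is_error must be non-empty and equal length")

    n_errors = sum(1 for e in is_error if e)
    rows = []
    for t in thresholds:
        flagged = [s >= t for s in scores]
        alert_rate = sum(flagged) / len(scores)
        missed = sum(1 for f, e in zip(flagged, is_error) if e and not f)
        miss_rate = missed / n_errors if n_errors else 0.0
        rows.append((t, alert_rate, miss_rate))
    return rows


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    scores = [0.1, 0.4, 0.6, 0.9]
    is_error = [False, True, False, True]
    for t, alert_rate, miss_rate in sweep_thresholds(scores, is_error, [0.3, 0.5, 0.95]):
        print(f"threshold={t:.2f} alert_rate={alert_rate:.2f} miss_rate={miss_rate:.2f}")
```

Rerunning a sweep like this on fresh audit data at intervals is one way to notice that the operating point has drifted.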

Broader evidence reinforces the point that capability and outcome come apart. In a randomized study of physicians using a leading model as an assistant, giving clinicians access to the model did not reliably improve their performance in the way its standalone scores would suggest (Goh et al., 2025). The gap between what a model can do in isolation and what a clinician-plus-model system does in practice is the whole game. OpenAI's own framing for the Penda work, the model-implementation gap, names it. We would go further. The gap is not a temporary inconvenience on the way to better models. It is the permanent location of clinical AI safety.

How Curely approaches oversight

We build clinical AI for exactly the settings these studies describe, and our position is shaped by having to make oversight work rather than assume it. Three commitments follow.

We treat oversight as a measured quantity, not a design assumption. If a clinician review step exists, we instrument it. We track how often recommendations are accepted, how often they are overridden, and how often an override was later judged correct. An oversight loop that no one measures is a loop that no one can trust, and unmeasured acceptance rates are how harmful outputs quietly become documentation. The Penda audit is a preview of what every deployed system will find if it looks, and most do not look.
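
One way to make these three quantities concrete is sketched below. It is an illustration under assumptions made for this example, not a description of any production system: the `ReviewEvent` record, its fields, and the `oversight_summary` function are hypothetical, and the sketch assumes a later audit labels some overrides as correct or incorrect.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ReviewEvent:
    accepted: bool  # True if the clinician kept the recommendation
    # Audit verdict for an override; None if the event was accepted
    # or the override has not been audited yet.
    override_correct: Optional[bool] = None


def oversight_summary(events: Iterable[ReviewEvent]) -> Dict[str, Optional[float]]:
    """Summarise acceptance, override, and audited-override correctness rates."""
    events = list(events)
    if not events:
        raise ValueError("no review events to summarise")

    total = len(events)
    overrides = [e for e in events if not e.accepted]
    audited = [e for e in overrides if e.override_correct is not None]
    correct = sum(1 for e in audited if e.override_correct)

    return {
        "acceptance_rate": (total - len(overrides)) / total,
        "override_rate": len(overrides) / total,
        "audited_overrides": float(len(audited)),
        # None signals "not measured", which is itself worth reporting.
        "override_correct_rate": correct / len(audited) if audited else None,
    }


if __name__ == "__main__":
    # Toy events, invented for illustration only.
    log = [
        ReviewEvent(accepted=True),
        ReviewEvent(accepted=True),
        ReviewEvent(accepted=False, override_correct=True),
        ReviewEvent(accepted=False, override_correct=False),
        ReviewEvent(accepted=False),  # override not yet audited
    ]
    print(oversight_summary(log))
```

Returning `None` when no overrides have been audited is a deliberate choice in this sketch: an unmeasured rate should look different from a rate of zero.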

We design against automation bias rather than around it. That means rationing interruption, reserving the strongest signal for the highest-stakes decisions, and making the basis for a recommendation legible so a clinician can evaluate it rather than defer to it. It also means resisting interface choices that manufacture false confidence, such as fluent prose that reads as authoritative regardless of the underlying uncertainty. Fluency is not calibration, and in a clinical setting the difference is a safety property.

We build for the infrastructure that exists, not the infrastructure we wish existed. A copilot that assumes reliable connectivity, abundant compute, and complete diagnostics is a copilot for a hospital that most of the world's patients will never enter. Intermittent connectivity, constrained hardware, and limited laboratory support are not edge cases in the systems we serve. They are the design center. Oversight has to remain reliable when a network drops mid-consultation and when a clinician is seeing a patient every few minutes, or it is not oversight at all.

Where we differ from the prevailing conversation is on emphasis. Much of the field still treats the model as the object of safety work and the clinician as a given. We treat the clinician, the interface, and the workflow as the system under test, and the model as one replaceable component within it.

What this means for evaluation and regulation

The practical implication is that exam-style evaluation, still the field's default, does not predict deployed safety. Models now pass medical licensing examinations and score well on pan-African question sets such as AfriMed-QA, which was built precisely because Western benchmarks did not reflect African clinical reality and later helped train open medical models (Olatunji et al., 2025). In one benchmark, a panel of general models outperformed community health workers across every metric on thousands of clinical questions (Ntawukuriryayo et al., 2026). These results are meaningful, but they measure knowledge recall, not the behavior of a clinician-plus-model system under load. A model that answers questions well can still generate harmful recommendations that a rushed reviewer accepts. Benchmark performance and deployed safety are different variables, and the second is the one patients experience.

Deployment-grade evaluation would measure the loop, not just the model. It would report override and acceptance rates, the share of harmful outputs that reach the record, the calibration of the system's confidence, and how all of this changes as clinicians adapt over weeks. The World Health Organization's 2024 guidance on large multimodal models for health, with its more than forty recommendations for governments, developers, and providers, points in this direction by calling for rigorous evaluation and ongoing monitoring rather than one-time accuracy checks (World Health Organization, 2024). Ministries of health procuring these systems should ask vendors for oversight metrics from real deployments, not benchmark scores, and should treat the absence of such metrics as a finding in itself.
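
Two of these loop-level measures can be sketched in a few lines. The code below is a minimal illustration, assuming an audit has labeled each output as harmful or not and recorded whether it reached the final record, and assuming the system exposes a confidence value between 0 and 1. The function names are hypothetical. The calibration measure is the standard expected calibration error with equal-width bins, chosen here as one common option rather than one prescribed by the WHO guidance or the studies cited.

```python
from typing import List, Sequence, Tuple


def harmful_pass_through(harmful: Sequence[bool], reached_record: Sequence[bool]) -> float:
    """Share of harmful outputs that ended up in the final record."""
    if len(harmful) != len(reached_record):
        raise ValueError("inputs must have equal length")
    n_harmful = sum(1 for h in harmful if h)
    if n_harmful == 0:
        return 0.0
    passed = sum(1 for h, r in zip(harmful, reached_record) if h and r)
    return passed / n_harmful


def expected_calibration_error(
    confidence: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Bin-weighted average gap between stated confidence and observed accuracy."""
    if len(confidence) != len(correct) or not confidence:
        raise ValueError("inputs must be non-empty and equal length")
    if any(not 0.0 <= c <= 1.0 for c in confidence):
        raise ValueError("confidence values must lie in [0, 1]")

    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidence, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # put c == 1.0 in the top bin
        bins[idx].append((c, ok))

    total = len(confidence)
    ece = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / total) * abs(mean_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    print(harmful_pass_through([True, True, False, True], [True, False, True, False]))
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

Tracking both numbers by week, rather than once at launch, is what would show whether the loop changes as clinicians adapt.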

The work that remains

The honest summary is that clinical AI can measurably reduce errors in frontline care and can simultaneously introduce harm that human review does not catch. The difference between those outcomes is decided mostly by design and measurement rather than by model quality. We still lack randomized evidence on patient outcomes, we do not yet understand how these systems affect clinical skill over months and years, and we cannot assume that a result from Nairobi transfers to a different disease burden or a different language without testing.

None of this argues for slowing down. It argues for locating the effort correctly. The next decade of progress in clinical AI will come less from larger models than from better-instrumented loops, from interfaces built for how clinicians actually think under pressure, and from a willingness to measure oversight instead of asserting it. Human oversight will remain essential. It will not be a strategy until we treat it as an engineering problem with failure modes to be tested, and build accordingly.


References

Goddard, K., Roudsari, A., & Wyatt, J. C. (2012). Automation bias: A systematic review of frequency, effect mediators, and mitigators. Journal of the American Medical Informatics Association, 19(1), 121–127. https://doi.org/10.1136/amiajnl-2011-000089

Goh, E., Gallo, R., Hom, J., Strong, E., Weng, Y., Kerman, H., et al. (2025). GPT-4 assistance for improvement of physician performance on patient care tasks: A randomized controlled trial. Nature Medicine, 31, 1233–1238. https://doi.org/10.1038/s41591-024-03456-y

Kimani, J., Korom, R., et al. (2026). Safety of a large language model-based clinical decision support system in African primary healthcare. Nature Health. https://doi.org/10.1038/s44360-026-00082-5

Ntawukuriryayo, J. D., et al. (2026). Large language models for frontline healthcare support in low-resource settings. Nature Health. https://doi.org/10.1038/s44360-025-00038-1

Olatunji, T., et al. (2025). AfriMed-QA: A Pan-African, multi-specialty, medical question-answering benchmark dataset. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (pp. 1948–1973). Association for Computational Linguistics.

Penda Health & OpenAI. (2025). AI-based clinical decision support for primary care: A real-world study (arXiv:2507.16947). arXiv. https://arxiv.org/abs/2507.16947

Rosbach, E., et al. (2024). Automation bias in AI-assisted medical decision-making under time pressure in computational pathology (arXiv:2411.00998). arXiv. https://arxiv.org/abs/2411.00998

World Health Organization. (2024). Ethics and governance of artificial intelligence for health: Guidance on large multi-modal models. Geneva: World Health Organization. https://www.who.int/publications/i/item/9789240084759