Research · July 1, 2026 · Kaddu Livingstone · 7 min read
AI Agents in Drug Discovery, Strong on the Science and Shaky on the Workflow
AI agents already handle much of the early science of drug discovery, but they falter on long, multi-step workflows, and no AI-discovered drug is yet approved. Here is what the 2026 benchmarks and first clinical readouts actually show.

In 2026, AI agents can perform much of the early science of drug discovery competently, yet they still fail at stringing it together. The frontier problem is not knowledge. Current coding agents recognize the right protein structures, know the right databases, and execute individual cheminformatics steps well. What they lack is the ability to plan a long, dependent workflow and hold every constraint from the first step to the last. For health systems working with limited resources, a second constraint matters more than model quality: whether the data and tools an agent depends on exist for your patients at all. This review covers what the current benchmarks and the first clinical readouts actually show, and what that means for anyone deciding where to place trust and budget.
What an AI agent does in drug discovery, and what it does not
An AI agent in this context is not a chatbot answering questions. It plans a multi-step workflow, calls specialized scientific tools such as structure parsers, sequence editors, cheminformatics libraries, and database clients, often writes and runs its own code, and works toward a goal across many steps. In benchmark settings these agents typically run inside an environment like Biomni, an open biomedical agent framework from Stanford's SNAP lab that exposes a few hundred functions the agent calls like any Python library. The agent decides which functions to use, whether to lean on the library or write fresh code, and how many steps to take. That autonomy is the point, and it is also where the difficulty lives.
The best current read on agent skill says one thing three ways
The most detailed public evaluation to date is Scale Labs' DrugDiscoveryBench, which tested three frontier coding agents, Claude Code (Opus 4.7), Codex (GPT-5.5), and Gemini CLI (Gemini 3.1 Pro), on 66 expert-curated, verifiable tasks spanning target identification, hit discovery, hit-to-lead, and lead optimization (emerging evidence, early internal report whose numbers are expected to move). Mean outcome accuracy across the tasks all three answered was 46 percent for Claude Code, 62 percent for Codex, and 53 percent for Gemini CLI. On 43 of the 66 tasks (about 65 percent) at least one agent produced a correct answer; on the remaining 23 none did.
The pattern inside those numbers is the useful part. Agents were most reliable on chemistry and structure tasks, which tend to be deterministic with few decision points between the data and the answer. They struggled most on long retrieval and biology chains that string many database queries and filters together. One worked example makes the failure concrete. Asked for a melanoma protein marker, all three agents ranked candidate genes by their total pathogenic-variant count and returned genes such as BRCA2 or PTEN, when the answer is CDKN2A. They dropped the melanoma scope at the final counting step, a planning slip rather than a gap in biological knowledge. The decisive finding supports that reading: when the agents were handed an expert's method, meaning the sequence of steps and which tools to use but not the answer, most previously unsolved tasks became solvable. The bottleneck is high-level planning, not the underlying science.
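The scope-dropping slip described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's task or data: the variant records, the "other" disease label, and the function name top_gene are all invented for the example, and the counts carry no biological meaning. The point is only that the same counting step returns a different gene depending on whether the disease filter survives to the end.

```python
from collections import Counter

# Hypothetical toy records: (gene, disease, classification).
# Invented for illustration only; these are NOT real variant counts.
VARIANTS = [
    ("BRCA2", "other", "pathogenic"),
    ("BRCA2", "other", "pathogenic"),
    ("BRCA2", "other", "pathogenic"),
    ("BRCA2", "melanoma", "pathogenic"),
    ("PTEN", "other", "pathogenic"),
    ("PTEN", "other", "pathogenic"),
    ("PTEN", "melanoma", "benign"),
    ("CDKN2A", "melanoma", "pathogenic"),
    ("CDKN2A", "melanoma", "pathogenic"),
]


def top_gene(records, disease=None):
    """Return the gene with the most pathogenic variants.

    If `disease` is None the count is unscoped, which reproduces the
    planning slip: the disease constraint is lost at the counting step.
    """
    counts = Counter(
        gene
        for gene, dis, cls in records
        if cls == "pathogenic" and (disease is None or dis == disease)
    )
    return counts.most_common(1)[0][0]


unscoped = top_gene(VARIANTS)             # constraint dropped
scoped = top_gene(VARIANTS, "melanoma")   # constraint held to the last step
print(unscoped, scoped)
```

On this toy data the unscoped count favors the gene with the most variants overall, while the scoped count favors the gene with the most melanoma variants. Nothing about the counting logic changes between the two calls; only the constraint does.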
Two other evaluations point the same way. Deep Origin's DO Challenge reported an agentic system scoring 33.5 percent in a time-limited setup, near the top human expert's 33.6 percent and well above the best human team's 16.4 percent (limited evidence, company report). A separate preclinical-pharmacology benchmark, TxBench-PP, put its strongest agent configuration near 59 percent across 100 tasks (emerging evidence, preprint report). Different tasks, same message: competent on scoped, verifiable steps, unreliable across long chains. These are early, mostly non-peer-reviewed evaluations, so the levels should be read as directional rather than precise.
The lab is not the clinic, and the attrition math makes that unavoidable
The first peer-reviewed clinical proof-of-concept arrived in 2025. Rentosertib, formerly ISM001-055, is a TNIK inhibitor whose target and molecule both came from Insilico Medicine's generative AI platform. Its Phase 2a trial in idiopathic pulmonary fibrosis was published in Nature Medicine (moderate evidence, single small randomized controlled trial). Across 71 patients at 22 sites in China over 12 weeks, the primary safety endpoint was met, and the 60 mg once-daily arm showed a mean forced vital capacity change of +98.4 mL against a decline of 20.3 mL on placebo. That is a genuine milestone and the first peer-reviewed Phase 2a result for a molecule and target both produced by generative AI. It is also small, short, and single-country, and the authors call for larger and longer trials, so it should not be read as establishing efficacy.
The context that keeps this honest is the base rate. Across two decades of data, only about 13.8 percent of drugs that enter human trials reach approval, and in oncology the figure is closer to 3.4 percent (strong evidence, large peer-reviewed analysis). AI has shortened parts of early discovery, with reviews reporting preclinical timelines compressed by roughly a third (moderate evidence, secondary review). But the rate-limiting steps, namely clinical trial duration, patient enrollment, and regulatory review, are set by biology and policy, not by how fast a molecule was designed. As of early-2026 reviews, no fully AI-discovered drug had received FDA approval (same secondary review). Claims of end-to-end acceleration usually conflate faster discovery with faster development. They are not the same thing.
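The base rates above translate into simple arithmetic. The sketch below is a back-of-envelope illustration under a strong simplifying assumption that is not in the cited analysis: every candidate entering trials is treated as an independent draw with the same approval probability. The function names are hypothetical and chosen only for this example.

```python
def entrants_per_approval(approval_rate):
    """Expected number of clinical entrants per approved drug (1 / p).

    Assumes each candidate is an independent trial with the same
    approval probability, which is a simplification.
    """
    if not 0 < approval_rate <= 1:
        raise ValueError("approval_rate must be in (0, 1]")
    return 1 / approval_rate


def chance_of_at_least_one(approval_rate, n_candidates):
    """Probability that at least one of n independent candidates is approved."""
    return 1 - (1 - approval_rate) ** n_candidates


overall = entrants_per_approval(0.138)    # all indications
oncology = entrants_per_approval(0.034)   # oncology only
five_shots = chance_of_at_least_one(0.138, 5)

print(f"entrants per approval, overall:  {overall:.1f}")
print(f"entrants per approval, oncology: {oncology:.1f}")
print(f"chance of >=1 approval from 5 entrants: {five_shots:.2f}")
```

Under that assumption, roughly seven candidates enter trials for every approval overall, and roughly 29 in oncology. Designing a molecule faster changes how quickly a candidate reaches the starting line, not these odds.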
The workflow gap is an architecture problem, not only a model problem
The benchmark's clearest lesson has a direct design implication. If long, dependent chains are where agents fail, and if supplying a plan rescues them, then the useful unit is not a single all-knowing model but a system that breaks the work into scoped steps, verifies each one, and keeps a human at the decision points. In practice that means checking intermediate results before they propagate, and instrumenting the plan so a wrong turn is caught early rather than compounding into a confident wrong answer. This is the case for multi-agent designs with explicit planning and human oversight, and it is the approach Curely takes to clinical work more broadly: grant autonomy in proportion to how well a step can be verified, and hold professional accountability at the points that matter. The same principle that separates a reliable discovery agent from an unreliable one separates a safe clinical assistant from a risky one.
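The design described above, scoped steps with a check after each one and a human at the decision points, can be sketched as a small plan runner. This is a minimal sketch under stated assumptions, not Curely's implementation or any benchmark's harness: the Step class, run_plan, the approve callback, and the toy compound records are all hypothetical names and data invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]       # the scoped unit of work
    verify: Callable[[Any], bool]   # cheap check on the intermediate result
    needs_human: bool = False       # True at decision points


class StepFailed(Exception):
    """Raised as soon as a step fails verification or review."""


def run_plan(steps: List[Step], initial: Any,
             approve: Callable[[str, Any], bool]) -> Tuple[Any, List[str]]:
    """Run steps in order, stopping at the first failed check.

    `approve` stands in for a human reviewer; it receives the step name
    and the intermediate result and returns True to continue.
    """
    value, completed = initial, []
    for step in steps:
        value = step.run(value)
        if not step.verify(value):
            raise StepFailed(f"{step.name}: verification failed")
        if step.needs_human and not approve(step.name, value):
            raise StepFailed(f"{step.name}: rejected by reviewer")
        completed.append(step.name)
    return value, completed


# Hypothetical toy data, invented for illustration only.
compounds = [
    {"id": "cmpd-1", "mol_weight": 320.0, "score": 0.71},
    {"id": "cmpd-2", "mol_weight": 612.0, "score": 0.93},
    {"id": "cmpd-3", "mol_weight": 410.0, "score": 0.84},
]

plan = [
    Step(
        "filter_by_weight",
        run=lambda cs: [c for c in cs if c["mol_weight"] <= 500],
        verify=lambda cs: len(cs) > 0 and all(c["mol_weight"] <= 500 for c in cs),
    ),
    Step(
        "rank_by_score",
        run=lambda cs: sorted(cs, key=lambda c: c["score"], reverse=True),
        verify=lambda cs: all(a["score"] >= b["score"] for a, b in zip(cs, cs[1:])),
        needs_human=True,  # a person signs off before the ranking is used
    ),
]

result, completed = run_plan(plan, compounds, approve=lambda name, value: True)
print([c["id"] for c in result], completed)
```

The choice worth noting is that a failed check raises immediately instead of passing a doubtful value downstream. That is the mechanical form of catching a wrong turn early: an empty filter result or an unsorted ranking stops the plan at the step that produced it.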
In resource-constrained systems, the binding constraint is data and tools, not model quality
An agent is only as good as the databases and tools it can reach, and most of those were built on data from high-income populations. That creates a specific problem for everyone else. Pharmacogenetic data describing how African populations metabolize common drugs is scarce, and few validated tools exist to predict it (limited evidence, early computational work). The H3D Centre at the University of Cape Town, Africa's first integrated drug discovery center, is building these missing pieces with the Ersilia Open Source Initiative, including transfer-learning models for African genetic variants in malaria and tuberculosis drug metabolism, and an automated screening pipeline, ZairaChem, designed to run in low-resource settings (limited to emerging evidence, peer-reviewed methods and early deployments).
The lesson for anyone deploying agentic tools in these settings is that the leverage is not a bigger model. It is whether the reference data and validated tools exist for your patient population, and whether they can be kept running on the infrastructure you have. Curely treats that as the central problem rather than an afterthought, because advanced healthcare intelligence is only useful where it reaches the people who need it.
Where the value is real now, and where it is not
For clinicians, administrators, and health-tech decision-makers weighing AI in discovery or adjacent analytics, the current evidence supports a narrow and defensible read. Agents add real value as tireless analysts for scoped, verifiable early-discovery tasks, namely structure reading, property calculation, database screening, and literature triage, and they perform best when a human sets the plan. They are not yet reliable at running long, autonomous workflows end to end, and they do not change the economics of clinical development. When a vendor claims otherwise, three questions usually separate signal from headline: which specific tasks were measured and against what ground truth, whether the workflow was long-horizon or single-step, and whether the underlying data and tools represent your patient population. The honest answers are usually more modest than the pitch, and more useful.
