AI Agents in Drug Discovery, Strong on the Science and Shaky on the Workflow, Curely AI

In 2026, AI agents can perform much of the early science of drug discovery competently, and they still fail at stringing it together. The frontier problem is not knowledge. Current coding agents recognize the right protein structures, know the right databases, and execute individual cheminformatics steps well. What they lack is the ability to plan a long, dependent workflow and hold every constraint from the first step to the last. And for health systems working with limited resources, a second constraint matters more than model quality, which is whether the data and tools an agent depends on exist for your patients at all. This is a review of what the current benchmarks and the first clinical readouts actually show, and what that means for anyone deciding where to place trust and budget.

What an AI agent does in drug discovery, and what it does not

An AI agent in this context is not a chatbot answering questions. It plans a multi-step workflow, calls specialized scientific tools such as structure parsers, sequence editors, cheminformatics libraries, and database clients, often writes and runs its own code, and works toward a goal across many steps. In benchmark settings these agents typically run inside an environment like Biomni, an open biomedical agent framework from Stanford's SNAP lab that exposes a few hundred functions the agent calls like any Python library. The agent decides which functions to use, whether to lean on the library or write fresh code, and how many steps to take. That autonomy is the point, and it is also where the difficulty lives.

The best current read on agent skill says one thing three ways

The most detailed public evaluation to date is Scale Labs' DrugDiscoveryBench, which tested three frontier coding agents, Claude Code (Opus 4.7), Codex (GPT-5.5), and Gemini CLI (Gemini 3.1 Pro), on 66 expert-curated, verifiable tasks spanning target identification, hit discovery, hit-to-lead, and lead optimization (emerging evidence, early internal report whose numbers are expected to move). Mean outcome accuracy across the tasks all three answered was 46 percent, 62 percent, and 53 percent respectively. On 43 of the 66 tasks at least one agent produced a correct answer; on the remaining 23 none did.

The pattern inside those numbers is the useful part. Agents were most reliable on chemistry and structure tasks, which tend to be deterministic with few decision points between the data and the answer. They struggled most on long retrieval and biology chains that string many database queries and filters together. One worked example makes the failure concrete. Asked for a melanoma protein marker, all three agents ranked candidate genes by their total pathogenic-variant count and returned genes such as BRCA2 or PTEN, when the answer is CDKN2A. They dropped the melanoma scope at the final counting step, a planning slip rather than a gap in biological knowledge. The decisive finding supports that reading: when the agents were handed an expert's method, meaning the sequence of steps and which tools to use but not the answer, most previously unsolved tasks became solvable. The bottleneck is high-level planning, not the underlying science.
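The melanoma slip is easy to reproduce in miniature. The sketch below is a toy illustration, not the benchmark's actual task or data: the gene names come from the example above, but the record layout, the conditions, and every count are invented assumptions chosen only to show how dropping a scope filter at the final counting step changes the answer.

```python
from collections import Counter

# Hypothetical toy records: (gene, associated condition, pathogenic-variant count).
# All counts are invented for illustration and are NOT real database values.
RECORDS = [
    ("BRCA2", "breast cancer", 900),
    ("BRCA2", "melanoma", 5),
    ("PTEN", "Cowden syndrome", 400),
    ("PTEN", "melanoma", 10),
    ("CDKN2A", "melanoma", 120),
    ("CDKN2A", "pancreatic cancer", 30),
]


def top_gene(records, condition=None):
    """Return the gene with the most pathogenic variants.

    If `condition` is given, only records for that condition are counted;
    if it is None, every record is counted regardless of disease.
    """
    totals = Counter()
    for gene, cond, count in records:
        if condition is None or cond == condition:
            totals[gene] += count
    return totals.most_common(1)[0][0]


# Scope dropped at the counting step: ranks by total variants across all diseases.
unscoped = top_gene(RECORDS)

# Scope carried through to the final step: ranks by melanoma variants only.
scoped = top_gene(RECORDS, condition="melanoma")

print("unscoped:", unscoped)
print("scoped:  ", scoped)
```

With these made-up numbers the unscoped ranking returns BRCA2 and the scoped ranking returns CDKN2A. Both calls use the same data and the same counting logic; the only difference is whether the constraint set at the start of the task survives to the last step.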

Two other evaluations point the same way. Deep Origin's DO Challenge reported an agentic system scoring 33.5 percent in a time-limited setup, near the top human expert's 33.6 percent and well above the best human team's 16.4 percent (limited evidence, company report). A separate preclinical-pharmacology benchmark, TxBench-PP, put its strongest agent configuration near 59 percent across 100 tasks (emerging evidence, preprint report). Different tasks, same message: competent on scoped, verifiable steps, unreliable across long chains. These are early, mostly non-peer-reviewed evaluations, so the levels should be read as directional rather than precise.

The lab is not the clinic, and the attrition math makes that unavoidable

The first peer-reviewed clinical proof-of-concept arrived in 2025. Rentosertib, formerly ISM001-055, is a TNIK inhibitor whose target and molecule both came from Insilico Medicine's generative AI platform. Its Phase 2a trial in idiopathic pulmonary fibrosis was published in Nature Medicine (moderate evidence, single small randomized controlled trial). Across 71 patients at 22 sites in China over 12 weeks, the primary safety endpoint was met, and the 60 mg once-daily arm showed a mean forced vital capacity change of +98.4 mL against a decline of 20.3 mL on placebo. That is a genuine milestone: the first peer-reviewed Phase 2a result for a drug whose target and molecule were both produced by generative AI. It is also small, short, and single-country, and the authors call for larger and longer trials, so it should not be read as establishing efficacy.

The context that keeps this honest is the base rate. Across two decades of data, only about 13.8 percent of drugs that enter human trials reach approval, and in oncology closer to 3.4 percent (strong evidence, large peer-reviewed analysis). AI has shortened parts of early discovery, with reviews reporting preclinical timelines compressed by roughly a third (moderate evidence, secondary review). But the rate-limiting steps, which are clinical trial duration, patient enrollment, and regulatory review, are set by biology and policy, not by how fast a molecule was designed. As of early-2026 reviews, no fully AI-discovered drug had received FDA approval (same secondary review). Claims of end-to-end acceleration usually conflate faster discovery with faster development. They are not the same thing.
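The arithmetic behind that distinction can be made explicit. In the rough sketch below, the approval rates (13.8 percent and 3.4 percent) and the one-third preclinical compression come from the figures above. The stage durations of 4.5 years preclinical and 9 years for clinical trials plus regulatory review are hypothetical placeholders, not sourced estimates, and the function names are illustrative only.

```python
def entrants_per_approval(approval_rate):
    """Expected number of drugs that must enter human trials per approval."""
    return 1.0 / approval_rate


def fraction_of_total_time_saved(preclinical_years, clinical_years, preclinical_cut):
    """Share of the whole timeline saved when only the preclinical stage is shortened."""
    saved = preclinical_years * preclinical_cut
    return saved / (preclinical_years + clinical_years)


# Approval rates quoted in the text.
overall = entrants_per_approval(0.138)   # about 7.2 entrants per approval
oncology = entrants_per_approval(0.034)  # about 29.4 entrants per approval

# ASSUMPTION: 4.5 and 9 years are hypothetical durations chosen for illustration.
saved = fraction_of_total_time_saved(
    preclinical_years=4.5,
    clinical_years=9.0,
    preclinical_cut=1 / 3,  # "roughly a third", per the reviews cited above
)

print(f"entrants per approval, overall:  {overall:.1f}")
print(f"entrants per approval, oncology: {oncology:.1f}")
print(f"share of total timeline saved:   {saved:.0%}")
```

Under these assumed durations, cutting preclinical work by a third trims about 11 percent from the total timeline and leaves the approval odds untouched. The exact share depends entirely on the placeholder durations, but the shape of the result does not: speeding up the stage that is not rate-limiting moves the total only modestly.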

The workflow gap is an architecture problem, not only a model problem

The benchmark's clearest lesson has a direct design implication. If long, dependent chains are where agents fail, and if supplying a plan rescues them, then the useful unit is not a single all-knowing model but a system that breaks the work into scoped steps, verifies each one, and keeps a human at the decision points. In practice that means checking intermediate results before they propagate, and instrumenting the plan so a wrong turn is caught early rather than compounding into a confident wrong answer. This is the case for multi-agent designs with explicit planning and human oversight, and it is the approach Curely takes to clinical work more broadly: grant autonomy in proportion to how well a step can be verified, and hold professional accountability at the points that matter. The same principle that separates a reliable discovery agent from an unreliable one separates a safe clinical assistant from a risky one.
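One way to picture that design is a plan made of small steps, each paired with its own check, with a human sign-off wherever a step is flagged. The following is a minimal sketch under stated assumptions, not Curely's implementation or any benchmark's harness: `Step`, `StepFailed`, `run_plan`, and the toy data are all hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Step:
    name: str
    run: Callable[[Any], Any]      # does the scoped piece of work
    verify: Callable[[Any], bool]  # checks the intermediate result
    needs_human: bool = False      # True where a person must sign off


class StepFailed(Exception):
    """Raised as soon as a step fails its check or is rejected by a human."""


def run_plan(steps: List[Step], initial: Any,
             approve: Callable[[str, Any], bool]) -> Tuple[Any, List[str]]:
    """Run steps in order, stopping at the first failed check or rejection."""
    value = initial
    completed = []
    for step in steps:
        value = step.run(value)
        if not step.verify(value):
            raise StepFailed(f"{step.name}: verification failed")
        if step.needs_human and not approve(step.name, value):
            raise StepFailed(f"{step.name}: rejected at human checkpoint")
        completed.append(step.name)
    return value, completed


# Toy plan: keep only in-scope rows, then pick the top-scoring one.
steps = [
    Step("filter_to_scope",
         run=lambda rows: [r for r in rows if r["scope"] == "in"],
         verify=lambda rows: len(rows) > 0),
    Step("pick_top",
         run=lambda rows: max(rows, key=lambda r: r["score"]),
         verify=lambda row: row["scope"] == "in",
         needs_human=True),
]
rows = [
    {"id": "a", "scope": "out", "score": 9},
    {"id": "b", "scope": "in", "score": 4},
]

# A stand-in approver that always accepts; in practice this is a person.
result, completed = run_plan(steps, rows, approve=lambda name, value: True)
print(result["id"], completed)
```

The design choice worth noting is that the final step re-checks the scope constraint instead of trusting that an earlier step preserved it. Failing fast with a named step also means a wrong turn surfaces where it happened, not in the final answer.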

In resource-constrained systems, the binding constraint is data and tools, not model quality

An agent is only as good as the databases and tools it can reach, and most of those were built on data from high-income populations. That creates a specific problem for everyone else. Pharmacogenetic data describing how African populations metabolize common drugs is scarce, and few validated tools exist to predict it (limited evidence, early computational work). The H3D Centre at the University of Cape Town, Africa's first integrated drug discovery center, is building these missing pieces with the Ersilia Open Source Initiative, including transfer-learning models for African genetic variants in malaria and tuberculosis drug metabolism, and an automated screening pipeline, ZairaChem, designed to run in low-resource settings (limited to emerging evidence, peer-reviewed methods and early deployments).

The lesson for anyone deploying agentic tools in these settings is that the leverage is not a bigger model. It is whether the reference data and validated tools exist for your patient population, and whether they can be kept running on the infrastructure you have. Curely treats that as the central problem rather than an afterthought, because advanced healthcare intelligence is only useful where it reaches the people who need it.

Where the value is real now, and where it is not

For clinicians, administrators, and health-tech decision-makers weighing AI in discovery or adjacent analytics, the current evidence supports a narrow and defensible read. Agents add real value as tireless analysts for scoped, verifiable early-discovery tasks, namely structure reading, property calculation, database screening, and literature triage, and they perform best when a human sets the plan. They are not yet reliable at running long, autonomous workflows end to end, and they do not change the economics of clinical development. When a vendor claims otherwise, three questions usually separate signal from headline: which specific tasks were measured and against what ground truth, whether the workflow was long-horizon or single-step, and whether the underlying data and tools represent your patient population. The honest answers are usually more modest than the pitch, and more useful.
