
An AI agent, running a real research project.

We handed a Claude-powered agent a complete research problem in claw evolution and asked it to carry the whole arc: form hypotheses, pick comparative methods, run the analyses, draft the manuscript. We expect it to fail, and the failure is the point. Each breakdown teaches us how to design the next version. The goal is not a robot PI. It's an agent that works at our side as scientific collaborator, programmer, and an army of students and postdocs, doing meaningful work when the task is correctly scoped, tested, and examined.

Agent version: v0.7
Complete projects underway: 1
Model backbone: Claude
Openness: every prompt, log, and failure published
Task: claw evolution across Coleoptera, hypothesis through manuscript
Status: currently working through candidate hypotheses, cross-checking against the literature
Last human review: two days ago, structured critique of comparative-method choice
Failures logged: on track to publish; see the log once the project wraps

Why we're doing this

There is a lot of hype right now about AI agents doing science. Most of it is either demos on synthetic problems or benchmark tests on narrow tasks. Very little of it tells you whether an agent can carry a real evolutionary-biology project end to end. We wanted to know.

We also wanted a training corpus for the agent we actually want to build, which is not an autonomous scientist but a scientific collaborator. To design that, we need to understand where today's models break down on the specific shape of work we do. The only way to learn that is to push one until it breaks.

What the agent is doing

Problem: claw evolution in Coleoptera. A real open question in the lab, not a synthetic benchmark.

Beetle tarsal claws vary wildly in shape, and the functional correlates are only partially understood. The agent was handed the same problem a beginning graduate student would get: morphological variation, a growing literature, and a phylogeny. It has access to comparative-methods libraries, the lab's karyotype and trait databases, and the relevant primary literature via retrieval. It does not have access to co-authors.

Autonomy: the agent chooses the hypotheses, the methods, and the writeup. Humans review structure; they don't supply it.

We give the agent a research brief and a budget of attempts. It picks which specific hypotheses to prioritize, which comparative methods to apply, how to evaluate model fit, and how to structure the writeup. We review its plan and its outputs, but we are explicitly avoiding the trap of "pretending" autonomy by prompting our way to a pre-written paper.

Tooling: real tools, real data, real logs. Retrieval over the primary literature, R packages, our own databases, and a structured scratchpad.

The agent runs real code, produces real figures, and logs every prompt, response, and artifact. No memory wipes between stages. When it gets a result that doesn't make biological sense, we want the record to show whether it noticed.
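The append-only log described above could be as simple as one JSON line per agent interaction. A minimal sketch, assuming a JSONL file and field names of our own invention (the lab's actual schema is not published):

```python
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
import json

@dataclass
class LogEntry:
    """One agent interaction: the prompt, the response, and any artifacts produced."""
    stage: str       # hypothetical stage label, e.g. "hypothesis-selection", "analysis", "writeup"
    prompt: str
    response: str
    artifacts: list = field(default_factory=list)  # paths to generated figures, scripts, data
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def append_entry(log_path: str, entry: LogEntry) -> None:
    """Append one entry as a JSON line; the log is only ever appended to, never rewritten."""
    with open(log_path, "a") as f:
        f.write(json.dumps(asdict(entry)) + "\n")
```

An append-only line-per-record format makes the "no memory wipes" guarantee auditable: any reviewer can replay the project in order and check whether the agent flagged an implausible result at the moment it appeared.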

What we expect to fail

If a Claude-powered agent could run a complete evolutionary-biology project unassisted today, we would already know. Instead, we're predicting specific failure modes:

Failure mode 1: hypothesis selection under uncertainty. Picking something tractable instead of something right.

Agents are trained to complete tasks. A research project is not a task with a clear completion signal, especially at the hypothesis-selection stage. We expect the agent to gravitate toward hypotheses that are easy to test with available data rather than hypotheses most likely to be true.

Failure mode 2: biological plausibility checks. Running a model to completion even when the output is nonsense.

The agent can compute a statistic. We want to know if it notices when the statistic describes something biologically impossible, and whether it corrects course or publishes the number anyway.

Failure mode 3: novel-result discipline. Distinguishing what it found from what was already known.

Agents are trained to be confident. Evolutionary biology asks researchers to be precise about which part of a result is actually new. We expect this is where the draft manuscript will most obviously read as AI-written, and where corrective intervention will pay back the most.

Failure mode 4: scope discipline. Knowing when a project is done, versus sprawling into adjacent questions forever.

A graduate student learns this painfully over several years. We don't think an agent learns it at all without explicit structure. Watching where and how the scope escapes is probably the single most informative signal we'll get.

We are not trying to build a robot principal investigator. We are trying to learn, in public, what shape of agent is actually useful to a working scientist.

What "working" would look like

We are not measuring success by whether the agent publishes a paper. A bad paper published unassisted is worse than a good paper drafted with structured collaboration. The real measures:

If the project ends in a draft manuscript that we can turn into a joint submission with substantial human authorship, that's a win. If it ends in a documented record of what broke and why, that's also a win. The only losing outcome is a polished output that no one can verify.

How to follow along

Every prompt, every response, every generated artifact will be published when the project concludes. Expect a companion manuscript describing the methodology and the agent's behavior, intended as a case study for anyone else thinking about putting an agent on a real problem.
