
TraitTrawler

An autonomous AI agent pipeline that builds structured, citation-grounded scientific databases from the primary literature. Point it at a taxon and a trait: it finds the papers, fetches the PDFs, extracts the data, and verifies every claim against the source before writing a single row. No domain-specific logic lives in the pipeline; all of it lives in per-project validator files that you approve.

v6.1 current version · 8 specialist subagents · 15 deterministic Python scripts · first karyotype run 2024
The central guarantee: no row is written unless the claim appears verbatim in the source PDF. Grounding is a protocol invariant enforced by a deterministic Python script, not a best-effort field.

The problem with AI literature extraction isn't hallucination in the obvious sense: the model confidently reporting something no paper ever said. The subtler failure is correct-sounding values that are on the right page but in the wrong row of a table, or plausible numbers assembled from two different papers. TraitTrawler's answer is a hard gate: before any extracted claim reaches the output CSV, a deterministic Python script checks that a verbatim quote from the extraction literally appears on the cited page of the SHA256-hashed source PDF. If the quote isn't there, the row is dropped. There is no fallback. Grounding is enforced, not hoped for.
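In outline, the gate can be as small as a whitespace-normalized substring check plus a file hash. The function names and matching rules below are an illustrative sketch, not TraitTrawler's actual script:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Collapse whitespace and case so PDF line breaks don't defeat matching."""
    return re.sub(r"\s+", " ", text).strip().lower()

def quote_is_grounded(quote: str, page_text: str) -> bool:
    """True only if the verbatim quote appears on the cited page's text."""
    return normalize(quote) in normalize(page_text)

def sha256_file(path: str) -> str:
    """Hash the source PDF so the audit trail pins the exact file checked."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()
```

If `quote_is_grounded` returns False, the row is dropped; there is no retry path that bypasses the check.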

Behind the grounding gate sits a second layer: every project ships with user-approved Python validator files (domain-specific range checks, arithmetic consistency rules, and notation validators) that also run on every row. These are proposed by the pipeline after reading seed papers, vetted by a static linter that blocks any unsafe code before you ever see them, and only activated after your approval. A row that passes grounding but fails a validator is routed to a review queue rather than written to the CSV.
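A per-project validator might look like the following sketch. The field name and numeric bounds are assumptions chosen for illustration, not the real karyotype rules:

```python
def validate_diploid_count(row: dict) -> tuple[bool, str]:
    """Domain-specific range and notation check; a failing row goes to the
    review queue, not the CSV. Field name and bounds are illustrative."""
    raw = row.get("diploid_number")
    if raw is None:
        return False, "diploid_number missing"
    try:
        n = int(raw)
    except ValueError:
        return False, f"non-numeric diploid_number: {raw!r}"
    if not 2 <= n <= 600:  # generous bound chosen for the example
        return False, f"2n={n} outside plausible range"
    if n % 2:
        return False, f"2n={n} is odd; check the notation in the source"
    return True, "ok"
```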

The audit trail behind each accepted row is publishable: source PDF hash, page number, verbatim quote, model versions, and validator verdicts are written to an append-only ledger file in Darwin Core + PAV + PROV-O format.
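An append-only ledger entry can be one JSON object per line. The field names below only gesture at the Darwin Core / PAV / PROV-O terms; the real vocabulary is richer:

```python
import json

def append_ledger(path: str, record: dict) -> None:
    """Append one JSON object per line and never rewrite earlier entries."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, sort_keys=True) + "\n")

record = {  # illustrative field set, not the exact vocabulary
    "dwc:measurementType": "diploid chromosome number",
    "dwc:measurementValue": "2n = 24",
    "prov:wasGeneratedBy": "extractor-subagent",
    "source_sha256": "<sha256 of the source PDF>",
    "page": 4,
    "quote": "The diploid number was 2n = 24.",
    "validator_verdicts": {"range_check": "ok"},
}
```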

Trait-agnostic by design: zero domain-specific logic lives in the pipeline core. All trait knowledge lives in per-project files the pipeline writes and you approve.

Most automated extraction systems are built around a specific domain: a tool for extracting clinical trial endpoints, or plant phenology records, or chemical properties. TraitTrawler's architecture enforces the opposite: the core pipeline contains no karyotype logic, no chromosome logic, no biology logic at all. When you start a project, the pipeline reads 5–10 seed papers for your trait, writes a trait profile summarizing how values are reported in the literature, proposes an output schema, and proposes a set of domain-specific validators. You approve each validator individually. Then the pipeline runs.
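A proposed output schema can be as simple as a field-spec table that a deterministic check enforces. Every name here is a hypothetical stand-in for what the pipeline would actually propose from your seed papers:

```python
PROPOSED_SCHEMA = {  # hypothetical proposal for a karyotype project
    "taxon":          {"type": str, "required": True},
    "diploid_number": {"type": int, "required": True},
    "page":           {"type": int, "required": True},
    "quote":          {"type": str, "required": True},
    "locality":       {"type": str, "required": False},
}

def schema_errors(row: dict) -> list[str]:
    """Return every violation; an empty list means the row conforms."""
    errors = []
    for field, spec in PROPOSED_SCHEMA.items():
        if field not in row:
            if spec["required"]:
                errors.append(f"missing required field: {field}")
        elif not isinstance(row[field], spec["type"]):
            errors.append(f"{field} should be {spec['type'].__name__}")
    return errors
```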

The same pipeline that assembled our Coleoptera karyotype database works for avian body mass, plant flowering date, drug dosing from case reports, or materials-science thermal properties. The contribution is the architecture. Spinning up a new trait domain requires seed papers and an approval session; no code changes to the pipeline.

This is enforced architecturally, not by convention. The agnostic hooks (grounding, schema validity, deduplication, GBIF taxonomy) run on every project. The trait-specific hooks run only on the project that generated them. They are kept in separate files and loaded separately. A hook that somehow escaped its project could not corrupt another project's data.
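One way to get that isolation is to load trait hooks only from the project's own directory. This loader is a sketch of the idea, not the pipeline's actual mechanism:

```python
import importlib.util
from pathlib import Path

def load_project_hooks(project_dir: str) -> list:
    """Load hooks only from this project's hooks/ directory. A hook defined
    in another project is simply never on this project's load path."""
    hooks = []
    for py in sorted(Path(project_dir, "hooks").glob("*.py")):
        spec = importlib.util.spec_from_file_location(py.stem, py)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        hooks.extend(getattr(module, "HOOKS", []))
    return hooks
```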

How it works: an 8-phase state machine with deterministic gates at each step. A Manager subagent orchestrates specialists: triage on Haiku, extraction on Opus, verification on Sonnet.

The pipeline runs as an 8-phase state machine inside Claude Code. A Manager subagent orchestrates specialist subagents, each running on the model appropriate to its task: Haiku 4.5 for fast relevance triage, Opus 4.7 at maximum effort for extraction and final adjudication, Sonnet 4.6 for semantic verification. The Manager stays lean; it never reads PDFs directly, only tracks state and dispatches subagents. This keeps the main context from filling across 500-paper runs.
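The role-to-model routing described above amounts to a small dispatch table. The model identifiers are taken from the text; the function shape is an assumption:

```python
MODEL_FOR_ROLE = {  # roles and models as described in the text
    "triage": "haiku-4.5",
    "extraction": "opus-4.7",
    "semantic_verification": "sonnet-4.6",
    "adjudication": "opus-4.7",
}

def dispatch(role: str, paper_id: str) -> dict:
    """The Manager only tracks state and hands work out: it records which
    subagent got which paper, and never opens a PDF itself."""
    return {"paper": paper_id, "role": role, "model": MODEL_FOR_ROLE[role]}
```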

Within the extraction loop, each paper goes through: triage → extraction (text + image at 2576px for table and figure pages) → grounding verification → semantic verification (with escalation to an Opus advisor for ambiguous claims) → schema validation → project-specific hooks → optional adjudication. Only papers that clear every gate write to the CSV. The rest are logged with reasons.
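The per-paper loop reduces to a chain of gates where any failure stops the record with a logged reason. The gate functions here are placeholders standing in for the real checks:

```python
def run_gates(record: dict, gates) -> tuple[bool, str]:
    """Run each gate in order; the first failure drops the record with a
    reason, and only a full pass writes to the CSV."""
    for gate in gates:
        ok, reason = gate(record)
        if not ok:
            return False, f"{gate.__name__}: {reason}"
    return True, "accepted"

# Placeholder gates standing in for grounding, schema, and project hooks.
def grounding_gate(rec):
    return rec.get("quote_found", False), "quote not found on cited page"

def schema_gate(rec):
    return "value" in rec, "missing required field"

verdict = run_gates({"quote_found": True, "value": "2n = 24"},
                    [grounding_gate, schema_gate])
```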

The pipeline narrates at declared checkpoints and runs autonomously between them. You can leave it running and come back to a partially completed database, a review queue of flagged records, and a session report summarizing coverage and failure patterns.

Bootstrap: bring your existing curated dataset and it becomes ground truth. The pipeline ingests existing CSVs as human-validated anchors and derives soft validators from their distributions.

Most labs starting a trait database already have years of manually curated data. TraitTrawler treats this as a first-class input. You supply your curated CSV (and optionally the paired PDFs), and the bootstrap subagent ingests each row as human-validated ground truth: SHA256-hashed into the manifest, marked ValidatedByHuman in Darwin Core provenance, written to the ledger, and treated as the anchor the AI extractor will not re-extract unless you explicitly request it.
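Ingesting a curated CSV as ground truth is mostly bookkeeping: hash, mark, and ledger each row. The provenance field names below are illustrative, not the actual Darwin Core placement:

```python
import csv

def ingest_curated_rows(csv_path: str):
    """Yield each curated row tagged as human-validated ground truth, so the
    extractor treats it as an anchor rather than something to re-extract."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            row["provenance"] = "ValidatedByHuman"  # illustrative field name
            row["re_extract"] = "no"
            yield row
```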

The bootstrap also derives soft validators from your curated data: range checks and categorical constraints generated from the actual distributions in your data, with 20% padding so novel-but-correct values are flagged for review rather than rejected. These are proposed to you for approval exactly like any other validator.
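Deriving a soft range validator with 20% padding is straightforward. This sketch assumes numeric values and flags, rather than rejects, anything outside the padded range:

```python
def derive_soft_range(values, padding=0.20):
    """Build a range check from the curated distribution, padded so
    novel-but-correct values are flagged for review, not rejected."""
    lo, hi = min(values), max(values)
    pad = padding * (hi - lo)
    lo_pad, hi_pad = lo - pad, hi + pad

    def check(value):
        if lo_pad <= value <= hi_pad:
            return "pass", ""
        return "review", f"{value} outside [{lo_pad:.1f}, {hi_pad:.1f}]"
    return check
```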

When we built the karyotype database, we had nearly three decades of manually compiled records in our existing datasets. The bootstrap turned that prior work into a head start rather than something to throw away.

Where it came from: built to collect karyotype data, generalized to anything. Version one was a cytogenetics-specific agent. Version six is domain-agnostic.

The first version of this pipeline was purpose-built to collect diploid chromosome counts from the entomological literature, a specific task with known notation conventions, a known journal set, and an existing curated dataset to bootstrap from. It worked well enough that we ran it for a semester's worth of literature, and the output is our CUREs karyotype database.

The problem with version one was that all the trait knowledge was hardcoded. The chromosome-count range checks, the notation conventions, the taxonomy validation rules: all of it lived in the pipeline itself. Adapting it to a different trait meant rewriting core logic. Version six breaks that coupling entirely. The trait knowledge is now learned per-project and stored in files outside the pipeline. The pipeline is the same regardless of what you point it at.

The karyotype database is still the best proof that the architecture works. It is also no longer the reason to use it.
