Autonomous Karyotype Data Collector

Autonomous agent for building the Coleoptera karyotype database. It searches PubMed and CrossRef, triages papers, fetches PDFs, and extracts structured records using Claude AI.

Actively collecting 🪲 Coleoptera 🤖 Claude AI 📄 PubMed + CrossRef
Records extracted: 629+
Search terms: 730
Beetle families targeted: 144
Fields per record: 33
AI models in pipeline: 2
PubMed - active
CrossRef - active
Semantic Scholar - rate-limited, disabled
Google Scholar - legally ambiguous, disabled
Unpaywall open-access PDFs - active

How it works

The agent runs a four-stage loop. Taxa found in new extractions generate additional search queries, so coverage expands automatically.

1. 🔍 Search - PubMed + CrossRef, 730 query terms
2. Triage - Claude Haiku, likely / uncertain / skip
3. 📄 Fetch PDF - pdf_url → Unpaywall → CrossRef → abstract
4. 🧬 Extract - Claude Sonnet, structured JSON → CSV
🔄 Generate queries - new taxa feed new searches
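
The fetch stage is a fallback chain: each source is tried in order and the first one that yields content wins. A minimal sketch of that pattern follows. The function names and the stand-in resolvers are hypothetical and make no network calls; the agent's real resolvers are not shown in this document.

```python
from typing import Callable, Optional

# A resolver takes a paper record and returns content, or None if it has nothing.
Resolver = Callable[[dict], Optional[str]]


def fetch_with_fallback(
    paper: dict, resolvers: list[tuple[str, Resolver]]
) -> tuple[str, Optional[str]]:
    """Try each source in order; return (source_name, content) for the first hit."""
    for name, resolve in resolvers:
        try:
            content = resolve(paper)
        except Exception:
            content = None  # one failing source should not abort the cascade
        if content:
            return name, content
    return "none", None


# Stand-in resolvers (hypothetical): real ones would query the respective services.
def from_pdf_url(paper: dict) -> Optional[str]:
    return paper.get("pdf_url")


def from_unpaywall(paper: dict) -> Optional[str]:
    return None


def from_crossref(paper: dict) -> Optional[str]:
    return None


def from_abstract(paper: dict) -> Optional[str]:
    return paper.get("abstract")


# Order mirrors the cascade described above.
CASCADE = [
    ("pdf_url", from_pdf_url),
    ("unpaywall", from_unpaywall),
    ("crossref", from_crossref),
    ("abstract", from_abstract),
]

source, content = fetch_with_fallback({"abstract": "2n = 20, Xyp"}, CASCADE)
```

Returning the source name alongside the content makes it straightforward to record whether a row came from full text or only from an abstract.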

The loop runs until 10,000 records are collected, all 730 search terms are exhausted, or five consecutive rounds return no new papers. State is persisted to disk and the agent resumes cleanly.
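
The three stop conditions can be expressed as a single check on the agent's state. The sketch below is illustrative only: `AgentState` and its field names are assumptions, not the agent's actual data model, while the thresholds are the ones stated above.

```python
from dataclasses import dataclass

MAX_RECORDS = 10_000   # record target stated in the text
MAX_EMPTY_ROUNDS = 5   # consecutive rounds without new papers


@dataclass
class AgentState:
    """Hypothetical state container; field names are assumptions."""
    records_collected: int = 0
    pending_terms: int = 730
    empty_rounds: int = 0  # consecutive rounds that returned no new papers


def record_round(state: AgentState, new_papers: int) -> None:
    """Reset the empty-round counter on any new paper, otherwise increment it."""
    state.empty_rounds = 0 if new_papers > 0 else state.empty_rounds + 1


def should_stop(state: AgentState) -> bool:
    """True once any of the three stop conditions holds."""
    return (
        state.records_collected >= MAX_RECORDS
        or state.pending_terms == 0
        or state.empty_rounds >= MAX_EMPTY_ROUNDS
    )
```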

Key capabilities

Handles old monographs, Russian-language journals, and multi-species tables.

🤖 Dual-model AI pipeline

Claude Haiku handles triage at roughly 10x lower cost than Sonnet. Claude Sonnet does the full extraction, producing validated JSON with a confidence score for every record.

📖 Domain-aware extraction

Every extraction prompt includes a 330-line cytogenetics guide encoding sex chromosome notation, B chromosome rules, ploidy conventions, and family-specific notes for 11 priority beetle families.

🌐 Multi-language, any era

The agent processes papers in Russian, Japanese, Spanish, Portuguese, and German from the 1930s onward. Soviet cytogenetics literature from the 1950s through 1980s is a priority target given how underrepresented it is in modern databases.

📥 Priority PDF inbox

Place any PDF into priority_pdfs/ and it gets processed immediately at full extraction quality, bypassing triage. Old monographs, scanned journals, and personal copies all work. Processed files move to done/ automatically.
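
A minimal sketch of the inbox behavior, assuming that done/ lives inside priority_pdfs/ (the text does not say where it sits). `process_inbox` and the `handle_pdf` callback are hypothetical names standing in for the agent's own extraction step.

```python
import shutil
from pathlib import Path
from typing import Callable


def process_inbox(inbox: Path, handle_pdf: Callable[[Path], None]) -> list[str]:
    """Run handle_pdf on every PDF in the inbox, then move the file to inbox/done/."""
    done = inbox / "done"
    done.mkdir(parents=True, exist_ok=True)  # also creates the inbox if missing
    processed = []
    for path in sorted(inbox.iterdir()):
        # Only top-level files are considered, so done/ is never rescanned.
        if not path.is_file() or path.suffix.lower() != ".pdf":
            continue
        handle_pdf(path)  # full extraction, no triage step
        shutil.move(str(path), str(done / path.name))
        processed.append(path.name)
    return processed
```

A call such as `process_inbox(Path("priority_pdfs"), extract)` would then drain the folder, with `extract` being whatever function performs the extraction.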

🛡️ Automatic data quality flags

Records are flagged for review when 2n falls outside the expected range, the sex chromosome system conflicts with known family patterns, the paper is a secondary source, or confidence falls below 0.75.
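
The four flagging rules could be combined roughly as follows. This is a sketch under stated assumptions: the lookup tables are passed in by the caller, the `is_secondary_source` key is hypothetical, and the example family and its ranges are placeholders rather than curated cytogenetic data.

```python
CONFIDENCE_THRESHOLD = 0.75  # threshold stated in the text


def review_reasons(
    record: dict,
    expected_2n: dict[str, tuple[int, int]],
    known_sex_systems: dict[str, set[str]],
) -> list[str]:
    """Return the reasons a record should be reviewed; an empty list means no flag."""
    reasons = []
    family = record.get("family")

    two_n = record.get("chromosome_number_2n")
    if two_n is not None and family in expected_2n:
        low, high = expected_2n[family]
        if not low <= two_n <= high:
            reasons.append("2n outside expected range")

    system = record.get("sex_chr_system")
    if system and family in known_sex_systems:
        if system not in known_sex_systems[family]:
            reasons.append("sex chromosome system conflicts with family pattern")

    if record.get("is_secondary_source"):  # hypothetical key
        reasons.append("secondary source")

    confidence = record.get("extraction_confidence")
    if confidence is None or confidence < CONFIDENCE_THRESHOLD:
        reasons.append("confidence below threshold")

    return reasons


# Placeholder reference data, for illustration only.
example_ranges = {"ExampleFamily": (16, 40)}
example_systems = {"ExampleFamily": {"Xyp", "XY"}}
example_record = {
    "family": "ExampleFamily",
    "chromosome_number_2n": 48,
    "sex_chr_system": "X0",
    "extraction_confidence": 0.9,
}
reasons = review_reasons(example_record, example_ranges, example_systems)
flag_for_review = bool(reasons)
```

Collecting reasons rather than a bare boolean keeps the cause of each flag available to the reviewer.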

📊 Live dashboard + Alex AI

A local web dashboard at port 8787 shows record counts, family coverage, confidence distributions, and the agent log. An embedded Claude Haiku assistant (Alex) can start, stop, and restart the agent and edit the extraction guide through a chat interface.

Output schema

Each record is a row in results.csv with 33 fields covering paper metadata, taxonomy, core karyotype data, cytogenetic methods, collection provenance, and extraction quality.

| Field | Group | Description |
|---|---|---|
| doi, paper_title, paper_authors, paper_year, paper_journal | Paper | Source publication metadata |
| species, family, genus | Taxonomy | Species as given in paper; family updated to current classification |
| chromosome_number_2n | Karyotype | Standard diploid complement (Bs excluded) |
| n_haploid | Karyotype | Haploid number from meiotic preparations |
| sex_chr_system | Karyotype | Recorded verbatim - XY, Xyp, X0, neo-XY, X₁X₂Y, ZW, etc. |
| sex_of_specimen | Karyotype | male / female / both (separate rows when 2n differs by sex) |
| karyotype_formula | Karyotype | Verbatim formula, e.g. 2n = 18A + XY |
| haploid_autosome_count | Karyotype | (2n − sex chr count) / 2 - the A value for comparative work |
| chromosome_morphology | Karyotype | M / SM / ST / T classes; verbatim formula if given |
| ploidy, b_chromosomes | Karyotype | Explicit ploidy level; B count range e.g. "0–2" |
| staining_method, NOR_position, heterochromatin_pattern | Methods | Cytogenetic technique; nucleolus organizer location; C-banding pattern |
| collection_locality, voucher_info, collection_year, number_of_specimens | Provenance | Geographic and specimen-level metadata when reported |
| extraction_confidence | Quality | 0–1 score from the extraction model; records below 0.75 are discarded |
| flag_for_review | Quality | True when 2n is unusual, data conflicts with known records, or confidence is borderline |
| source_type, notes, pdf_url, processed_date | Meta | full_text / abstract; extraction notes; PDF link; timestamp |
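
The haploid_autosome_count formula from the table, (2n − sex chr count) / 2, can be checked with a short function. This is a sketch: the caller supplies the number of sex chromosomes, since the mapping from a verbatim sex_chr_system string to a count is not specified here.

```python
def haploid_autosome_count(two_n: int, sex_chr_count: int) -> int:
    """Compute (2n - sex chromosome count) / 2, rejecting inconsistent inputs."""
    autosomes = two_n - sex_chr_count
    if autosomes < 0 or autosomes % 2 != 0:
        raise ValueError(
            f"2n={two_n} with {sex_chr_count} sex chromosome(s) "
            "does not leave an even number of autosomes"
        )
    return autosomes // 2


# 2n = 18A + XY: 20 chromosomes, two of them sex chromosomes
xy_male = haploid_autosome_count(20, 2)   # 9
# X0 male with the same autosome complement: 19 chromosomes, one sex chromosome
x0_male = haploid_autosome_count(19, 1)   # 9
```

Both calls return the same value, which is what makes this field useful for comparing species that differ only in their sex chromosome system.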

Taxonomic scope

All four Coleoptera suborders, with 11 priority families searched first and most thoroughly.

Priority families
Curculionidae
Chrysomelidae
Cerambycidae
Carabidae
Scarabaeidae
Tenebrionidae
Coccinellidae
Staphylinidae
Elateridae
Buprestidae
Lucanidae

Running the agent

The agent runs from the command line, either fully autonomously or in targeted test modes.

    # Full autonomous run
    python agent.py

    # Stop after 50 papers (testing)
    python agent.py --limit 50

    # Verify cascade on a known good DOI
    python agent.py --test-paper 10.3897/compcytogen.v10i3.9504

    # Launch live dashboard at localhost:8787
    python dashboard.py

State is persisted in JSON files, so the agent resumes cleanly after an interruption. The PDF cache is keyed by a hash of the DOI and is safe to delete to force re-downloads.
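
A sketch of how a DOI-keyed cache path and JSON state files might look. The choice of SHA-256, the `.pdf` filename layout, and the function names are all assumptions; the text specifies only that the cache is keyed by a DOI hash and that state is stored as JSON.

```python
import hashlib
import json
import os
import tempfile
from pathlib import Path


def cache_path(doi: str, cache_dir: Path) -> Path:
    """Map a DOI to a cache filename. DOIs are case-insensitive, so normalize first."""
    key = hashlib.sha256(doi.strip().lower().encode("utf-8")).hexdigest()
    return cache_dir / f"{key}.pdf"


def save_state(state: dict, path: Path) -> None:
    """Write state to a temp file, then rename, so a crash never leaves a partial file."""
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(state, handle, indent=2)
    os.replace(tmp_name, path)


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty dict on a first run."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

Because the cache filename depends only on the DOI, deleting the cache directory loses nothing that cannot be fetched again.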

Background and motivation

Coleoptera has roughly 400,000 described species, but cytogenetic data are scattered across hundreds of journals in multiple languages spanning nearly a century of publications. Compiling this literature by hand is a multi-year undertaking.

The existing Coleoptera Karyotype Database (4,960 records from 252 papers, compiled through 2022) was built by hand. This agent is designed to continue that work: reading papers, extracting structured records, and flagging ambiguous cases for review.

The project is part of the lab's broader AI research program focused on autonomous agents for data-intensive biology.