Autonomous Karyotype Data Collector
An autonomous agent that builds the Coleoptera karyotype database. It searches PubMed and CrossRef, triages papers, fetches PDFs, and extracts structured records using Claude.
How it works
The agent runs a four-stage loop. Taxa found in new extractions generate additional search queries, so coverage expands automatically.
Search: 730 query terms
Triage: likely / uncertain / skip
Fetch: … → CrossRef → abstract
Extract: structured JSON → CSV
Expand: new searches
The loop stops when 10,000 records have been collected, all 730 search terms are exhausted, or five consecutive rounds return no new papers, whichever comes first. State is persisted to disk, so the agent resumes cleanly after an interruption.
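The feedback step and the three stopping conditions can be sketched as below. This is a minimal illustration, not the agent's actual code: `LoopState`, `expand_queries`, `should_stop`, and the `"<taxon> karyotype"` query template are all assumed names and formats. Only the numeric limits come from the description above.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set

MAX_RECORDS = 10_000    # record target stated above
TOTAL_TERMS = 730       # size of the search-term list stated above
MAX_EMPTY_ROUNDS = 5    # consecutive rounds with no new papers


@dataclass
class LoopState:
    """Hypothetical container for the counters the loop needs."""
    records_collected: int = 0
    terms_used: int = 0
    empty_rounds: int = 0


def expand_queries(records: Iterable[dict], seen: Set[str]) -> List[str]:
    """Turn taxa from new records into search queries not issued before.

    The query template is an assumption made for illustration.
    """
    new_queries: List[str] = []
    for record in records:
        for taxon in (record.get("genus"), record.get("family")):
            if not taxon:
                continue
            query = f"{taxon} karyotype"
            if query not in seen and query not in new_queries:
                new_queries.append(query)
    return new_queries


def should_stop(state: LoopState) -> Optional[str]:
    """Return the reason to stop, or None if the loop should continue."""
    if state.records_collected >= MAX_RECORDS:
        return "record target reached"
    if state.terms_used >= TOTAL_TERMS:
        return "search terms exhausted"
    if state.empty_rounds >= MAX_EMPTY_ROUNDS:
        return "no new papers"
    return None
```

Returning the reason rather than a bare boolean makes it easy to write the cause of termination to the agent log.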
Key capabilities
Handles old monographs, Russian-language journals, and multi-species tables.
Dual-model AI pipeline
Claude Haiku handles triage at roughly 10x lower cost than Sonnet. Claude Sonnet does the full extraction, producing validated JSON with a confidence score for every record.
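The two-model split amounts to a cheap gate in front of an expensive step. The sketch below shows only that routing; `triage` and `extract` are hypothetical callables standing in for the Haiku and Sonnet calls, and no real API is invoked. How the agent treats papers labelled "uncertain" is not stated here, so sending them on to extraction is an assumption exposed as a parameter.

```python
from typing import Callable, Iterable, List, Sequence

TRIAGE_LABELS = ("likely", "uncertain", "skip")  # labels used by the triage stage


def run_pipeline(
    papers: Iterable[dict],
    triage: Callable[[dict], str],          # stand-in for the cheap triage model
    extract: Callable[[dict], List[dict]],  # stand-in for the full extraction model
    extract_labels: Sequence[str] = ("likely", "uncertain"),  # assumption
) -> List[dict]:
    """Run the expensive extractor only on papers that pass triage."""
    records: List[dict] = []
    for paper in papers:
        label = triage(paper)
        if label not in TRIAGE_LABELS:
            raise ValueError(f"unexpected triage label: {label!r}")
        if label in extract_labels:
            records.extend(extract(paper))
    return records
```

Passing the models in as callables keeps the routing testable without network access.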
Domain-aware extraction
Every extraction prompt includes a 330-line cytogenetics guide encoding sex chromosome notation, B chromosome rules, ploidy conventions, and family-specific notes for 11 priority beetle families.
Multi-language, any era
The agent processes papers in Russian, Japanese, Spanish, Portuguese, and German from the 1930s onward. Soviet cytogenetics literature from the 1950s through 1980s is a priority target given how underrepresented it is in modern databases.
Priority PDF inbox
Place any PDF into priority_pdfs/ and it gets processed immediately at full extraction quality, bypassing triage. Old monographs, scanned journals, and personal copies all work. Processed files move to done/ automatically.
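A minimal sketch of the inbox behaviour, assuming that done/ is a subfolder of priority_pdfs/ and that extraction is a callable supplied by the caller. `process_priority_inbox` and `extract` are illustrative names, not the agent's own.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def process_priority_inbox(inbox: Path, extract: Callable[[Path], object]) -> List[str]:
    """Extract every PDF in the inbox, then move it to the done/ subfolder."""
    inbox = Path(inbox)
    done = inbox / "done"  # assumed location of done/
    done.mkdir(parents=True, exist_ok=True)

    processed: List[str] = []
    # glob is not recursive, so files already in done/ are not picked up again.
    for pdf in sorted(inbox.glob("*.pdf")):
        extract(pdf)  # full extraction, no triage step
        shutil.move(str(pdf), str(done / pdf.name))
        processed.append(pdf.name)
    return processed
```

A file is moved only after `extract` returns, so a PDF whose extraction raises an exception stays in the inbox and is retried on the next pass.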
Automatic data quality flags
Records are flagged for review when 2n falls outside the expected range, the sex chromosome system conflicts with known family patterns, the paper is a secondary source, or confidence falls below 0.75.
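The four flag conditions could be checked along these lines. This is a sketch under stated assumptions: the per-family lookup tables are passed in by the caller, `is_secondary_source` is a hypothetical field that is not part of the output schema, and only the 0.75 threshold comes from the text.

```python
from typing import Dict, List, Set, Tuple

CONFIDENCE_THRESHOLD = 0.75  # threshold stated above


def review_reasons(
    record: dict,
    expected_2n: Dict[str, Tuple[int, int]],  # family -> (low, high), supplied by caller
    known_sex_systems: Dict[str, Set[str]],   # family -> sex chromosome systems seen before
) -> List[str]:
    """Return the reasons a record should be reviewed (empty list = no flag)."""
    reasons: List[str] = []
    family = record.get("family")

    two_n = record.get("chromosome_number_2n")
    if two_n is not None and family in expected_2n:
        low, high = expected_2n[family]
        if not low <= two_n <= high:
            reasons.append("2n outside expected range")

    system = record.get("sex_chr_system")
    if system and family in known_sex_systems and system not in known_sex_systems[family]:
        reasons.append("sex chromosome system unusual for family")

    if record.get("is_secondary_source"):  # hypothetical field
        reasons.append("secondary source")

    confidence = record.get("extraction_confidence")
    if confidence is not None and confidence < CONFIDENCE_THRESHOLD:
        reasons.append("low confidence")

    return reasons
```

`flag_for_review` would then be `bool(review_reasons(...))`, and the reason strings could be written to the `notes` field to help the reviewer.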
Live dashboard + Alex AI
A local web dashboard served on port 8787 shows record counts, family coverage, confidence distributions, and the agent log. An embedded Claude Haiku assistant (Alex) can start, stop, and restart the agent and edit the extraction guide through a chat interface.
Output schema
Each record is a row in results.csv with 33 fields covering paper metadata, taxonomy, core karyotype data, cytogenetic methods, collection provenance, and extraction quality.
| Field | Group | Description |
|---|---|---|
| doi, paper_title, paper_authors, paper_year, paper_journal | Paper | Source publication metadata |
| species, family, genus | Taxonomy | Species as given in paper; family updated to current classification |
| chromosome_number_2n | Karyotype | Standard diploid complement (Bs excluded) |
| n_haploid | Karyotype | Haploid number from meiotic preparations |
| sex_chr_system | Karyotype | Recorded verbatim - XY, Xyp, X0, neo-XY, X₁X₂Y, ZW, etc. |
| sex_of_specimen | Karyotype | male / female / both (separate rows when 2n differs by sex) |
| karyotype_formula | Karyotype | Verbatim formula, e.g. 2n = 18A + XY |
| haploid_autosome_count | Karyotype | (2n − sex chr count) / 2 - the A value for comparative work |
| chromosome_morphology | Karyotype | M / SM / ST / T classes; verbatim formula if given |
| ploidy, b_chromosomes | Karyotype | Explicit ploidy level; B count range e.g. "0–2" |
| staining_method, NOR_position, heterochromatin_pattern | Methods | Cytogenetic technique; nucleolus organizer location; C-banding pattern |
| collection_locality, voucher_info, collection_year, number_of_specimens | Provenance | Geographic and specimen-level metadata when reported |
| extraction_confidence | Quality | 0–1 score from the extraction model; records below 0.75 are flagged for review |
| flag_for_review | Quality | True when 2n is unusual, data conflicts with known records, or confidence is borderline |
| source_type, notes, pdf_url, processed_date | Meta | full_text / abstract; extraction notes; PDF link; timestamp |
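As a worked example of the `haploid_autosome_count` formula in the table, (2n − sex chr count) / 2: a male with 2n = 20 and an XY pair has (20 − 2) / 2 = 9 autosome pairs, and an X0 male with 2n = 19 has (19 − 1) / 2 = 9. The helper below is an illustrative sketch. The function name and the choice to raise an error on inconsistent input are assumptions.

```python
def haploid_autosome_count(two_n: int, sex_chr_count: int) -> int:
    """Compute (2n - sex chromosome count) / 2.

    Raises ValueError when the autosome total is negative or odd, which
    usually means 2n and the sex chromosome system do not agree.
    """
    autosomes = two_n - sex_chr_count
    if autosomes < 0 or autosomes % 2 != 0:
        raise ValueError(
            f"2n={two_n} is inconsistent with {sex_chr_count} sex chromosome(s)"
        )
    return autosomes // 2
```

An odd autosome total is also a natural trigger for `flag_for_review`, since it often indicates a misread 2n or a misassigned sex chromosome system.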
Taxonomic scope
All four Coleoptera suborders, with 11 priority families searched first and most thoroughly.
Chrysomelidae
Cerambycidae
Carabidae
Scarabaeidae
Tenebrionidae
Staphylinidae
Elateridae
Buprestidae
Lucanidae
Running the agent
The agent is driven from the command line and runs either fully autonomously or in targeted test modes.
State is persisted in JSON files, so the agent resumes cleanly after an interruption. The PDF cache is keyed by DOI hash and is safe to delete to force re-downloads.
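One way to implement a DOI-keyed cache and resumable JSON state is sketched below. The hash algorithm (SHA-256), the `.pdf` file naming, and the function names are assumptions. The text only says that the cache is keyed by a hash of the DOI and that state lives in JSON files.

```python
import hashlib
import json
import os
from pathlib import Path


def cache_path(doi: str, cache_dir: Path) -> Path:
    """Map a DOI to a stable file name inside the PDF cache."""
    # DOIs are case-insensitive, so normalise before hashing.
    digest = hashlib.sha256(doi.strip().lower().encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.pdf"


def save_state(state: dict, path: Path) -> None:
    """Write state via a temporary file so a crash cannot leave half a file."""
    path = Path(path)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    os.replace(tmp, path)  # atomic rename on the same filesystem


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty dict on a first run."""
    path = Path(path)
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))
```

Because the cache file name depends only on the DOI, deleting the cache directory loses nothing that cannot be downloaded again, which matches the "safe to delete" note above.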
Background and motivation
Coleoptera has roughly 400,000 described species, but cytogenetic data are scattered across hundreds of journals in multiple languages spanning nearly a century of publications. Compiling this literature by hand is a multi-year undertaking.
The existing Coleoptera Karyotype Database (4,960 records from 252 papers, compiled through 2022) was built by hand. This agent is designed to continue that work: reading papers, extracting structured records, and flagging ambiguous cases for review.
The project is part of the lab's broader AI research program focused on autonomous agents for data-intensive biology.
Related resources