Autonomous Karyotype Data Collector

Records extracted: 629+
Search terms: 730
Beetle families targeted: 144
Fields per record: 33
AI models in pipeline: 2

PubMed - active
CrossRef - active
Semantic Scholar - rate-limited, disabled
Google Scholar - legally ambiguous, disabled
Unpaywall open-access PDFs - active

How it works

The agent runs a five-stage loop. Taxa found in new extractions generate additional search queries, so coverage expands automatically.
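The feedback step can be pictured with a minimal sketch. It assumes extracted records are dictionaries carrying the `genus` and `family` fields from the output schema; the function name and the query templates are hypothetical, since the agent's real templates are not given here.

```python
from typing import Iterable, List, Set

# Hypothetical query templates (assumption, not the agent's actual wording).
TEMPLATES = ('"{taxon}" karyotype', '"{taxon}" chromosome number')


def queries_for_new_taxa(records: Iterable[dict], seen_taxa: Set[str]) -> List[str]:
    """Return search queries for genera and families not seen before.

    Mutates seen_taxa so that each taxon generates queries only once.
    """
    queries: List[str] = []
    for record in records:
        for field in ("genus", "family"):
            taxon = (record.get(field) or "").strip()
            if taxon and taxon not in seen_taxa:
                seen_taxa.add(taxon)
                queries.extend(t.format(taxon=taxon) for t in TEMPLATES)
    return queries
```

Tracking seen taxa in a set keeps the query list from growing with every repeated mention of the same genus or family.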

🔍 Search (PubMed + CrossRef; 730 query terms)
→ ⚡ Triage (Claude Haiku; likely / uncertain / skip)
→ 📄 Fetch PDF (pdf_url → Unpaywall → CrossRef → abstract)
→ 🧬 Extract (Claude Sonnet; structured JSON → CSV)
→ 🔄 Generate queries (new taxa feed new searches)
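The Fetch PDF stage reads as an ordered fallback: try the direct `pdf_url`, then Unpaywall, then CrossRef, and finally settle for the abstract. A minimal sketch of that pattern follows. The fetcher functions are hypothetical stand-ins that make no network calls, and the way the real agent handles failures is an assumption.

```python
from typing import Callable, Optional, Sequence, Tuple

# Each source is a (label, fetcher) pair; a fetcher returns text or None.
Source = Tuple[str, Callable[[dict], Optional[str]]]


def fetch_with_fallback(
    paper: dict, sources: Sequence[Source]
) -> Tuple[Optional[str], Optional[str]]:
    """Try each source in order and return (label, text) for the first hit."""
    for label, fetch in sources:
        try:
            text = fetch(paper)
        except Exception:
            # Assumption: a failing source should not abort the cascade.
            text = None
        if text:
            return label, text
    return None, None


# Hypothetical stand-ins for the real fetchers (no network access here).
def from_pdf_url(paper: dict) -> Optional[str]:
    return None  # pretend the direct link is dead


def from_unpaywall(paper: dict) -> Optional[str]:
    return None


def from_crossref(paper: dict) -> Optional[str]:
    return None


def from_abstract(paper: dict) -> Optional[str]:
    return paper.get("abstract")


CASCADE = [
    ("pdf_url", from_pdf_url),
    ("unpaywall", from_unpaywall),
    ("crossref", from_crossref),
    ("abstract", from_abstract),
]
```

Returning the label alongside the text lets a caller record whether a row came from full text or only from the abstract.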

The loop runs until 10,000 records are collected, all 730 search terms are exhausted, or five consecutive rounds return no new papers. State is persisted to disk, so the agent resumes cleanly after an interruption.
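The three stopping conditions can be expressed as a small check that runs after every round. This is a sketch only: the `LoopState` class, its field names, and the returned reason strings are hypothetical, while the thresholds come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

MAX_RECORDS = 10_000   # record target stated above
MAX_EMPTY_ROUNDS = 5   # consecutive rounds with no new papers


@dataclass
class LoopState:
    records_collected: int = 0
    pending_terms: int = 730   # search terms not yet run
    empty_rounds: int = 0      # consecutive rounds without new papers


def should_stop(state: LoopState) -> Optional[str]:
    """Return the reason to stop, or None to keep looping."""
    if state.records_collected >= MAX_RECORDS:
        return "record target reached"
    if state.pending_terms == 0:
        return "search terms exhausted"
    if state.empty_rounds >= MAX_EMPTY_ROUNDS:
        return "no new papers"
    return None


def record_round(state: LoopState, new_papers: int) -> None:
    """Update the empty-round counter; any new paper resets the streak."""
    state.empty_rounds = 0 if new_papers > 0 else state.empty_rounds + 1
```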

Key capabilities

Handles old monographs, Russian-language journals, and multi-species tables.

🤖

Dual-model AI pipeline

Claude Haiku handles triage at roughly 10x lower cost than Sonnet. Claude Sonnet does the full extraction, producing validated JSON with a confidence score for every record.

📖

Domain-aware extraction

Every extraction prompt includes a 330-line cytogenetics guide encoding sex chromosome notation, B chromosome rules, ploidy conventions, and family-specific notes for 11 priority beetle families.

🌐

Multi-language, any era

The agent processes papers in Russian, Japanese, Spanish, Portuguese, and German from the 1930s onward. Soviet cytogenetics literature from the 1950s through 1980s is a priority target given how underrepresented it is in modern databases.

📥

Priority PDF inbox

Place any PDF into priority_pdfs/ and it gets processed immediately at full extraction quality, bypassing triage. Old monographs, scanned journals, and personal copies all work. Processed files move to done/ automatically.
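The inbox behaviour can be sketched as follows. The sketch assumes that `done/` sits inside `priority_pdfs/` and that extraction is a callable taking a file path; both are assumptions, and `drain_inbox` is a hypothetical name.

```python
import shutil
from pathlib import Path
from typing import Callable, List


def drain_inbox(inbox: Path, process: Callable[[Path], None]) -> List[Path]:
    """Process every PDF in the inbox, then move it to inbox/done/."""
    done = inbox / "done"  # assumed location of the done/ folder
    done.mkdir(parents=True, exist_ok=True)
    moved: List[Path] = []
    for pdf in sorted(inbox.glob("*.pdf")):
        process(pdf)  # full extraction, no triage step
        target = done / pdf.name
        shutil.move(str(pdf), str(target))
        moved.append(target)
    return moved
```

Moving a file only after `process` returns means a crash mid-extraction leaves the PDF in the inbox for the next run.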

🛡️

Automatic data quality flags

Records are flagged for review when 2n falls outside the expected range, the sex chromosome system conflicts with known family patterns, the paper is a secondary source, or confidence falls below 0.75.
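A minimal sketch of these four checks is shown below. The reference tables are passed in as arguments because the agent's actual expected ranges and family patterns are not listed here; the `is_secondary_source` key and the reason strings are hypothetical.

```python
from typing import Dict, List, Set, Tuple

CONFIDENCE_THRESHOLD = 0.75  # from the text above


def review_reasons(
    record: dict,
    expected_2n: Dict[str, Tuple[int, int]],
    known_systems: Dict[str, Set[str]],
) -> List[str]:
    """Return the reasons a record needs review (empty list if none)."""
    reasons: List[str] = []
    family = record.get("family")

    two_n = record.get("chromosome_number_2n")
    if two_n is not None and family in expected_2n:
        low, high = expected_2n[family]
        if not low <= two_n <= high:
            reasons.append("2n outside expected range")

    system = record.get("sex_chr_system")
    if system and family in known_systems and system not in known_systems[family]:
        reasons.append("sex chromosome system conflicts with family pattern")

    if record.get("is_secondary_source"):  # hypothetical field name
        reasons.append("secondary source")

    # Assumption: a missing confidence score counts as low confidence.
    if record.get("extraction_confidence", 0.0) < CONFIDENCE_THRESHOLD:
        reasons.append("low confidence")

    return reasons
```

Returning the list of reasons rather than a bare boolean makes the review queue easier to work through; `flag_for_review` would then simply be `bool(reasons)`.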

📊

Live dashboard + Alex AI

A local web dashboard at port 8787 shows record counts, family coverage, confidence distributions, and the agent log. An embedded Claude Haiku assistant (Alex) can start, stop, and restart the agent and edit the extraction guide through a chat interface.

Output schema

Each record is a row in results.csv with 33 fields covering paper metadata, taxonomy, core karyotype data, cytogenetic methods, collection provenance, and extraction quality.

Field	Group	Description
doi, paper_title, paper_authors, paper_year, paper_journal	Paper	Source publication metadata
species, family, genus	Taxonomy	Species as given in paper; family updated to current classification
chromosome_number_2n	Karyotype	Standard diploid complement (Bs excluded)
n_haploid	Karyotype	Haploid number from meiotic preparations
sex_chr_system	Karyotype	Recorded verbatim - XY, Xyp, X0, neo-XY, X₁X₂Y, ZW, etc.
sex_of_specimen	Karyotype	male / female / both (separate rows when 2n differs by sex)
karyotype_formula	Karyotype	Verbatim formula, e.g. 2n = 18A + XY
haploid_autosome_count	Karyotype	(2n − sex chr count) / 2 - the A value for comparative work
chromosome_morphology	Karyotype	M / SM / ST / T classes; verbatim formula if given
ploidy, b_chromosomes	Karyotype	Explicit ploidy level; B count range e.g. "0–2"
staining_method, NOR_position, heterochromatin_pattern	Methods	Cytogenetic technique; nucleolus organizer location; C-banding pattern
collection_locality, voucher_info, collection_year, number_of_specimens	Provenance	Geographic and specimen-level metadata when reported
extraction_confidence	Quality	0–1 score from the extraction model; records below 0.75 are flagged for review
flag_for_review	Quality	True when 2n is unusual, data conflicts with known records, or confidence is borderline
source_type, notes, pdf_url, processed_date	Meta	full_text / abstract; extraction notes; PDF link; timestamp
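The `haploid_autosome_count` formula can be checked with a short worked example. Reading the sample formula 2n = 18A + XY as 18 autosomes plus an XY pair gives 2n = 20 and A = (20 − 2) / 2 = 9. The helper below is a sketch; how the agent derives the sex chromosome count from `sex_chr_system` is not described here, so that count is passed in directly.

```python
def haploid_autosome_count(two_n: int, sex_chr_count: int) -> int:
    """Compute A = (2n - sex chromosome count) / 2."""
    autosomes = two_n - sex_chr_count
    if autosomes < 0 or autosomes % 2:
        # An odd or negative autosome total suggests an extraction error.
        raise ValueError(
            f"2n={two_n} with {sex_chr_count} sex chromosomes "
            "does not leave an even autosome count"
        )
    return autosomes // 2


a_xy = haploid_autosome_count(20, 2)   # XY specimen with 2n = 20
a_x0 = haploid_autosome_count(19, 1)   # X0 male with 2n = 19
```

Both calls return 9, which shows why the A value is convenient for comparative work: it stays the same when 2n differs only because of the sex chromosome system.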

Taxonomic scope

All four Coleoptera suborders, with 11 priority families searched first and most thoroughly.

Priority families

Curculionidae
Chrysomelidae
Cerambycidae
Carabidae
Scarabaeidae
Tenebrionidae

Coccinellidae
Staphylinidae
Elateridae
Buprestidae
Lucanidae

Running the agent

The agent runs from the command line, either fully autonomously or in targeted test modes.

# Full autonomous run
python agent.py

# Stop after 50 papers (testing)
python agent.py --limit 50

# Verify cascade on a known good DOI
python agent.py --test-paper 10.3897/compcytogen.v10i3.9504

# Launch live dashboard at localhost:8787
python dashboard.py

State is persisted in JSON files, so the agent resumes cleanly after an interruption. The PDF cache is keyed by DOI hash and is safe to delete to force re-downloads.
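A minimal sketch of both mechanisms follows. The hash algorithm (SHA-256 here), the file layout, and the function names are assumptions; only the ideas of JSON state files and a cache keyed by DOI hash come from the text.

```python
import hashlib
import json
import os
from pathlib import Path


def cache_path(doi: str, cache_dir: Path) -> Path:
    """Map a DOI to a cache file name via a hash of the normalised DOI."""
    # DOIs are case-insensitive, so normalise before hashing.
    digest = hashlib.sha256(doi.strip().lower().encode("utf-8")).hexdigest()
    return cache_dir / f"{digest}.pdf"


def save_state(state: dict, path: Path) -> None:
    """Write state to a temporary file, then swap it into place."""
    tmp = path.with_suffix(path.suffix + ".tmp")
    tmp.write_text(json.dumps(state, indent=2), encoding="utf-8")
    os.replace(tmp, path)  # avoids leaving a half-written state file


def load_state(path: Path) -> dict:
    """Return the saved state, or an empty dict on a first run."""
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    return {}
```

Hashing the DOI sidesteps the slashes and other characters in DOIs that are awkward in file names, and because the cache holds only downloaded copies, deleting it costs nothing but a re-download.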

Background and motivation

Coleoptera has roughly 400,000 described species, but cytogenetic data are scattered across hundreds of journals in multiple languages spanning nearly a century of publications. Compiling this literature by hand is a multi-year undertaking.

The existing Coleoptera Karyotype Database (4,960 records from 252 papers, compiled through 2022) was built by hand. This agent is designed to continue that work: reading papers, extracting structured records, and flagging ambiguous cases for review.

The project is part of the lab's broader AI research program focused on autonomous agents for data-intensive biology.

Related resources

🪲

Coleoptera Karyotype Database

Browse or download 4,959 curated beetle karyotype records with interactive filters, boxplots, and citations.

🗂️

All karyotype databases

Coleoptera, Diptera, Amphibia, Mammals, Drosophila, and Polyneoptera - six interactive databases from the Blackmon Lab.

🔬

Lab research

Theoretical population genetics, sex chromosome evolution, comparative genomics, and AI-native biology at Texas A&M.