The CUREs karyotype database.
63,682 chromosome-number records spanning 55 eukaryotic clades, each tied to its primary source. The dataset was assembled over several years by rolling cohorts of undergraduate researchers in our CUREs program, and it ships as open JSON and CSV so other labs and AI agents can build on it without scraping. This page hosts the data behind the 2026 preprint.
How to use the data
Every record in this database comes from a primary source: a paper, dissertation, or dataset whose authors did the slow, careful work of generating these chromosome counts. Citations matter. They are how careers get evaluated, how grants get awarded, and how the people whose work we depend on get credit. Please cite responsibly.
- Using data from a single clade? Cite the original source for that clade (see the Sources tab, or the citation column in the data table). Do not cite this database in place of the primary work.
- Combining data across multiple clades? Cite this database (Copeland et al. 2026) and list the clade-level sources you drew on in your supplementary materials.
- Downloading the full dataset? The CSV file ships with a citation column for exactly this reason. Please carry that column through your analyses so the original authors can be credited downstream.
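One way to carry the citation column through a multi-clade analysis is to collect the distinct citations for each clade you use, so they can be listed in supplementary materials. The Python below is a minimal sketch, not the project's own tooling. The column names (`clade`, `species`, `haploid_number`, `citation`) are assumptions and may not match the downloaded CSV header, so check it first. The inline sample rows are invented placeholders, not records from the database.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows standing in for the downloaded CSV. The values are invented
# for illustration; the column names are assumed and may differ in the real file.
SAMPLE_CSV = """clade,species,haploid_number,citation
Clade A,Species one,10,Source A (placeholder)
Clade A,Species two,12,Source A (placeholder)
Clade B,Species three,8,Source B (placeholder)
"""


def citations_by_clade(csv_file, clades):
    """Return {clade: sorted list of distinct citations} for the requested clades."""
    wanted = set(clades)
    found = defaultdict(set)
    for row in csv.DictReader(csv_file):
        if row["clade"] in wanted:
            found[row["clade"]].add(row["citation"])
    return {clade: sorted(cites) for clade, cites in found.items()}


sources = citations_by_clade(io.StringIO(SAMPLE_CSV), ["Clade A", "Clade B"])
for clade, cites in sorted(sources.items()):
    print(f"{clade}: {'; '.join(cites)}")
```

For a real download, pass a file opened with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` wrapper.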
How to cite
Copeland, M., McConnell, M., Barboza, A., Abraham, H.M., Alfieri, J., Arackal, S., Bernard, C.E., Bryant, K., Cast, S., Chien, S., Clark, E., Cruz, C.E., Diaz, A.Y., Deiterman, O., Girish, R., Harper, K., Hjelmen, C.E., Thompson, M.J., Koehl, R., Koneru, T., Laird, K., Lee, Y., Lopez, V.R., Murphy, M., Perez, N., Schmalz, S., Sylvester, T., and Blackmon, H. (2026). Dismantling Chromosomal Stasis Across the Eukaryotic Tree of Life. bioRxiv 2026.04.14.718287. https://doi.org/10.64898/2026.04.14.718287
@article{Copeland2026.04.14.718287,
author = {Copeland, Megan and McConnell, Meghann and Barboza, Andres and Abraham, Hannah M and Alfieri, James and Arackal, Steven and Bernard, Carrie E and Bryant, Kiedon and Cast, Shelbie and Chien, Sean and Clark, Emily and Cruz, Cassandra E and Diaz, Aileen Y and Deiterman, Olivia and Girish, Riya and Harper, Kaya and Hjelmen, Carl E and Thompson, Michelle J and Koehl, Rachel and Koneru, Tanvi and Laird, Kenzie and Lee, Yoonseo and Lopez, Virginia R and Murphy, Mallory and Perez, Nayeli and Schmalz, Sarah and Sylvester, Terrence and Blackmon, Heath},
title = {Dismantling Chromosomal Stasis Across the Eukaryotic Tree of Life},
year = {2026},
doi = {10.64898/2026.04.14.718287},
publisher = {Cold Spring Harbor Laboratory},
journal = {bioRxiv},
elocation-id = {2026.04.14.718287},
url = {https://www.biorxiv.org/content/early/2026/04/16/2026.04.14.718287}
}
Browse the data
The interactive viewer below loads the full dataset and lets you filter by clade, search for species, sort any column, download the filtered subset as CSV, and plot distributions. All 63,682 records are available through the Data Table tab; source citations for each clade live on the Sources tab.
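The same filter, tally, and export steps can be done offline on the downloaded CSV. The sketch below makes three assumptions: the column names (`clade`, `species`, `haploid_number`, `citation`) are hypothetical, the haploid numbers are assumed to be plain integers, and the sample rows are invented placeholders. The citation column is kept in the exported subset.

```python
import csv
import io
from collections import Counter

# Invented placeholder rows; column names are assumed, not confirmed.
SAMPLE_CSV = """clade,species,haploid_number,citation
Clade A,Species one,10,Source A (placeholder)
Clade A,Species two,12,Source A (placeholder)
Clade B,Species three,8,Source B (placeholder)
"""


def filter_clade(rows, clade):
    """Keep only the rows belonging to one clade."""
    return [row for row in rows if row["clade"] == clade]


def haploid_distribution(rows):
    """Count how many records share each haploid number (assumes integer values)."""
    return Counter(int(row["haploid_number"]) for row in rows)


def to_csv(rows, fieldnames):
    """Serialize rows back to CSV text, keeping every column including citation."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


reader = csv.DictReader(io.StringIO(SAMPLE_CSV))
all_rows = list(reader)
subset = filter_clade(all_rows, "Clade A")
distribution = haploid_distribution(subset)
subset_csv = to_csv(subset, reader.fieldnames)
print(distribution)
print(subset_csv)
```

If the real data contains non-integer entries in the haploid-number column, `haploid_distribution` would need to skip or parse them explicitly.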
Records by Clade
| Clade | Species | Haploid Number | Citation |
|---|---|---|---|
About the CUREs program
This database is one output of the Biology & AI CURE at Texas A&M, a course-based research program that embeds undergraduates in real research from their first semester. The workflow that produced this dataset was a genuine collaboration among students, AI tools, and human experts.
Students used AI tools to locate candidate records in the primary literature, then evaluated each one for appropriateness, checking that the source was a credible cytogenetic study and that the count was unambiguous. Faculty and graduate students independently reviewed the curated records, providing a second pass over every entry before it entered the database. On the computation side, students used AI to help write and debug parsing scripts. Over the arc of the course, that work converged on a single script capable of handling every dataset format in the collection. The faculty and lead author independently developed their own analysis scripts and ran them in parallel, confirming that both approaches converged on the same answers. The result is a dataset no single graduate student could have compiled in a reasonable timeframe, built by people who were learning the biology and the tooling at the same time.