
Chromosome-level genome assemblies, in which sequencing contigs are scaffolded into chromosome-scale sequences, have become an essential tool for connecting raw sequence data to cytogenetic and evolutionary inference. Combined with cytogenetic data, chromosome-scale scaffolds support inference about karyotype structure; they also enable identification of sex-linked scaffolds and comparison of genome architecture across taxa. Assembly quality is typically validated by cross-referencing scaffold counts and sizes against established cytogenetic data and by computing completeness metrics such as BUSCO scores. A recurring finding is that a small number of large scaffolds capture the vast majority of genomic content, while hundreds of smaller scaffolds represent residual, often repetitive, sequence.

Chromosome-level genome assemblies are DNA sequences organized into chromosome-scale pieces — a major step up from the fragmented sequences that come directly from DNA-sequencing machines. When scientists can arrange these pieces into full chromosomes, the chromosome-scale scaffolds support inference about karyotype structure when combined with cytogenetic data, and also help identify which chromosomes carry sex-determining genes and how genome layout compares across species. (Direct chromosome counts still come from cytogenetic karyotyping rather than the assembly itself.) Scientists check whether their assemblies are correct by comparing the number and size of scaffolds (the organized pieces) against what cytogenetics — the study of chromosomes — already knows about that organism. They also calculate completeness scores, like BUSCO scores, to measure how much of the genome they've actually captured. One consistent pattern shows up: a small number of very large scaffolds hold most of the genetic information, while hundreds of smaller scaffolds contain leftover sequences that are often repetitive stretches of DNA.

Genome Assembly

Current understanding

Supporting evidence

Two chromosome-level assemblies of scarab beetles (Coleoptera: Scarabaeidae) illustrate these principles. The assembly of the endemic and endangered long-armed scarab Cheirotonus formosanus used Hi-C contact mapping to scaffold sequence into 10 primary large scaffolds, interpreted as 9 autosomes plus an X chromosome. That count matches the haploid complement expected under the modal Coleoptera karyotype of 2n = 20 (9AA + XY). As its authors note, “The final corrected contact map displayed 10 primary large scaffolds, including 9 autosomes and X chromosomes. This genetic architecture is highly consistent with known cytogenetic data for the group. The majority of Coleoptera possess a diploid number of 2n = 20 (9AA + XY).” This indicates that modern long-read sequencing combined with Hi-C scaffolding can recover biologically interpretable chromosome structure in non-model organisms (see Chien et al. 2026, Finding 1).

The first chromosome-level assembly for the genus Chrysina, based on Chrysina gloriosa, further illustrates expected assembly characteristics. The assembly spans 642 Mb across 454 scaffolds, with the 10 largest scaffolds alone capturing 98% of the genome, a scaffold N50 of 72 Mb, and a BUSCO score of 95.5% (A reference quality genome 2024, Finding 1). This pattern, in which a handful of chromosome-scale scaffolds dominate the assembly while hundreds of minor scaffolds account for a small fraction of sequence, is what is expected when long-read technologies successfully resolve chromosome-scale structure. The high BUSCO completeness further indicates that gene space is well represented despite residual fragmentation.
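The two contiguity figures quoted above (scaffold N50 and the share of the genome held by the largest scaffolds) can be computed from a list of scaffold lengths. The following is a minimal sketch rather than the method used by either assembly's authors; the function names and the toy scaffold lengths are hypothetical and are not taken from the C. gloriosa assembly.

```python
def n50(lengths):
    """Return the scaffold N50: the length of the scaffold at which the
    running total, taken from longest to shortest, first reaches half of
    the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def top_k_fraction(lengths, k):
    """Return the fraction of total assembly length in the k longest scaffolds."""
    ordered = sorted(lengths, reverse=True)
    return sum(ordered[:k]) / sum(ordered)


# Hypothetical toy assembly (lengths in megabases): five chromosome-scale
# scaffolds plus twenty small residual scaffolds. Illustrative values only.
toy_lengths = [100, 90, 80, 70, 60] + [1] * 20

toy_n50 = n50(toy_lengths)
toy_top5 = top_k_fraction(toy_lengths, 5)
print(f"N50 = {toy_n50} Mb; top 5 scaffolds hold {toy_top5:.1%} of the assembly")
```

Because N50 depends only on how length is distributed among scaffolds, it says nothing about sequence that was never assembled, which is why it is usually reported alongside completeness and genome size estimates.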

Contradictions / open disagreements

Two important caveats complicate interpreting assembly statistics at face value. First, in the C. formosanus case, chromosome number is inferred indirectly from Hi-C contact patterns and consistency with relatives rather than from direct cytogenetic counts on the focal species. If direct chromosome counts were to differ from the inferred 2n = 20, the scaffold-to-chromosome correspondence would need to be revisited.

Second, the C. gloriosa assembly (642 Mb) is notably smaller than flow-cytometry-based genome size estimates (~850 Mb) for the species. The authors attribute this discrepancy to unassembled repetitive content, a common limitation of current long-read approaches. This means that BUSCO scores and scaffold N50 values, while informative about gene-space completeness and contiguity, do not fully capture how much of a genome, particularly its repetitive fraction, has actually been assembled. Both cases underscore a broader methodological tension: assembly-based inference is powerful but benefits from orthogonal validation via cytogenetics and independent genome size estimation.
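The size of this discrepancy can be made explicit with simple arithmetic. The sketch below uses the two figures quoted above and treats the flow-cytometry value as approximate, as the source does; the function and variable names are hypothetical.

```python
def assembled_fraction(assembly_size, estimated_size):
    """Return the assembly size as a fraction of an independent genome size estimate."""
    if estimated_size <= 0:
        raise ValueError("estimated_size must be positive")
    return assembly_size / estimated_size


# Figures quoted in the text, in megabases; the flow-cytometry value is approximate.
assembly_size_mb = 642.0
flow_cytometry_size_mb = 850.0

fraction = assembled_fraction(assembly_size_mb, flow_cytometry_size_mb)
gap_mb = flow_cytometry_size_mb - assembly_size_mb
print(f"assembled: {fraction:.1%}; unassembled: about {gap_mb:.0f} Mb")
```

On these numbers the assembly accounts for roughly three quarters of the estimated genome, leaving about 208 Mb unaccounted for, even though the BUSCO score reported for the same assembly is 95.5%.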

Tealc’s citation-neighborhood suggestions

Future work on this topic would benefit from comparative studies formally benchmarking Hi-C-based chromosome assignment against cytogenetic squash preparations, as well as broader surveys of chromosome-level Coleoptera assemblies that contextualize individual results within a larger phylogenetic framework. Studies specifically comparing flow-cytometry estimates to final assembly sizes across insects would help quantify the “repeatome gap” systematically.
