Find old genome assemblies

9/5/2023

Genome assembly is the process by which an unknown genome sequence is constructed by detecting overlaps between a set of redundant genomic reads. Most genome assemblers represent the overlap information using different kinds of assembly graph [1, 2]. The main idea behind these algorithms is to reduce the genome assembly problem to a path problem in which the genome is reconstructed by finding the true genome path in a tangled assembly graph [1, 2]. The entanglement comes from the complexity that repetitive genomic regions induce in the assembly graphs [1, 2]. The first graph-based genome assemblers used overlaps of variable length to construct an overlap graph [2]. The main goal of the overlap graph approach, and of its subsequent evolution, the string graph [3], is to preserve the read information [2, 3]. However, read-level graph construction requires an expensive all-versus-all read comparison [3]. The read-level nature implies that a path in such a graph represents a read layout, and a subsequent consensus step must be performed to improve the quality of the bases called along the path [3]. These graph properties are the foundation of the overlap–layout–consensus (OLC) paradigm [3, 4, 5].

A seemingly counterintuitive idea is to fix the overlap length to a given size (k) to build a de Bruijn graph [1]. However, de Bruijn graphs have several favorable properties that make them the method of choice in many modern short-read assemblers [6, 7, 8]. In this approach, the fixed-length exact overlaps are detected by breaking the reads into consecutive k-mers [1]. The k-mers are usually stored in hash tables (constant query time), thus avoiding entirely the costly all-versus-all read comparison [6, 7, 8]. Unlike a string graph, the de Bruijn graph is a base-level graph [1, 6, 7, 8]; thus, a path (contig) represents a consensus sequence derived from a pileup of the reads generating the k-mers (k-mer frequency). Moreover, the de Bruijn graph is useful for characterizing repeated as well as unique sequences of a genome (repeat graph [9]). However, by splitting the reads into k-mers, valuable information from the reads may be lost, especially when these are much longer than the selected k-mer size [3].

The type of overlap detected, and therefore the type of assembly graph constructed, is related to the sequencing technology used to generate the reads. One class of modern high-throughput sequencing machines produces short (100–300 base pairs (bp)) and accurate reads, whereas a second class produces long (on the order of 10 kilobases (kb)) but error-prone reads. […] 2 kb in a fast way [16, 18, 19]. The newest long-read assemblers are therefore starting to be good also at goal 3 (refs. […]).
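The de Bruijn construction described above (reads broken into consecutive k-mers, k-mers kept in a hash table, contigs read off as paths) can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not the method of any particular assembler: the function names `build_de_bruijn` and `walk_contig` and the three example reads are hypothetical, the reads are error-free and single-stranded, and the walk simply stops at the first branch or cycle.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Count k-mers in a hash table and link (k-1)-mer nodes.

    Returns (edges, kmer_counts). No read is ever compared against
    another read: each k-mer is looked up in a dict in constant time.
    """
    kmer_counts = defaultdict(int)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer_counts[read[i:i + k]] += 1

    # Each distinct k-mer is an edge from its prefix to its suffix.
    edges = defaultdict(set)
    for kmer in kmer_counts:
        edges[kmer[:-1]].add(kmer[1:])
    return edges, kmer_counts


def walk_contig(edges, start):
    """Extend a contig from `start` while the path is unambiguous."""
    contig, node, seen = start, start, {start}
    while len(edges.get(node, ())) == 1:
        (nxt,) = edges[node]
        if nxt in seen:  # hypothetical safeguard against cycles (repeats)
            break
        contig += nxt[-1]  # base-level graph: each step adds one base
        seen.add(nxt)
        node = nxt
    return contig


# Hypothetical toy reads sampled from the sequence ATGGCGTGCA.
reads = ["ATGGCGT", "GGCGTGC", "CGTGCA"]
edges, kmer_counts = build_de_bruijn(reads, k=4)
print(kmer_counts["GGCG"])         # 2: k-mer frequency from two overlapping reads
print(walk_contig(edges, "ATG"))   # ATGGCGTGCA
```

The k-mer counts play the role of the k-mer frequency mentioned in the text, and the walk adds one base per step, which is what makes the graph base-level. The sketch also shows the limitation noted above: once the reads are split, nothing records which k-mers came from the same read.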

Generating accurate genome assemblies of large, repeat-rich human genomes has proved difficult using only long, error-prone reads, and most human genomes assembled from long reads add accurate short reads to polish the consensus sequence. Here we report an algorithm for hybrid assembly, WENGAN, that provides very high quality at low computational cost. We demonstrate de novo assembly of four human genomes using a combination of sequencing data generated on ONT PromethION, PacBio Sequel, Illumina and MGI technology. WENGAN implements efficient algorithms to improve assembly contiguity as well as consensus quality. The resulting genome assemblies have high contiguity (contig NG50: 17.24–80.64 Mb), few assembly errors (contig NGA50: 11.8–59.59 Mb), good consensus quality (QV: 27.84–42.88) and high gene completeness (BUSCO complete: 94.6–95.2%), while consuming low computational resources (CPU hours: 187–1,200). In particular, the WENGAN assembly of the haploid CHM13 sample achieved a contig NG50 of 80.64 Mb (NGA50: 59.59 Mb), which surpasses the contiguity of the current human reference genome (GRCh38 contig NG50: 57.88 Mb).
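For readers unfamiliar with the contiguity metric quoted here, the following short Python sketch shows how a contig NG50 is commonly computed: sort contigs from longest to shortest and report the length at which the running total first reaches half of the estimated genome size. The function name `ng50` and the toy contig lengths are hypothetical and are not taken from the WENGAN results. NGA50 follows the same rule, but is computed on aligned blocks after contigs are broken at misassemblies, which this sketch does not attempt.

```python
def ng50(contig_lengths, genome_size):
    """Return the contig length L such that contigs of length >= L
    together cover at least half of `genome_size`.

    Returns None when the assembly covers less than half of the genome.
    """
    half = genome_size / 2
    running_total = 0
    for length in sorted(contig_lengths, reverse=True):
        running_total += length
        if running_total >= half:
            return length
    return None


# Hypothetical toy assembly: four contigs, estimated genome size of 120.
print(ng50([50, 30, 20, 10], genome_size=120))  # 30
```

Because the threshold depends on the genome size and not on the assembly's own total length, NG50 values from different assemblies of the same genome can be compared directly.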
