PTHP: Index for Optimizing Genome Assembly Overlapping and Read Alignment
DOI:
https://doi.org/10.15379/ijmst.v10i1.2690
Keywords:
Data structures; Genome assembly; Genome indexing; Genome assembly overlapping; Genome assembly read alignment; Genome assembly performance; Genome prefix tree index; Genome hash index
Abstract
Sequencing technology cannot read a genome in one piece; it yields the sequence only as massive numbers of short strings called reads. Genome assembly reconstructs the complete genome from these reads, either from the overlaps between the reads (the de novo approach) or by aligning the reads to their positions in an available reference genome (the reference-guided approach). In both cases, millions of reads must be searched for overlaps or alignments, a well-known data structure problem called all-against-all. Many studies have proposed indexing techniques, such as hash indexes and prefix tree indexes, together with parallelization, to optimize either overlapping or read alignment individually. However, the massive data volume and the repeats in the genome still limit index efficiency, so further enhancement is needed. This article introduces a new hybrid index named the Prefix Tree Hash Partitioned (PTHP) index, which combines a prefix tree index, a hash index, the pigeonhole concept, and parallelization. On simulated and real datasets, the PTHP index shows significant results: it reduces the computational time complexity of overlapping and read alignment, and thus the assembly time, outperforming both the prefix tree index and the hash index. Improving overlapping and read alignment with the PTHP index also gives strong results in optimizing hybrid genome assembly, which combines both.
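The abstract does not give the PTHP construction, so the following is only a minimal sketch of two of the ingredients it names: a hash index of k-mers partitioned by k-mer prefix, and pigeonhole seeding for read alignment. It is not the authors' implementation. The function names, the parameters `k` and `prefix_len`, and the restriction to substitution-only mismatches are all assumptions made for illustration. The pigeonhole argument is that a read aligned with at most e mismatches, when cut into e + 1 non-overlapping seeds, must contain at least one seed that matches the reference exactly.

```python
from collections import defaultdict


def build_partitioned_index(reference, k, prefix_len=2):
    """Hash every k-mer of `reference` to its start positions.

    K-mers are bucketed by their first `prefix_len` characters
    (hypothetical partitioning scheme, chosen for illustration).
    """
    partitions = defaultdict(lambda: defaultdict(list))
    for pos in range(len(reference) - k + 1):
        kmer = reference[pos:pos + k]
        partitions[kmer[:prefix_len]][kmer].append(pos)
    return partitions


def align_read(read, reference, partitions, k, max_mismatches, prefix_len=2):
    """Return sorted start positions where `read` matches `reference`
    with at most `max_mismatches` substitutions (no indels)."""
    n_seeds = max_mismatches + 1
    if len(read) < n_seeds * k:
        raise ValueError("read too short for the requested number of seeds")

    hits = set()
    for i in range(n_seeds):
        # Pigeonhole: at least one of these disjoint seeds is mismatch-free.
        offset = i * k
        seed = read[offset:offset + k]
        bucket = partitions.get(seed[:prefix_len], {})
        for pos in bucket.get(seed, ()):
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            # Verify the full read against the candidate window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.add(start)
    return sorted(hits)
```

Because each prefix bucket is independent of the others, buckets could in principle be built or queried by separate workers; that parallel step is left out of this sketch.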