File minimizer algorithm

1/5/2024

The cost-effectiveness of next-generation sequencing (NGS) has advanced genomic research, regularly generating large amounts of raw data that often require efficient infrastructure, such as data centers, to manage storage and transmission. The generated NGS data are highly redundant and need to be compressed efficiently to reduce the cost of storage space and transmission bandwidth. We present a lossless, non-reference-based FASTQ compression algorithm, known as LFastqC, an improvement over the LFQC tool, to address these issues. LFastqC is compared with several state-of-the-art compressors, and the results indicate that LFastqC achieves better compression ratios for important datasets such as the LS454, PacBio, and MinION. Moreover, LFastqC has a better compression and decompression speed than LFQC, which was previously the top-performing compression algorithm for the LS454 dataset.

Citation: Al Yami S, Huang C-H (2019) LFastqC: A lossless non-reference-based FASTQ compressor. PLoS ONE 14(11):

Editor: Ruslan Kalendar, University of Helsinki, FINLAND

Received: Ma; Accepted: Octo; Published: November 14, 2019

Copyright: © 2019 Al Yami, Huang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Data Availability: The data underlying this study is deposited to the NCBI SRA ( ) under the following accession numbers: SRX000376, SRX000706, SRX000712, SRX000711, SRX002925, SRX011353, SRX181937, SRX089128, SRX533603, SRX5822585, SRX5327410, ERX3333090, ERX593919.

Funding: The authors received no specific funding for this work.

Competing interests: The authors have declared that no competing interests exist.

Next-generation sequencing (NGS) technologies have accelerated genomic research, producing a significant amount of data at a fast pace and low cost. However, storage technology is evolving at a much slower pace than NGS technologies, which poses challenges for data storage. Data centers are often used as a solution, but they incur considerable costs for storage space and transmission bandwidth. The time required for data transmission over a network can be reduced by compressing the highly redundant genomic data. Most data centers use generic compressors, such as gzip and bzip2. However, these are not ideal solutions for compressing NGS data, since both were designed for general-purpose compression and have been shown to perform inadequately when compressing genomic data. Therefore, it is important for researchers to have an appropriate tool that is specifically developed for NGS data compression.

The NGS-generated data are stored in the FASTQ format. FASTQ files comprise millions to billions of records, each of which contains the following four lines:

Line 1 stores a sequence identifier and begins with the @ character.
Line 2 represents the read sequence.
Line 3 begins with the + character, optionally followed by the same sequence identifier as Line 1.
Line 4 represents the quality score value for the sequence in Line 2 and has the same length as Line 2.

As indicated above, NGS produces DNA sequences and a corresponding quality score value for each base. This value is the probability that the corresponding base call is incorrect. These error values are then converted into integers.
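The four-line FASTQ record and the integer quality values described above can be illustrated with a short Python sketch. It is not taken from LFastqC. The names `FastqRecord`, `read_fastq`, `phred_scores`, and `error_probability` are hypothetical. The text does not say how error probabilities become integers, so the sketch assumes the widely used Phred convention, Q = -10 * log10(p), stored as ASCII characters with an offset of 33.

```python
import math
from typing import Iterable, Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    identifier: str  # Line 1, without the leading '@'
    sequence: str    # Line 2, the read sequence
    quality: str     # Line 4, one character per base in Line 2


def read_fastq(lines: Iterable[str]) -> Iterator[FastqRecord]:
    """Yield records from an iterable of text lines, four lines per record."""
    it = iter(lines)
    for header in it:
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        try:
            sequence = next(it).rstrip("\r\n")
            separator = next(it).rstrip("\r\n")
            quality = next(it).rstrip("\r\n")
        except StopIteration:
            raise ValueError("truncated FASTQ record") from None
        if not header.startswith("@"):
            raise ValueError("Line 1 must begin with '@'")
        if not separator.startswith("+"):
            raise ValueError("Line 3 must begin with '+'")
        if len(sequence) != len(quality):
            raise ValueError("Line 4 must have the same length as Line 2")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Decode a quality string into integer scores (assumed Phred+33)."""
    return [ord(ch) - offset for ch in quality]


def error_probability(score: int) -> float:
    """Invert Q = -10 * log10(p) to recover the error probability p."""
    return 10 ** (-score / 10)


def phred_from_probability(p: float) -> int:
    """Convert an error probability into a rounded integer Phred score."""
    return round(-10 * math.log10(p))


if __name__ == "__main__":
    sample = "@read1\nACGT\n+\nII5!\n".splitlines()
    for record in read_fastq(sample):
        scores = phred_scores(record.quality)
        print(record.identifier, record.sequence, scores)
        print([error_probability(q) for q in scores])
```

Under this assumed encoding, a score of 20 corresponds to a 1-in-100 chance that the base call is wrong, and a score of 40 to a 1-in-10,000 chance. Older FASTQ variants use a different offset, so `offset` is left as a parameter.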
