This tool filters out duplicate reads from a reads file.
This tool utilizes the PRINSEQ package.
The input data can be in FASTQ or in FASTA format.
The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. In addition, removing duplicates can bring computational benefits by reducing the number of sequences that need to be processed and by lowering the memory requirements. Sequence duplicates can also distort abundance or expression measures and can lead to false variant (SNP) calls.
Duplicate reads can be defined in several ways.
The reads that pass the duplicate filtering condition are saved to a file called accepted.fastq or accepted.fasta. You can also choose to write out the duplicate reads that are filtered out. These reads are stored in a file called rejected.fastq or rejected.fasta. You can also print out a log file that contains information about the filtering task and statistics on how many reads were accepted and rejected.
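As a rough illustration of the accepted/rejected split, the following Python sketch treats a read as a duplicate only when its sequence is identical to one seen earlier. This is an assumed, simplest-case definition chosen for illustration, not the PRINSEQ implementation; the function name split_duplicates and the sample reads are hypothetical.

```python
def split_duplicates(records):
    """Split (header, sequence) pairs into accepted and rejected lists.

    Assumption for this sketch: a read is a duplicate when its sequence
    is exactly identical to the sequence of an earlier read.
    """
    seen = set()
    accepted, rejected = [], []
    for header, sequence in records:
        if sequence in seen:
            # Sequence already encountered: this read is filtered out.
            rejected.append((header, sequence))
        else:
            # First occurrence of this sequence: keep the read.
            seen.add(sequence)
            accepted.append((header, sequence))
    return accepted, rejected


# Hypothetical sample reads; read3 repeats the sequence of read1.
reads = [
    ("read1", "ACGTACGT"),
    ("read2", "TTGACCAA"),
    ("read3", "ACGTACGT"),
]
accepted, rejected = split_duplicates(reads)

# Simple statistics, similar in spirit to what a log file could report.
print(f"accepted: {len(accepted)}, rejected: {len(rejected)}")
```

Only the first occurrence of each sequence is kept, so the accepted set contains one representative per distinct sequence and the rejected set holds the remaining copies.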