Filter out duplicate reads

Description

This tool filters out duplicate reads from a reads file.

Details

This tool utilizes the PRINSEQ package.
The input data can be in FASTQ or in FASTA format.

The main purpose of removing duplicates is to mitigate the effects of PCR amplification bias introduced during library construction. Removing duplicates also reduces the number of sequences that need to be processed, which lowers running time and memory requirements. In addition, sequence duplicates can distort abundance or expression measures and can lead to false variant (SNP) calls.

Duplicate reads can be defined in several ways. In addition to choosing how duplicates are defined, you should set a threshold value, Number of allowed duplicates. For example, to remove sequences that occur more than 5 times, specify the value 6. Note that this parameter is used only when filtering exact duplicates or reverse complement exact duplicates.
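
The threshold logic can be illustrated with a minimal Python sketch. This is not PRINSEQ's code: the function name filter_duplicates and the parameters allowed and use_reverse_complement are hypothetical. It also assumes that, once a sequence occurs at least as many times as the threshold, the first copy is kept and the remaining copies are rejected; the tool's actual handling may differ.

```python
from collections import Counter

# Translation table for complementing DNA bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def filter_duplicates(reads, allowed=2, use_reverse_complement=False):
    """Split reads into (accepted, rejected) lists.

    Hypothetical illustration: a sequence group is filtered only if it
    occurs at least `allowed` times. In that case the first copy is
    kept and the remaining copies are rejected (an assumption).
    """
    def key(seq):
        seq = seq.upper()
        if use_reverse_complement:
            # Treat a read and its reverse complement as the same sequence.
            return min(seq, reverse_complement(seq))
        return seq

    counts = Counter(key(seq) for seq in reads)
    seen = set()
    accepted, rejected = [], []
    for seq in reads:
        k = key(seq)
        if counts[k] >= allowed and k in seen:
            rejected.append(seq)
        else:
            accepted.append(seq)
        seen.add(k)
    return accepted, rejected


# One sequence occurs 6 times, another 5 times; the threshold is 6.
reads = ["AAAC"] * 6 + ["CCCA"] * 5
accepted, rejected = filter_duplicates(reads, allowed=6)
print(len(accepted), len(rejected))
```

With the threshold set to 6, only the sequence that occurs 6 times is affected, while the sequence that occurs 5 times passes through unchanged.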

Output

The reads that pass the duplicate filtering condition are saved to a file called accepted.fastq or accepted.fasta, depending on the input format. You can also choose to write out the duplicate reads that are filtered out; these are stored in the file rejected.fastq or rejected.fasta. In addition, you can produce a log file that contains information about the filtering task and statistics on how many reads were accepted and rejected.

Reference

This tool is based on the PRINSEQ package.