General read set filter

Description

This tool is a general filtering tool for FASTQ or FASTA formatted read files. It contains all the filtering methods used in the PRINSEQ package, including the length, complexity, Ns, and duplicate filtering methods that are also available as separate filtering tools.

Details

When you launch the filtering tool, only those filtering methods for which a value is assigned are used. You can use several filtering methods at the same time. For example, filtering with the parameters Maximum length: 100 and Maximum count of Ns: 1 would produce a read set in which all reads are shorter than 101 bases and contain fewer than 2 Ns.
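The way assigned filters combine can be illustrated with a minimal Python sketch. This is not PRINSEQ code: the function name passes_filters and its arguments are hypothetical, and only the length and N-count filters are modelled. A parameter left as None stands for a filter with no value assigned.

```python
from typing import Optional


def passes_filters(seq: str,
                   max_len: Optional[int] = None,
                   min_len: Optional[int] = None,
                   ns_max_n: Optional[int] = None) -> bool:
    """Return True if the read passes every filter that has a value assigned.

    Hypothetical helper for illustration; None means "filter not used".
    """
    if max_len is not None and len(seq) > max_len:
        return False  # longer than the maximum length
    if min_len is not None and len(seq) < min_len:
        return False  # shorter than the minimum length
    if ns_max_n is not None and seq.upper().count("N") > ns_max_n:
        return False  # more Ns than allowed
    return True


# Example from the text: Maximum length 100 and Maximum count of Ns 1.
reads = [
    "ACGT" * 25,     # 100 bases, no Ns: accepted
    "ACGT" * 26,     # 104 bases: rejected by the length filter
    "ACGTNACGTN",    # 2 Ns: rejected by the N-count filter
    "ACGTN",         # 1 N: accepted
]
accepted = [r for r in reads if passes_filters(r, max_len=100, ns_max_n=1)]
rejected = [r for r in reads if not passes_filters(r, max_len=100, ns_max_n=1)]
```

A read is accepted only if it passes all active filters, so adding a filter can only shrink the accepted set.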

For detailed descriptions of the different filtering conditions, please check the manual of the PRINSEQ package. The table below shows which PRINSEQ command line option each parameter corresponds to:
Parameter name | Command line option | Description
Maximum length | -max_len | Keep only reads that are no longer than the given value.
Minimum length | -min_len | Keep only reads that are no shorter than the given value.
Maximum GC content | -max_gc | Keep only reads whose GC content is no higher than the given value.
Minimum GC content | -min_gc | Keep only reads whose GC content is no lower than the given value.
Minimum quality score | -min_qual_score | Filter reads with at least one quality score below the given value.
Maximum quality score | -max_qual_score | Filter reads with at least one quality score above the given value.
Minimum mean quality | -min_qual_mean | Filter reads with a mean quality score below the given value.
Maximum mean quality | -max_qual_mean | Filter reads with a mean quality score above the given value.
Maximum percentage of Ns | -ns_max_p | Filter reads for which the percentage of Ns is higher than the given value.
Maximum count of Ns | -ns_max_n | Filter reads for which the count of Ns is higher than the given value.
Maximum number of reads | -seq_num | Keep only the given number of reads that pass all other filters.
Type of duplicates to filter | -derep | Type of duplicates to filter.
Number of allowed duplicates | -derep_min | This option specifies the number of allowed duplicates. For example, to remove reads that occur more than 5 times, you would specify the value 6.
Dust filter threshold | -lc_method dust -lc_threshold | Use the DUST algorithm with the given threshold value, between 0 and 100, to filter reads by sequence complexity. The dust method uses this as the maximum allowed score.
Entropy filter threshold | -lc_method entropy -lc_threshold | Use the entropy algorithm with the given threshold value, between 0 and 100, to filter reads by sequence complexity. The entropy method uses this as the minimum allowed value.
Quality data is in Phred+64 format | -phred64 | Select "yes" if the quality data in the FASTQ file is in Phred+64 format. For Illumina 1.8+, Sanger, Roche/454, Ion Torrent, and PacBio data, use the default value: no.
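As a rough illustration of the correspondence listed in the table, the Python sketch below turns a set of parameter values into a list of PRINSEQ options. The option names come from the table; the function build_prinseq_options, the dictionary layout, and the treatment of -phred64 as a value-less switch are assumptions made for this example, not a description of how the tool itself invokes PRINSEQ.

```python
# Parameter name -> PRINSEQ option, as listed in the table.
OPTION_FOR_PARAMETER = {
    "Maximum length": "-max_len",
    "Minimum length": "-min_len",
    "Maximum GC content": "-max_gc",
    "Minimum GC content": "-min_gc",
    "Minimum quality score": "-min_qual_score",
    "Maximum quality score": "-max_qual_score",
    "Minimum mean quality": "-min_qual_mean",
    "Maximum mean quality": "-max_qual_mean",
    "Maximum percentage of Ns": "-ns_max_p",
    "Maximum count of Ns": "-ns_max_n",
    "Maximum number of reads": "-seq_num",
    "Type of duplicates to filter": "-derep",
    "Number of allowed duplicates": "-derep_min",
}


def build_prinseq_options(params: dict) -> list:
    """Translate assigned parameters into command line options (hypothetical helper)."""
    args = []
    for name, value in params.items():
        if value is None or value == "":
            continue  # no value assigned: this filtering method is not used
        if name == "Dust filter threshold":
            args += ["-lc_method", "dust", "-lc_threshold", str(value)]
        elif name == "Entropy filter threshold":
            args += ["-lc_method", "entropy", "-lc_threshold", str(value)]
        elif name == "Quality data is in Phred+64 format":
            if value == "yes":
                args.append("-phred64")  # assumed to be a switch without a value
        else:
            args += [OPTION_FOR_PARAMETER[name], str(value)]
    return args


options = build_prinseq_options({
    "Maximum length": 100,
    "Maximum count of Ns": 1,
    "Minimum length": None,  # left empty, so it is skipped
})
```

With the example values, options contains -max_len 100 and -ns_max_n 1, and the unassigned Minimum length contributes nothing.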

Output

The reads that pass the filtering conditions are saved to a file called accepted.fastq or accepted.fasta. You can also choose to write out the duplicate reads that are filtered out. These reads are stored in the file rejected.fastq or rejected.fasta. You can also print out a log file that contains information about the filtering task and statistics on how many reads were accepted and rejected.

Reference

This tool is based on the PRINSEQ package.