This tool is a general filtering tool for read files in FASTQ or FASTA format. It contains all the filtering methods used in the PRINSEQ package, including the length, complexity, Ns, and duplicate filtering methods, which are also available as separate filtering tools.
When you launch the filtering tool, only those filtering methods for which a value is assigned are used. You can use several filtering methods at the same time. For example, filtering with the parameters Maximum length: 100 and Maximum count of Ns: 1 produces a read set where all the reads are shorter than 101 bases and contain fewer than 2 Ns.
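The combination rule above can be illustrated with a minimal Python sketch. It is not PRINSEQ's implementation; the function name `passes_filters` and its arguments are hypothetical and only mirror the two parameters used in the example. A filter left unset (`None`) is skipped, and a read must pass every filter that has a value.

```python
def passes_filters(seq, max_len=None, max_ns=None):
    """Return True if the read passes every filter that has a value assigned.

    Hypothetical helper for illustration only; not part of PRINSEQ.
    """
    # A filter left as None is not applied, like an unset tool parameter.
    if max_len is not None and len(seq) > max_len:
        return False
    if max_ns is not None and seq.upper().count("N") > max_ns:
        return False
    return True


reads = [
    "ACGT" * 25,        # 100 bases, no Ns: kept
    "ACGT" * 25 + "A",  # 101 bases: too long
    "ACGNNT",           # 2 Ns: too many
    "ACGNT",            # 1 N: kept
]

# Maximum length: 100 and Maximum count of Ns: 1
kept = [r for r in reads if passes_filters(r, max_len=100, max_ns=1)]
print(len(kept))
```

Because both conditions must hold, adding more filters can only shrink the accepted set.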
For detailed descriptions of the different filtering conditions, please check the manual of the PRINSEQ package. The table below shows which PRINSEQ command line option each parameter corresponds to:
Parameter name | Command line option | Description |
Maximum length | -max_len | Select only reads that are no longer than the given value. |
Minimum length | -min_len | Select only reads that are at least as long as the given value. |
Maximum GC content | -max_gc | Select only reads whose GC content is at most the given value. |
Minimum GC content | -min_gc | Select only reads whose GC content is at least the given value. |
Minimum quality score | -min_qual_score | Filter out reads that have at least one quality score below the given value. |
Maximum quality score | -max_qual_score | Filter out reads that have at least one quality score above the given value. |
Minimum mean quality | -min_qual_mean | Filter out reads whose mean quality score is below the given value. |
Maximum mean quality | -max_qual_mean | Filter out reads whose mean quality score is above the given value. |
Maximum percentage of Ns | -ns_max_p | Filter out reads for which the percentage of Ns is higher than the given value. |
Maximum count of Ns | -ns_max_n | Filter out reads for which the count of Ns is higher than the given value. |
Maximum number of reads | -seq_num | Only keep the given number of reads that pass all other filters. |
Type of duplicates to filter | -derep | Type of duplicates to filter. See the PRINSEQ manual for the available duplicate types. |
Number of allowed duplicates | -derep_min | The number of allowed duplicates. For example, to remove reads that occur more than 5 times, specify the value 6. |
Dust filter threshold | -lc_method dust -lc_threshold | Use the DUST algorithm with the given threshold value, between 0 and 100, to filter reads by sequence complexity. The DUST method uses this as the maximum allowed score. |
Entropy filter threshold | -lc_method entropy -lc_threshold | Use the entropy algorithm with the given threshold value, between 0 and 100, to filter reads by sequence complexity. The entropy method uses this as the minimum allowed value. |
Quality data is in Phred+64 format | -phred64 | Select "yes" if the quality data in the FASTQ file is in Phred+64 format. For Illumina 1.8+, Sanger, Roche/454, Ion Torrent, and PacBio data, use the default value: no. |
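To show why the Phred+64 setting matters for the quality-based filters in the table, here is a minimal Python sketch of how a FASTQ quality line is decoded. It assumes the standard encodings, where each character's ASCII code minus 33 (Phred+33) or minus 64 (Phred+64) gives the quality score. The function names are hypothetical and this is not PRINSEQ's code.

```python
def phred_scores(quality_line, phred64=False):
    """Decode a FASTQ quality line into a list of integer quality scores."""
    # Phred+33 is the default; Phred+64 is used by some older Illumina data.
    offset = 64 if phred64 else 33
    return [ord(ch) - offset for ch in quality_line]


def mean_quality(quality_line, phred64=False):
    """Mean quality score of a read; the line must not be empty."""
    scores = phred_scores(quality_line, phred64)
    if not scores:
        raise ValueError("empty quality line")
    return sum(scores) / len(scores)


# The same score of 40 is written as "I" in Phred+33 and as "h" in Phred+64.
print(phred_scores("II"))                # Phred+33 decoding
print(phred_scores("hh", phred64=True))  # Phred+64 decoding
print(mean_quality("I5"))                # mean of scores 40 and 20
```

Decoding Phred+64 data with the wrong offset would shift every score up by 31, so thresholds such as Minimum mean quality would no longer filter as intended.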
The reads that pass the filtering conditions are saved to a file called accepted.fastq or accepted.fasta. You can also choose to write out the reads that are filtered out. These reads are stored in the file rejected.fastq or rejected.fasta. In addition, you can print out a log file that contains information about the filtering task and statistics on how many reads were accepted and rejected.
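The split into accepted and rejected reads, together with the counts reported in the log, can be sketched as follows. This is an illustrative Python sketch only: the function `partition_reads`, the `keep` predicate, and the layout of the statistics dictionary are assumptions and do not reproduce the actual PRINSEQ log format.

```python
def partition_reads(reads, keep):
    """Split reads into accepted and rejected lists and count both.

    `keep` is any function that returns True for reads passing the filters.
    """
    accepted, rejected = [], []
    for read in reads:
        (accepted if keep(read) else rejected).append(read)
    stats = {
        "input": len(accepted) + len(rejected),
        "accepted": len(accepted),
        "rejected": len(rejected),
    }
    return accepted, rejected, stats


# Example: reject every read that contains an N.
accepted, rejected, stats = partition_reads(
    ["ACGT", "ANNT", "AC"], lambda read: "N" not in read
)
print(stats)
```

Every input read ends up in exactly one of the two lists, so the accepted and rejected counts always add up to the input count.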