Removing degenerate characters
Degenerate IUPAC base symbols denote a site that could be any one of several characters. In DNA, for example, “Y” denotes a pyrimidine, so the site is either “C” or “T”.
Note
In many molecular evolutionary and phylogenetic analyses, the gap character “-” is treated as “N”, meaning any base.
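To make the idea of a degenerate symbol concrete, here is a minimal sketch in plain Python of the standard IUPAC DNA ambiguity codes. The names `IUPAC_DNA` and `possible_bases` are hypothetical helpers for illustration and are not part of any library; treating the gap as “N” follows the note above.

```python
# Standard IUPAC ambiguity codes for DNA: each symbol maps to the bases
# it can stand for. (IUPAC_DNA is a hypothetical name for this sketch.)
IUPAC_DNA = {
    "R": "AG",   # purine
    "Y": "CT",   # pyrimidine
    "S": "CG",
    "W": "AT",
    "K": "GT",
    "M": "AC",
    "B": "CGT",  # not A
    "D": "AGT",  # not C
    "H": "ACT",  # not G
    "V": "ACG",  # not T
    "N": "ACGT", # any base
}


def possible_bases(symbol: str) -> str:
    """Return the bases a single DNA symbol can represent."""
    symbol = symbol.upper()
    if symbol == "-":
        # Assumption taken from the note above: a gap is treated as "N".
        return IUPAC_DNA["N"]
    # Canonical bases (A, C, G, T) are not in the table and map to themselves.
    return IUPAC_DNA.get(symbol, symbol)


print(possible_bases("Y"))
print(possible_bases("-"))
```

A symbol is degenerate in this sense whenever `possible_bases` returns more than one base.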
Let’s create sample data with degenerate characters.
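As a stand-in for the sample data, the sketch below uses a plain Python dictionary of two aligned sequences. The sequence names and contents are hypothetical, chosen so that one sequence carries a gap and the other a degenerate “Y”; the helper `non_canonical_columns` is likewise an assumption made for illustration.

```python
# Hypothetical sample data: two aligned DNA sequences of equal length.
# s1 has a gap ("-") at index 4; s2 has the degenerate symbol "Y" at index 7.
sample = {
    "s1": "ACGA-GACG",
    "s2": "GATGATGYT",
}

CANONICAL = set("ACGT")


def non_canonical_columns(seqs):
    """Return 0-based indices of aligned columns holding any non-ACGT character."""
    return [
        index
        for index, column in enumerate(zip(*seqs.values()))
        if not set(column) <= CANONICAL
    ]


print(non_canonical_columns(sample))
```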
Omit aligned columns containing a degenerate character
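The logic of dropping whole columns can be sketched in plain Python as follows. This is an illustration of the idea under stated assumptions, not the library's implementation: the function name `omit_degenerate_columns` and the sample sequences are hypothetical, and gaps are counted as degenerate.

```python
CANONICAL = frozenset("ACGT")


def omit_degenerate_columns(seqs):
    """Drop every aligned column in which any sequence has a non-ACGT character.

    Gaps ("-") count as degenerate here, so columns with a gap are removed too.
    """
    names = list(seqs)
    # Transpose the alignment so each item is one column across all sequences.
    columns = zip(*(seqs[name] for name in names))
    kept = [column for column in columns if set(column) <= CANONICAL]
    # Transpose back: rebuild each sequence from the columns that survived.
    return {
        name: "".join(column[i] for column in kept)
        for i, name in enumerate(names)
    }


# Hypothetical sample: gap at index 4 of s1, "Y" at index 7 of s2.
sample = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
print(omit_degenerate_columns(sample))
```

Both affected columns are removed from every sequence, so the result stays aligned.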
Omit all degenerate characters except gaps from an alignment
If we create the app with the argument gap_is_degen=False, we can omit degenerate characters but retain gaps.
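The effect of the flag can be illustrated with a small plain-Python sketch. Only the parameter name `gap_is_degen` comes from the text; the function `drop_degenerate_columns`, its behavior, and the sample sequences are assumptions made for illustration rather than the app's actual code.

```python
CANONICAL = frozenset("ACGT")
GAP = "-"


def drop_degenerate_columns(seqs, gap_is_degen=True):
    """Drop aligned columns containing a degenerate character.

    With gap_is_degen=False the gap character is allowed, so columns whose
    only non-ACGT character is a gap are retained.
    """
    allowed = CANONICAL if gap_is_degen else CANONICAL | {GAP}
    names = list(seqs)
    columns = zip(*(seqs[name] for name in names))
    kept = [column for column in columns if set(column) <= allowed]
    return {
        name: "".join(column[i] for column in kept)
        for i, name in enumerate(names)
    }


# Hypothetical sample: gap at index 4 of s1, "Y" at index 7 of s2.
sample = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
print(drop_degenerate_columns(sample, gap_is_degen=False))
```

With the flag off, only the column containing “Y” is removed and the gapped column survives.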
Omit k-mers which contain degenerate characters
If we create omit_degenerates with the argument motif_length, it will split sequences into non-overlapping tuples of the specified length and exclude any tuple that contains a degenerate character.
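The tuple-wise filtering described above can be sketched in plain Python. This is a hedged illustration, not the library's code: the function name `omit_degenerate_motifs` and the sample data are hypothetical, gaps are counted as degenerate, and dropping a trailing remainder shorter than `motif_length` is an assumption of this sketch.

```python
CANONICAL = frozenset("ACGT")


def omit_degenerate_motifs(seqs, motif_length=1):
    """Split the alignment into non-overlapping tuples and drop degenerate ones.

    A tuple (block of motif_length columns) is excluded if any sequence has a
    non-ACGT character anywhere inside it.
    """
    if motif_length < 1:
        raise ValueError("motif_length must be a positive integer")
    names = list(seqs)
    length = min((len(seqs[name]) for name in names), default=0)
    # Assumption: a trailing remainder shorter than motif_length is dropped.
    usable = length - length % motif_length
    kept = {name: [] for name in names}
    for start in range(0, usable, motif_length):
        chunks = {name: seqs[name][start:start + motif_length] for name in names}
        if all(set(chunk) <= CANONICAL for chunk in chunks.values()):
            for name, chunk in chunks.items():
                kept[name].append(chunk)
    return {name: "".join(parts) for name, parts in kept.items()}


# Hypothetical sample: gap at index 4 of s1, "Y" at index 7 of s2.
sample = {"s1": "ACGA-GACG", "s2": "GATGATGYT"}
print(omit_degenerate_motifs(sample, motif_length=3))
```

With a tuple length of 3, the second and third tuples each contain a non-canonical character, so only the first tuple of each sequence remains. This is why a larger tuple length can discard more of the alignment than column-wise filtering.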