I'm currently working on aligning sequences, and I need to compute the similarity between pairs of DNA 'words' of a particular length.
For amino acids I am able to use the substitution matrices in Biopython (Bio.SubsMat.MatrixInfo).
However, I haven't found anything similar for DNA, so I read up and found that most systems use a match/mismatch scoring system in which each nucleotide match and mismatch is scored and the scores are then summed. This works fine as long as I am only dealing with A, G, C, and T, but I run into problems when I get a sequence containing N, M, and the like (codes for an unknown or ambiguous nucleotide).
Is there a standard way to handle the situation with unknowns? That is, how do I score A versus N or M versus N?
Thanks in advance.
BLASTN does not use a substitution matrix. Instead, it uses scores for matches, mismatches, and gaps, all of which you can set yourself.
BLASTN currently has no option for scoring matches against unknowns. They are treated as mismatches (as shown below). If the unknowns fall in the middle of an HSP, you can probably re-score the HSP according to your own scheme using a Python script. If the N stretch is disrupting the HSP, you can try relaxing the mismatch penalty and reducing the word size (basically, reducing stringency). I can't think of any other solution.
Query  1    CAGCGTCCANNTCCCGAGGTGCCGGGATTGCAGACGGAGTCTGGTTCACTCAGTGCTCAA  60
            |||||||||  |||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  8    CAGCGTCCACCTCCCGAGGTGCCGGGATTGCAGACGGAGTCTGGTTCACTCAGTGCTCAA  67

Query  61   TGGTGCCCAGGCTGGAGTGCAGTGGCGTGATCTCGGCTCGCTACANNCTCCACCTCCCAG  120
            |||||||||||||||||||||||||||||||||||||||||||||  |||||||||||||
Sbjct  68   TGGTGCCCAGGCTGGAGTGCAGTGGCGTGATCTCGGCTCGCTACAACCTCCACCTCCCAG  127

Query  121  CCGCCTGCCCTGGCCTCCCAAAGTGCCGAGATTGCAGCCTCTGCCCAGCCGCCACCCC  178
            ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct  128  CCGCCTGCCCTGGCCTCCCAAAGTGCCGAGATTGCAGCCTCTGCCCAGCCGCCACCCC  18
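The re-scoring idea mentioned above could be sketched roughly as follows. This is only one possible scheme, not something BLASTN does: it assumes each IUPAC ambiguity code stands for its possible bases with equal probability and takes the expected match/mismatch score for each column. The function names `pair_score` and `rescore`, the default scores of +1 and -1, and the restriction to ungapped, equal-length aligned strings are all assumptions made for illustration.

```python
# IUPAC nucleotide codes mapped to the bases each one can represent.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def pair_score(a, b, match=1.0, mismatch=-1.0):
    """Expected score for one aligned pair of IUPAC codes.

    Assumes (hypothetically) that every base a code can stand for is
    equally likely; match/mismatch defaults are illustrative values.
    """
    bases_a = IUPAC[a.upper()]
    bases_b = IUPAC[b.upper()]
    # Out of all (base_a, base_b) combinations, the number that are
    # identical equals the number of bases the two codes share.
    shared = len(set(bases_a) & set(bases_b))
    p_match = shared / (len(bases_a) * len(bases_b))
    return p_match * match + (1.0 - p_match) * mismatch


def rescore(query, subject, match=1.0, mismatch=-1.0):
    """Sum pair_score over an ungapped alignment of equal-length strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    return sum(pair_score(q, s, match, mismatch)
               for q, s in zip(query, subject))


if __name__ == "__main__":
    print(pair_score("A", "N"))   # A versus any base
    print(pair_score("M", "N"))   # M (A or C) versus any base
    # A short window around the first NN in the alignment above.
    print(rescore("CCANNTCC", "CCACCTCC"))
```

Under these assumptions, A versus N and M versus N both land halfway between a mismatch and a neutral score, so an N inside an HSP costs less than a true mismatch. If you would rather treat N as neutral, you could special-case it to return 0 instead.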