My question is about the CIGAR specification.
The documentation states:
M 0 alignment match (can be a sequence match or mismatch)
I 1 insertion to the reference
Q1: If I have a CIGAR string 99M170661N26M, does that mean I have 99 matches? Can that also mean 99 mismatches? What about 98 matches and 1 mismatch? The specification states it could be a sequence match or mismatch.
Q2: Assume I'm doing an RNA-seq experiment. In my CIGAR string 99M170661N26M, does that mean my read aligns to an intron which has 170661 bases? Can I think of it like this: "I have a spliced read which aligns to two exons and an intron. I have 99 matches to my first exon and 26 matches to my second exon. I have 170661 matches to my intron."?
Q3: Can I get the number of matches in my read? Does that even make sense? For example, how do I know if I have 10 exact matches?
Q1: Yes, the CIGAR operation 'M' covers both matches and mismatches. For example, if you have a CIGAR string 10M, you could have 10 bases that match perfectly, or 5 matches and 5 mismatches, or any other combination adding up to 10. So 99M can mean 99 matches, 98 matches and 1 mismatch, and so on. A mismatch may be a real variant such as a SNP (single nucleotide polymorphism), but it may also be a sequencing error; the CIGAR string does not distinguish the two.
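Q2: Not quite. The 'N' operation marks a region of the reference that the read skips, so your CIGAR string describes 99 aligned bases (matches or mismatches), then a gap of 170661 reference bases, then 26 more aligned bases. In RNA-seq that gap usually corresponds to an intron. Your read spans the intron, but none of its 125 bases align inside it, so there are no "matches to the intron".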
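To make the arithmetic for the example CIGAR string concrete, here is a minimal sketch in Python. The function names `parse_cigar` and `summarize` are my own, not part of any library; the sets of operations that consume read or reference bases follow the SAM specification's table.

```python
import re

# Operations that consume bases of the read (query) and of the reference.
QUERY_OPS = set("MIS=X")
REF_OPS = set("MDN=X")

def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    if not re.fullmatch(r"(\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]

def summarize(cigar):
    """Count read bases, reference span, and skipped (N) reference bases."""
    ops = parse_cigar(cigar)
    return {
        "read_bases": sum(n for n, op in ops if op in QUERY_OPS),
        "reference_span": sum(n for n, op in ops if op in REF_OPS),
        "skipped": sum(n for n, op in ops if op == "N"),
    }

print(summarize("99M170661N26M"))
```

For 99M170661N26M this reports 125 read bases (99 + 26), a reference span of 170786 bases, and 170661 skipped bases, which is the presumed intron.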
Q3: The CIGAR string alone cannot tell you this. To count matches and mismatches, you'd have to take each read from your SAM file, place it at its position on the reference genome, and compare the two base by base. The 1-based leftmost mapping position in field 4 (POS), together with the CIGAR string, tells you which reference base each read base lines up with. If your aligner wrote the optional NM (edit distance) or MD (mismatching positions) tags, those already carry this information.
You can write your own code to accomplish this or use an existing tool such as Samtools; variant calling itself is, I believe, handled by its companion program bcftools.
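If you do write your own code, the base-by-base comparison could look roughly like the sketch below. It assumes you already have the read sequence, the reference sequence of the chromosome as a plain string, the POS value, and the CIGAR string; the function name `count_matches` and the toy sequences are made up for illustration.

```python
import re

def count_matches(read_seq, ref_seq, pos, cigar):
    """Return (matches, mismatches) over the aligned (M, =, X) positions.

    pos is the 1-based leftmost position from SAM field 4;
    ref_seq is the reference sequence as a 0-based Python string.
    """
    ref_i = pos - 1   # convert 1-based POS to a 0-based index
    read_i = 0
    matches = mismatches = 0
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        n = int(length)
        if op in "M=X":
            # Aligned block: compare read and reference base by base.
            for k in range(n):
                if read_seq[read_i + k].upper() == ref_seq[ref_i + k].upper():
                    matches += 1
                else:
                    mismatches += 1
            read_i += n
            ref_i += n
        elif op in "IS":
            read_i += n   # insertion / soft clip: read bases only
        elif op in "DN":
            ref_i += n    # deletion / skipped region: reference bases only
        # H and P consume neither read nor reference bases
    return matches, mismatches

# Toy example: two 4-base "exons" separated by a 6-base "intron".
ref = "ACGT" + "GTAAAG" + "TTCA"
read = "ACGATTCA"  # fourth base differs from the reference
print(count_matches(read, ref, 1, "4M6N4M"))
```

In the toy example the CIGAR string is 4M6N4M, yet only 7 of the 8 aligned bases actually match, which is exactly the information the 'M' operation hides.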