Re: [Samtools-help] mpileup output (base column)
Hi John, Thank you, this helps clarify things a lot. I should have read up on the BAM/SAM format specification more before analyzing my pileup output. Best, Nicholas Hill On Thu, Feb 22, 2018 at 6:29 AM, John Marshall < john.w.marsh...@glasgow.ac.uk> wrote: > On 21 Feb 2018, at 20:05, Nicholas Hill wrote: > > Say my pileup line is this: > > chr373912 A 21 g,,..G,.gGGGgGGg,.Ggg > JJ > > > So at chr3:73912 the reference was an A. On the forward strand (read1), > there are 7 guanine base pairs that aligned to the reference sequence at > this position. Additionally, on the reverse strand (read2), there are 6 > guanine base pairs that aligned to the reference sequence at this position > ( or is it 6 cytosine base pairs, given that it is the reverse?). This is > where I am confused. > > It may be clearer if you mock up a small SAM file and mpileup it yourself > so you can see what is going on. For example, > > @SQ SN:chr3 LN:100 > foo 0 chr373910 20 6M * 0 0 > GG AA > bar 16 chr373912 20 6M * 0 0 > GG BB > > "samtools mpileup foobar.sam" gives 2 Gg AB mpileup columns at > positions 73912-73915. The lowercase "g"s correspond to the bases of bar -- > as viewed on the reference sequence's forward strand, as one always does > for mapped data in SAM files, as Tom noted. This read has been > reverse-complemented (flags=16), so it was CC as viewed when it came > off the sequencing machine, and it has been mapped to the reverse strand of > the reference. > > > Also, what if my reference base is lowercase: > > chr373912 a 21 g,,..G,.gGGGgGGg,.Ggg > JJ > > > Does this mean that the reference base is actually a thymine, given that > it is from the reference genome? > > No, it's an A. All that mpileup documentation about upper/lower-case and > dots/commas indicating strands pertains only to the "g,,..G,.gGGGgGGg,.Ggg" > column. > > The reference base in column 3 is simply copied from your -f genome.fa > file. So your genome.fa reference genome must have a mixture of upper- and > lowercase reference bases in it. Presumably this is some kind of masking, > but what it's supposed to indicate would be a question for whoever made > that reference genome .fa file -- mpileup is not altering it in any way. > > John -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot___ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help
Re: [Samtools-help] mpileup output (base column)
On 21 Feb 2018, at 20:05, Nicholas Hill wrote: > Say my pileup line is this: > chr373912 A 21 g,,..G,.gGGGgGGg,.Ggg JJ > So at chr3:73912 the reference was an A. On the forward strand (read1), there > are 7 guanine base pairs that aligned to the reference sequence at this > position. Additionally, on the reverse strand (read2), there are 6 guanine > base pairs that aligned to the reference sequence at this position ( or is it > 6 cytosine base pairs, given that it is the reverse?). This is where I am > confused. It may be clearer if you mock up a small SAM file and mpileup it yourself so you can see what is going on. For example, @SQ SN:chr3 LN:100 foo 0 chr373910 20 6M * 0 0 GG AA bar 16 chr373912 20 6M * 0 0 GG BB "samtools mpileup foobar.sam" gives 2 Gg AB mpileup columns at positions 73912-73915. The lowercase "g"s correspond to the bases of bar -- as viewed on the reference sequence's forward strand, as one always does for mapped data in SAM files, as Tom noted. This read has been reverse-complemented (flags=16), so it was CC as viewed when it came off the sequencing machine, and it has been mapped to the reverse strand of the reference. > Also, what if my reference base is lowercase: > chr373912 a 21 g,,..G,.gGGGgGGg,.Ggg JJ > Does this mean that the reference base is actually a thymine, given that it > is from the reference genome? No, it's an A. All that mpileup documentation about upper/lower-case and dots/commas indicating strands pertains only to the "g,,..G,.gGGGgGGg,.Ggg" column. The reference base in column 3 is simply copied from your -f genome.fa file. So your genome.fa reference genome must have a mixture of upper- and lowercase reference bases in it. Presumably this is some kind of masking, but what it's supposed to indicate would be a question for whoever made that reference genome .fa file -- mpileup is not altering it in any way. John -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help
Re: [Samtools-help] mpileup output (base column)
Nicholas - Letter is always relative to the forward strand (how it would look in the genome reference sequence). My guess is that upper and lower case in column 3 distinguish coding from non-coding bases, or bases within repetitive sequence, or something like that. (Take a look at the corresponding region in the genome reference sequence.) - tom blackwell - On Wed, 21 Feb 2018, Nicholas Hill wrote: Hi all, Could someone tell me if I am thinking about this correctly? The wording in the documentation isn't clear enough for me to be confident that I am correct. Say my pileup line is this: chr373912 A 21 g,,..G,.gGGGgGGg,.Ggg JJ -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Samtools-help mailing list Samtools-help@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/samtools-help