Re: [Samtools-help] mpileup output (base column)

2018-02-22 Thread Nicholas Hill
Hi John,

Thank you, this helps clarify things a lot. I should have read up on the
BAM/SAM format specification more before analyzing my pileup output.

Best,
Nicholas Hill

On Thu, Feb 22, 2018 at 6:29 AM, John Marshall <
john.w.marsh...@glasgow.ac.uk> wrote:

> On 21 Feb 2018, at 20:05, Nicholas Hill  wrote:
> > Say my pileup line is this:
> > chr373912   A   21  g,,..G,.gGGGgGGg,.Ggg
>  JJ >
> > So at chr3:73912 the reference was an A. On the forward strand (read1),
> there are 7 guanine base pairs that aligned to the reference sequence at
> this position. Additionally, on the reverse strand (read2), there are 6
> guanine base pairs that aligned to the reference sequence at this position
> ( or is it 6 cytosine base pairs, given that it is the reverse?). This is
> where I am confused.
>
> It may be clearer if you mock up a small SAM file and mpileup it yourself
> so you can see what is going on.  For example,
>
> @SQ SN:chr3 LN:100
> foo 0   chr373910   20  6M  *   0   0
>  GG  AA
> bar 16  chr373912   20  6M  *   0   0
>  GG  BB
>
> "samtools mpileup foobar.sam" gives  2  Gg  AB  mpileup columns at
> positions 73912-73915. The lowercase "g"s correspond to the bases of bar --
> as viewed on the reference sequence's forward strand, as one always does
> for mapped data in SAM files, as Tom noted. This read has been
> reverse-complemented (flags=16), so it was CC as viewed when it came
> off the sequencing machine, and it has been mapped to the reverse strand of
> the reference.
>
> > Also, what if my reference base is lowercase:
> > chr373912   a   21  g,,..G,.gGGGgGGg,.Ggg
>  JJ >
> > Does this mean that the reference base is actually a thymine, given that
> it is from the reference genome?
>
> No, it's an A. All that mpileup documentation about upper/lower-case and
> dots/commas indicating strands pertains only to the "g,,..G,.gGGGgGGg,.Ggg"
> column.
>
> The reference base in column 3 is simply copied from your -f genome.fa
> file. So your genome.fa reference genome must have a mixture of upper- and
> lowercase reference bases in it. Presumably this is some kind of masking,
> but what it's supposed to indicate would be a question for whoever made
> that reference genome .fa file -- mpileup is not altering it in any way.
>
> John
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help


Re: [Samtools-help] mpileup output (base column)

2018-02-22 Thread John Marshall
On 21 Feb 2018, at 20:05, Nicholas Hill  wrote:
> Say my pileup line is this:
> chr373912   A   21  g,,..G,.gGGGgGGg,.Ggg   JJ 
> So at chr3:73912 the reference was an A. On the forward strand (read1), there 
> are 7 guanine base pairs that aligned to the reference sequence at this 
> position. Additionally, on the reverse strand (read2), there are 6 guanine 
> base pairs that aligned to the reference sequence at this position ( or is it 
> 6 cytosine base pairs, given that it is the reverse?). This is where I am 
> confused.

It may be clearer if you mock up a small SAM file and mpileup it yourself so 
you can see what is going on.  For example,

@SQ SN:chr3 LN:100
foo 0   chr373910   20  6M  *   0   0   GG  
AA
bar 16  chr373912   20  6M  *   0   0   GG  
BB

"samtools mpileup foobar.sam" gives  2  Gg  AB  mpileup columns at positions 
73912-73915. The lowercase "g"s correspond to the bases of bar -- as viewed on 
the reference sequence's forward strand, as one always does for mapped data in 
SAM files, as Tom noted. This read has been reverse-complemented (flags=16), so 
it was CC as viewed when it came off the sequencing machine, and it has 
been mapped to the reverse strand of the reference.

> Also, what if my reference base is lowercase: 
> chr373912   a   21  g,,..G,.gGGGgGGg,.Ggg   JJ 
> Does this mean that the reference base is actually a thymine, given that it 
> is from the reference genome? 

No, it's an A. All that mpileup documentation about upper/lower-case and 
dots/commas indicating strands pertains only to the "g,,..G,.gGGGgGGg,.Ggg" 
column.

The reference base in column 3 is simply copied from your -f genome.fa file. So 
your genome.fa reference genome must have a mixture of upper- and lowercase 
reference bases in it. Presumably this is some kind of masking, but what it's 
supposed to indicate would be a question for whoever made that reference genome 
.fa file -- mpileup is not altering it in any way.

John
--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help


Re: [Samtools-help] mpileup output (base column)

2018-02-21 Thread Thomas W. Blackwell

Nicholas  -

Letter is always relative to the forward strand (how it would 
look in the genome reference sequence).  My guess is that upper 
and lower case in column 3 distinguish coding from non-coding 
bases, or bases within repetitive sequence, or something like 
that.  (Take a look at the corresponding region in the genome 
reference sequence.)


-  tom blackwell  -

On Wed, 21 Feb 2018, Nicholas Hill wrote:


Hi all,

Could someone tell me if I am thinking about this correctly? The wording in
the documentation isn't clear enough for me to be confident that I am
correct.

Say my pileup line is this:
chr373912   A   21  g,,..G,.gGGGgGGg,.Ggg
JJ

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help