Re: [Samtools-help] Unexpected insert-size distribution given by samtools stats

2018-07-13 Thread James Bonfield
On Wed, Jul 11, 2018 at 03:24:13PM +0200, Dr. Frank Vorh?lter wrote:
> Dear James,
> 
> Thanks a lot for taking care. 
> The alignment that was analysed by samtools was generated by bwa and 
> subsequently processed using samtools. However, older versions of the 
> software were applied, bwa 0.7.5a-r405 and samtools 1.2 (using htslib 1.2.1). 
> The NGS reads had been trimmed before using cutadapt. Hence, part of the 
> reads were shorter than 150.  No other tools than bwa bwasw and samtools 
> fixmate & sort were involved in obtaining the BAM file with the alignment 
> that was analysed using samtools stats. For data analysis, a current version 
> of samtools was applied, samtools 1.8 (using htslib 1.8).

The insert size is very clearly halving before 150 (not doubly after I
think, when compared with the awk data).  I don't see this behaviour
with my tests, but I suspect something is causing it to filter data.

It's curious that there is a difference between a trivial awk script
vs samtools stats.  Obviously there may be a small number of badly
mapped pairs, supplementary and secondary reads, etc that could skew
the figures for the basic awk, but the key thing is there is no
discontinuity in the awk data.

Looking at the samtools stats code it has a check for PAIRED and
!UNMAPPED and !MATE_UNMAPPED.  It doesn't check for the PROPER_PAIR
flag.  (I didn't look at any of these in the awk noddy script.)

It then has some logic to halve the insert counts as the size will be
doubled due to counting each end twice (see comment "// Each pair was
counted twice" in stats.c).  I don't *think* this will be a problem,
but it does seem suspicious.

At this point I'm struggling to work out quite what causes this on
your data as I haven't seen it elsewhere.  Are you using any options
to samtools stats, eg -i for maximum insert size or -f / -F for
filtering?  If you supply -F 0xF00 (filtering SECONDARY, QCFAIL, DUP
and SUPPLEMENTARY) does the problem persist?  Does the awk equivalent
when given samtools view -F 0xF00 as input also show the problem?

I'm clutching at straws here to be honest!

James


-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help


Re: [Samtools-help] Unexpected insert-size distribution given by samtools stats

2018-07-11 Thread James Bonfield
Hello,

On Mon, Jul 09, 2018 at 05:51:15PM +0200, Dr. Frank Vorh?lter wrote:
> I had a detailed view on the insert size distribution of NGS read
> pairs from a BAM file using SAMtools stats of samtools 1.8 with htslib
> 1.8. The BAM file with the mapped reads represents a library where
> transposon-based fragmentation resulted in fragments that partly were
> shorter than the read lengths of the subsequent sequencing run on an
> Illumina instrument. Surprisingly, as you may see from the included
> diagram, the number of read pairs increased by almost exactly 100%
> when the insert size exceeded 150, which was the number of cycles in
> the sequencing run. This phenomenon was not obvious from experimental
> measurements of the library fragment lengths.

Doubling in size at 150 sounds suspiciously like a read-length effect,
possibly caused by something that thinks exact overlaps are
duplicates?  Are all of your reads 150bp?  Was there anything other
than fixmate and sort that was applied to this data?

I tried a really noddy example which generates a random genome and
picks 100bp pairs with an insert sizes with a flat distribution of 50
to 250bp. See below[1].  I then tried bwasw and mem alignments.  Eg:

gen.pl > test.fa
bwa index ref.fa
bwa mem ref.fa test.fa | samtools fixmate - - | tee test.bam | samtools 
stats - | egrep '^IS' > _
gnuplot
( plot "_" using 2:3 )

The observed plot matches the expected distribution, so I cannot
explain the cause of your observed distribution.

What if we just look at the isize figures quoted in the BAM file?  Eg:

samtools view test.bam|awk '{a[$9]++} END {for (i in a) {print 
i,a[i]}}'|sort -n

I also see the same distribution (albeit now positive and negative
halves).  This is a very crude check to see if the problem lies in the
ISIZE values produce by fixmate (which are directly a consequence of
the alignments themselves) or in the stats generation code.

James

[1]
#!/usr/bin/perl -w

# Create sequence
srand(0);
my @a = qw/A C G T/;
my $len = 10;
my $seq;

for (my $i=0; $i<$len; $i++) {
$seq .= $a[int(rand()*4)];
}
open(my $ref, ">ref.fa") || die;
print $ref ">ref\n$seq\n";
close($ref);

# Pick pairs of seqs 100 base long, with flat insert size between 50 and 250
for (my $i = 0; $i < 1; $i++) {
my $pos = int(rand()*($len-500*2))+500;
my $isize = int(rand()*200)+50;

print ">seq_$i\n",substr($seq,$pos,100),"\n";
my $s = reverse(substr($seq,$pos+$isize-100,100));
$s =~ tr/ACGT/TGCA/;
print ">seq_$i\n$s\n";
}



> 
>  
> What might cause this phenomenon, and is there a way to avoid it in case I 
> made a mistake? The alignment was generated using bwa bwasw and subsequently 
> processed using samtools fixmate and samtools sort. I were glad to understand 
> this puzzling effect.
> 
>  
> Frank
> 
>  
> 
>  
>  
>  
> Dr. rer. nat. Frank Vorh?lter
> Molekularbiologe
>  
> ?ber?rtliche Berufsaus?bungsgemeinschaft
> Medizinisches Versorgungszentrum
> Dr. Eberhard & Partner Dortmund
> MVZ-Haus 1: Brauhausstr. 4
> 44137 Dortmund, Germany 
> 
>  
> Tel.: +49 231 9572 6612
> 
>  


> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot

> ___
> Samtools-help mailing list
> Samtools-help@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/samtools-help


-- 
James Bonfield (j...@sanger.ac.uk)
The Sanger Institute, Hinxton, Cambs, CB10 1SA


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Samtools-help mailing list
Samtools-help@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/samtools-help