Re: [galaxy-user] Number of mismatches allowed in the initial read mapping

2012-09-07 Thread Jennifer Jackson

Hello Jianguang,

This is in reply to this email and to your prior email from yesterday 
(9/6) with the subject "TopHat settings".


The testing here was a very good way to see how parameters impact mapping.

In addition, see below ...

On 9/6/12 2:32 PM, Du, Jianguang wrote:


Dear All,

I tested how to set the "Number of mismatches allowed in the initial 
read mapping" as follows.


At first, I ran FASTQ Groomer on a dataset to get the total number of 
reads. The total number of reads is 17510227.


Then I ran TopHat after setting "Number of mismatches allowed in the 
initial read mapping" to 1, and then ran "flagstat" under "NGS: SAM 
Tools". Here is the statistical information for the TopHat output:


18162942 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

18162942 + 0 mapped (100.00%:-nan%)

0 + 0 paired in sequencing

0 + 0 read1

0 + 0 read2

0 + 0 properly paired (-nan%:-nan%)

0 + 0 with itself and mate mapped

0 + 0 singletons (-nan%:-nan%)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

Next I ran TopHat after setting "Number of mismatches allowed in the 
initial read mapping" to 0, and then ran "flagstat" under "NGS: SAM 
Tools". Here is the statistical information for the TopHat output:


16100027 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

16100027 + 0 mapped (100.00%:-nan%)

0 + 0 paired in sequencing

0 + 0 read1

0 + 0 read2

0 + 0 properly paired (-nan%:-nan%)

0 + 0 with itself and mate mapped

0 + 0 singletons (-nan%:-nan%)

0 + 0 with mate mapped to a different chr

0 + 0 with mate mapped to a different chr (mapQ>=5)

Does it mean about 0.6 million reads are aligned two or more times 
after I set "Number of mismatches allowed in the initial read mapping" 
to 1,


Maybe, but I don't think it is that simple, nor something that is 
important for the final result. What this really means in the end is 
that more reads were permitted to map because the criteria were less 
stringent. A mismatch of 1 was allowed in the initial step, so more 
reads were available to meet the other mapping criteria. More of these 
passed the downstream mapping criteria than in the other dataset and were 
eventually included in the output. This is why the number of mapped 
reads is higher.
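As a side note, flagstat counts alignment records rather than distinct reads, so a read reported at several locations is counted once per location. A rough back-of-the-envelope sketch of the arithmetic behind your "0.6 million" figure (this is my own illustration, not an official Galaxy or TopHat calculation):

```python
# flagstat "in total" counts alignment records, so a multi-mapped read
# appears once per reported location. Comparing the record count to the
# groomed read count bounds how many "extra" alignments there are.

total_reads = 17510227   # reads reported by FASTQ Groomer
records_mm1 = 18162942   # flagstat "in total" with mismatch = 1

extra_alignments = records_mm1 - total_reads
print(extra_alignments)  # 652715 extra alignment records
```

That surplus of roughly 0.65 million records is consistent with some reads being reported at more than one location, though it does not tell you how many distinct reads are involved.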


however about 1.4 million reads cannot be aligned because of the more 
stringent setting?


Yes, if you use more stringent criteria (with any mapping tool, not just 
TopHat), less of your data will map. A mismatch of 0 means an exact match, 
which is maximum stringency, so fewer reads met the initial mapping 
criteria, removing them from downstream evaluation by the other mapping 
criteria. Then, fewer of them passed these downstream mapping criteria 
than in the other dataset, and fewer were included in the output. This is 
why the number of mapped reads is lower.


If the other mapping criteria for both runs were the same, and the only 
change was to this one variable, then a reasonable way to explain these 
results would be to state something like: the initial mapping with 
mismatch 0 filtered out sequences that would otherwise have mapped if 
mismatch 1 had been used instead.
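To put rough numbers on that statement, here is a quick sketch using only the flagstat totals quoted above (again my own arithmetic, and approximate because these totals count alignment records rather than distinct reads):

```python
# Back-of-the-envelope comparison of the two TopHat runs, using the
# flagstat "in total" values quoted above.

total_reads = 17510227   # reads after FASTQ Groomer
records_mm1 = 18162942   # flagstat "in total" with mismatch = 1
records_mm0 = 16100027   # flagstat "in total" with mismatch = 0

# Reads (at most) lost by requiring an exact match in the initial step:
dropped_at_mm0 = total_reads - records_mm0
print(dropped_at_mm0)            # 1410200 -- the "about 1.4 million"

# Difference in output size between the two runs:
print(records_mm1 - records_mm0) # 2062915
```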


Which number should we choose?

This is something that you will need to decide. There are likely many 
ways to analyze this further, but sometimes just looking at 
some of the data in a browser can provide a lot of information that 
statistics cannot. Pick a few favorite gene bounds (complex and simple) 
with spliced transcripts, add in your mapping results, put the data into 
a browser (Trackster, UCSC, etc.), and see which results make the most 
sense for your particular experiment, dataset, and genome. (There are 
no hard rules around this.)


I don't mean to push you towards another list again, but I want you to 
get the answers you need. If you really do have serious concerns about 
how the TopHat mapping algorithm itself is functioning, or suspect a 
problem, the tool authors and the mailing list dedicated to this exact 
topic are really the best resources for discussing the finer details: 
tophat.cuffli...@gmail.com


Best wishes for your project,

Jen
Galaxy team


Thanks.

Jianguang



___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using "reply all" in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

   http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

   http://lists.bx.psu.edu/


--
Jennifer Jackson
http://galaxyproject.org

