Re: [galaxy-user] [galaxy-bugs] GI errors in the megablast table of results ?

2012-03-01 Thread Peter Cock
Hello all,

Did this issue get resolved?

If Sandrine was right about there being an off-by-one error in the GI numbers
in the BLAST tabular output, it could be a bug in the 'legacy' blastall command.

I say 'legacy' BLAST because that's what Galaxy's NGS 'megablast' tool
is using internally (as opposed to NCBI's replacement, BLAST+).

Peter

On Wed, Jan 25, 2012 at 3:14 PM, Guru Ananda g...@psu.edu wrote:
 Dear Sandrine,

 Thanks for pointing out this issue.
 The BLAST databases we have on Galaxy are from last year, while those on
 the NCBI website are the latest (Jan 2012). As pointed out on the NCBI website
 (http://www.ncbi.nlm.nih.gov/Sitemap/sequenceIDs.html), it appears that each
 time any change is made to a sequence/database, GI numbers change as well.
 This is perhaps why you're observing discrepancies in GI numbers and lengths
 between megablast outputs on Galaxy and NCBI. I'm currently in the process
 of downloading the latest BLAST databases from NCBI, and I'll let you know
 when they're available for use on Galaxy.

 Thanks for your patience,
 Guru
 Galaxy team.


 On Wed, Nov 9, 2011 at 8:03 AM, Sandrine Hughes
 sandrine.hug...@ens-lyon.fr wrote:

 Dear all,

 I’m not sure where to send this email, so I apologize if this is the wrong
 place.

 I have a problem with the Megablast program available under NGS Mapping and
 I hope that you can help. I think there may be a problem with the output
 table, notably a shift between the GI numbers and their associated
 parameters.

 Here are the details:

 I. First, what I have done:
 I used the program to identify the species that I have in a mix of
 sequences, using the following options:
 Database: nt 27-Jun-2011
 Word size: 16
 Identity: 90.0
 Cutoff: 0.001
 Filter out low complexity regions: Yes
 I ran the analysis twice and obtained exactly the same results (I used
 the online version of Galaxy, not a local one).

 II. Second, I analysed the data obtained for one of my sequences (1-202).
 The following lines are the beginning of the table that I obtained from
 megablast, followed by two problematic lines:

 1-202   312182292   484      99.33   150   1   0   1   150   1       150     2e-75   289.0
 1-202   312182201   476      99.33   150   1   0   1   150   1       150     2e-75   289.0
 1-202   308228725   928      99.33   150   1   0   1   150   19      168     2e-75   289.0
 1-202   308228711   938      99.33   150   1   0   1   150   22      171     2e-75   289.0
 1-202   308197083   459      99.33   150   1   0   1   150   10      159     2e-75   289.0
 1-202   300392378   920      99.33   150   1   0   1   150   10      159     2e-75   289.0
 1-202   300392376   918      99.33   150   1   0   1   150   9       158     2e-75   289.0
 1-202   300392375   922      99.33   150   1   0   1   150   11      160     2e-75   289.0
 1-202   300392374   931      99.33   150   1   0   1   150   21      170     2e-75   289.0
 1-202   300392373   909      99.33   150   1   0   1   150   21      170     2e-75   289.0
 1-202   300392371   1172     99.33   150   1   0   1   150   9       158     2e-75   289.0
 ...
 1-202   179366399   151762   98.67   150   2   0   1   150   46880   47029   6e-73   281.0
 1-202   58617849    511      98.67   150   2   0   1   150   21      170     6e-73   281.0


 III. Third, what I’ve noticed:
 My first problem was that among all the species identified, two were
 very different from the expected ones (the last two lines). So I decided
 to check whether that was possible for this sequence and independently
 ran a megablast at NCBI with similar options. I was not able to find these
 two species in the results.
 So I decided to check the hits listed in the table above and found a
 second problem. In the table, the second column gives the GI of the
 database hit and the third column gives the length of the database hit.
 However, when I manually checked the length for a GI at NCBI, it did not
 match. For example, for GI 312182292 the length should be 580 and not 484.
 By checking different lines, I noticed that the length given for a GI
 corresponds to the length of GI-1. As you can see in the table above, some
 GIs are consecutive (300392376, 300392375,...). When checking the length
 of 300392376 in NCBI, I should have 920. But when I checked 300392375, I
 found 918. And this was true for the following lines: 300392374 normally
 gives 922 and 300392373 gives 931... My conclusion at that point was that
 there was a shift of –1 between the GI and the other parameters of the
 line (indeed the parameters for the remaining columns are in agreement
 with the 

Re: [galaxy-user] [galaxy-bugs] GI errors in the megablast table of results ?

2012-01-25 Thread Guru Ananda
Dear Sandrine,

Thanks for pointing out this issue.
The BLAST databases we have on Galaxy are from last year, while those on
the NCBI website are the latest (Jan 2012). As pointed out on the NCBI website
(http://www.ncbi.nlm.nih.gov/Sitemap/sequenceIDs.html), it appears that each
time any change is made to a sequence/database, GI numbers change as well.
This is perhaps why you're observing discrepancies in GI numbers and
lengths between megablast outputs on Galaxy and NCBI. I'm currently in the
process of downloading the latest BLAST databases from NCBI, and I'll let
you know when they're available for use on Galaxy.

Thanks for your patience,
Guru
Galaxy team.


On Wed, Nov 9, 2011 at 8:03 AM, Sandrine Hughes sandrine.hug...@ens-lyon.fr
 wrote:

  Dear all,

 I’m not sure where to send this email, so I apologize if this is the wrong
 place.

 I have a problem with the Megablast program available under NGS Mapping and
 I hope that you can help. I think there may be a problem with the output
 table, notably a shift between the GI numbers and their associated
 parameters.

 Here are the details:

 I. First, what I have done:
 I used the program to identify the species that I have in a mix of
 sequences, using the following options:
 Database: nt 27-Jun-2011
 Word size: 16
 Identity: 90.0
 Cutoff: 0.001
 Filter out low complexity regions: Yes
 I ran the analysis twice and obtained exactly the same results (I used
 the online version of Galaxy, not a local one).

 II. Second, I analysed the data obtained for one of my sequences (1-202).
 The following lines are the beginning of the table that I obtained from
 megablast, followed by two problematic lines:

 1-202   312182292   484      99.33   150   1   0   1   150   1       150     2e-75   289.0
 1-202   312182201   476      99.33   150   1   0   1   150   1       150     2e-75   289.0
 1-202   308228725   928      99.33   150   1   0   1   150   19      168     2e-75   289.0
 1-202   308228711   938      99.33   150   1   0   1   150   22      171     2e-75   289.0
 1-202   308197083   459      99.33   150   1   0   1   150   10      159     2e-75   289.0
 1-202   300392378   920      99.33   150   1   0   1   150   10      159     2e-75   289.0
 1-202   300392376   918      99.33   150   1   0   1   150   9       158     2e-75   289.0
 1-202   300392375   922      99.33   150   1   0   1   150   11      160     2e-75   289.0
 1-202   300392374   931      99.33   150   1   0   1   150   21      170     2e-75   289.0
 1-202   300392373   909      99.33   150   1   0   1   150   21      170     2e-75   289.0
 1-202   300392371   1172     99.33   150   1   0   1   150   9       158     2e-75   289.0
 ...
 1-202   179366399   151762   98.67   150   2   0   1   150   46880   47029   6e-73   281.0
 1-202   58617849    511      98.67   150   2   0   1   150   21      170     6e-73   281.0


 III. Third, what I’ve noticed:
 My first problem was that among all the species identified, two were
 very different from the expected ones (the last two lines). So I decided
 to check whether that was possible for this sequence and independently
 ran a megablast at NCBI with similar options. I was not able to find these
 two species in the results.
 So I decided to check the hits listed in the table above and found a
 second problem. In the table, the second column gives the GI of the
 database hit and the third column gives the length of the database hit.
 However, when I manually checked the length for a GI at NCBI, it did not
 match. For example, for GI 312182292 the length should be 580 and not 484.
 By checking different lines, I noticed that the length given for a GI
 corresponds to the length of GI-1. As you can see in the table above, some
 GIs are consecutive (300392376, 300392375,...). When checking the length
 of 300392376 in NCBI, I should have 920. But when I checked 300392375, I
 found 918. And this was true for the following lines: 300392374 normally
 gives 922 and 300392373 gives 931... My conclusion at that point was that
 there was a shift of –1 between the GI and the other parameters of the
 line (indeed the parameters for the remaining columns are in agreement
 with the length of GI-1). However, that’s not always true. For some GIs
 given in the table (for example, the last two lines), if we check the
 parameters of GI-1, they are completely different... So I suppose that
 there is a problem with the GI sorting during the megablast, but I’m not
 able to clearly define it.
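
 The GI-1 check described above could be sketched in Python roughly as
 follows. This is only a minimal sketch under stated assumptions: each hit
 sits on one whitespace-separated line with the GI in column 2 and the hit
 length in column 3, the names parse_hits and shifted_gis are hypothetical,
 and the NCBI lengths are just the three values quoted above, typed in by
 hand rather than fetched from NCBI.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    query: str
    gi: int              # column 2: GI of the database hit
    subject_length: int  # column 3: length of the database hit


def parse_hits(text: str) -> List[Hit]:
    """Parse whitespace-separated rows; only the first three columns are used."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines, "..." markers and anything without numeric GI/length.
        if len(fields) < 3 or not (fields[1].isdigit() and fields[2].isdigit()):
            continue
        hits.append(Hit(fields[0], int(fields[1]), int(fields[2])))
    return hits


def shifted_gis(hits: List[Hit], ncbi_lengths: Dict[int, int]) -> List[int]:
    """Return the GIs whose reported length equals the NCBI length of GI - 1."""
    return [h.gi for h in hits if ncbi_lengths.get(h.gi - 1) == h.subject_length]


# Four rows copied from the table above.
ROWS = """\
1-202 300392376 918 99.33 150 1 0 1 150 9 158 2e-75 289.0
1-202 300392375 922 99.33 150 1 0 1 150 11 160 2e-75 289.0
1-202 300392374 931 99.33 150 1 0 1 150 21 170 2e-75 289.0
1-202 300392373 909 99.33 150 1 0 1 150 21 170 2e-75 289.0
"""

# Lengths found manually at NCBI, as quoted in the message (GI -> length).
NCBI_LENGTHS = {300392375: 918, 300392374: 922, 300392373: 931}

print(shifted_gis(parse_hits(ROWS), NCBI_LENGTHS))
# -> [300392376, 300392375, 300392374]
```

 The last row (300392373) is not flagged only because no NCBI length for
 300392372 is quoted in the message, so nothing can be said about it.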

 IV. Fourth, confirmed with another dataset
 In order to be sure that the