[galaxy-dev] Problem with the output of megablast

Sandrine Hughes Thu, 05 Jan 2012 04:02:11 -0800

Dear all,

I’m using the online version of Galaxy (I’m just a user, not a
developer) and I think there might be a problem with the output
of Megablast. I’ve noticed that the GI of the database hit
(given in column 2 of the output table) does not agree with the
details given in the other columns (notably, the length of the hit is
inaccurate). After some investigation (see below for all the details),
I think that in most cases the problem comes from a shift of -1
between the GI number and the associated details.
Can you confirm that there is a problem? Do you think it will be
possible to fix it?


Many thanks for your help, as we really need this step in our analysis process,

Sandrine





Here are the details:

I. First, what I did:
    I used the program to identify the species present in a mix of
unknown sequences, with the following options:
            Database nt 27-Jun-2011
            Word size 16
            Identity 90.0
            Cutoff 0.001
            Filter out low complexity regions Yes
    I ran the analysis twice and obtained exactly the same results (I
used the online version of Galaxy, not a local one).

II. Second, I analysed the data obtained for one of my sequences
(1-202). The following lines are the beginning of the table that I
obtained from megablast, followed by two problematic lines:

1-202  312182292  484     99.33  150  1  0  1  150  1      150    2e-75  289.0
1-202  312182201  476     99.33  150  1  0  1  150  1      150    2e-75  289.0
1-202  308228725  928     99.33  150  1  0  1  150  19     168    2e-75  289.0
1-202  308228711  938     99.33  150  1  0  1  150  22     171    2e-75  289.0
1-202  308197083  459     99.33  150  1  0  1  150  10     159    2e-75  289.0
1-202  300392378  920     99.33  150  1  0  1  150  10     159    2e-75  289.0
1-202  300392376  918     99.33  150  1  0  1  150  9      158    2e-75  289.0
1-202  300392375  922     99.33  150  1  0  1  150  11     160    2e-75  289.0
1-202  300392374  931     99.33  150  1  0  1  150  21     170    2e-75  289.0
1-202  300392373  909     99.33  150  1  0  1  150  21     170    2e-75  289.0
1-202  300392371  1172    99.33  150  1  0  1  150  9      158    2e-75  289.0
...
1-202  179366399  151762  98.67  150  2  0  1  150  46880  47029  6e-73  281.0
1-202  58617849   511     98.67  150  2  0  1  150  21     170    6e-73  281.0
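
For anyone who wants to inspect rows like these programmatically, here is a minimal Python sketch. It assumes each hit has 13 whitespace-separated values, as in the rows above, and interprets only the columns described in this message (column 2 is the GI, column 3 is the reported hit length). The names `Hit` and `parse_hits` are hypothetical helpers, not part of Galaxy or BLAST.

```python
from typing import Iterator, NamedTuple

FIELDS_PER_ROW = 13  # assumption: every hit has 13 values, as in the rows above


class Hit(NamedTuple):
    query: str
    gi: int               # column 2: GI of the database hit
    reported_length: int  # column 3: length of the database hit as reported
    other: tuple          # columns 4-13, kept as uninterpreted strings


def parse_hits(text: str) -> Iterator[Hit]:
    """Yield one Hit per 13 values, even if a mail client wrapped the rows."""
    tokens = [t for t in text.split() if t != "..."]  # skip the elision marker
    if len(tokens) % FIELDS_PER_ROW:
        raise ValueError("token count is not a multiple of 13")
    for i in range(0, len(tokens), FIELDS_PER_ROW):
        row = tokens[i:i + FIELDS_PER_ROW]
        yield Hit(row[0], int(row[1]), int(row[2]), tuple(row[3:]))


sample = """
1-202   312182292       484     99.33   150     1       0       1
 150     1       150     2e-75   289.0
...
1-202   58617849        511     98.67   150     2       0       1
 150     21      170     6e-73   281.0
"""

hits = list(parse_hits(sample))
for hit in hits:
    print(hit.gi, hit.reported_length)
```

Grouping the values in sets of 13, rather than reading line by line, means the same function works whether or not the rows have been wrapped in transit.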


III. Third, what I noticed:
    My first problem was that, among all the species identified, two
were very different from the expected ones (the last 2 lines). So I
decided to check whether that was possible for this sequence and
independently ran a megablast at NCBI with similar options.
I was not able to find these two species in the results.
    So I decided to check the hits listed in the table above and
found a second problem. In the table, the second column gives the
GI of the database hit and the third column gives the length of the
database hit. However, when I manually checked the length of
the GI at NCBI, the value in the table was incorrect. For example, for
GI 312182292, the length should be 580 and not 484.
    By checking different lines, I noticed that the length
given for a GI corresponds to the length of GI-1. As you can see
in the table above, some GIs are consecutive (300392376,
300392375,...). When checking the length of 300392376 in NCBI, I
should have 920. But when I checked 300392375, I found 918. And this
was true for the following lines: 300392374 normally gives 922 and
300392373 gives 931... My conclusion at that point was that there was a
shift of -1 between the GI and the other parameters of the line
(indeed, the values in the remaining columns agree with
the length of GI-1). However, that’s not always true... For some
GIs given in the table (for example, the last two lines), if we check
the parameters of GI-1, they are completely different.
So I suppose that something goes wrong with the GI sorting during the
megablast, but I’m not able to clearly define the problem.
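
The manual check described above can be sketched in a few lines of Python. This is only an illustration: the `reported` values are copied from the table, the `ncbi` values are a subset of the lengths quoted in the text, and the function name `classify` and its verdict labels are hypothetical.

```python
# Column 3 values reported by Galaxy for each GI (column 2), copied from the table.
reported = {
    312182292: 484,
    300392376: 918,
    300392375: 922,
    300392374: 931,
    300392373: 909,
    300392371: 1172,
}

# Lengths found by manual lookup at NCBI, as quoted in the text (subset).
ncbi = {
    312182292: 580,
    300392375: 918,
    300392374: 922,
    300392373: 931,
}


def classify(gi: int) -> str:
    """Compare the reported length of a GI with the known lengths of GI and GI-1."""
    length = reported[gi]
    if ncbi.get(gi) == length:
        return "consistent"
    if ncbi.get(gi - 1) == length:
        return "matches GI-1"
    if gi in ncbi:
        return "mismatch"   # true length known, and the reported one differs
    return "unchecked"      # no NCBI lookup available for GI or GI-1


results = {gi: classify(gi) for gi in reported}
for gi, verdict in results.items():
    print(gi, reported[gi], verdict)
```

With these values, the three consecutive GIs whose predecessor was looked up are flagged as matching GI-1, while the others would need further lookups to be classified.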

IV. Fourth, confirmed with another dataset
    To be sure that the problem was not linked to my data or
my process, I asked a colleague to run a megablast on independent data.
The conclusions were similar to mine: a shift between the GI given in the
table and the associated parameters, which most of the time, but not
always, correspond to GI-1.

