[galaxy-user] MegaBLAST output

2012-04-23 Thread Sarah Hicks
I am having trouble finding information on the MegaBLAST output
columns. What is each column for? I can't seem to figure this out by
comparing info in the columns to NCBI directly because the GI#'s don't
match with the correct entry on NCBI. I've seen that others have
posted about that problem, so I'm also waiting on details on that
question, but for now, I'd just like to know what to make of the
output...
best,
Sarah
___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-user] MegaBLAST output

2012-04-23 Thread Sarah Hicks
Thanks so much for the prompt reply. I don't mind using last year's
GenBank, as long as I am getting accurate hits. I just have a couple
more questions to confirm I am safe using the Galaxy pipeline for
this...
So if I continue to work within the 1-year-old database, can I
trust the output as accurate matches? Specifics about my project: I
have environmental samples that were sequenced for fungal ITS. I have
clustered these into OTUs, and chosen a representative sequence for
each. If I retrieve hits for this representative sequence file in my
sample, can I trust the hits as being the correct hits as of last
year? I'm just worried about what one person said, who thought
there were some column arrangement problems, because I'm finding that
I'm getting hits from different phyla for the same sequence using
default parameters in megablast...
Can I also assume, then, that I should NOT match my representative
sequence file to updated GI numbers using another pipeline, and then
bring the file of GI numbers to Galaxy to fetch taxonomic assignments?
(I would do this because of the nice neat columns Galaxy puts out for
each taxonomic level.)

Sarah

On Mon, Apr 23, 2012 at 2:26 PM, Jennifer Jackson j...@bx.psu.edu wrote:
 Hi Sarah,

 Peter defined the columns (thanks) but I can provide some information about
 the GenBank identifiers. The megablast databases on the public server are
 roughly a year old and there have been updates at NCBI since that time. As I
 understand it, this manifests as occasional mismatches between hits at
 Galaxy vs GenBank when comparing certain IDs linked to updated records.

 We are working to update these three databases, but there are some
 complicating factors around this processing specifically related to the
 public instance and the metagenomics workflow that have yet to be resolved.
 Please know that getting updated is a priority for us and we apologize for
 the inconvenience.

 To use the most current databases, a local or (better) cloud instance with
 either the regular or BLAST+ version of the tool and a database of your
 choice is the recommendation. Instructions to get started are at:
 getgalaxy.org
 getgalaxy.org/cloud

 Hopefully this explains the data mismatch. This question has come up before,
 but I think you are correct in that the final conclusion never was posted
 back to the galaxy-user list (for different reasons). So, thank you for
 asking so that we could send out a clear reply for everyone using the tool.

 Best,

 Jen
 Galaxy team


 On 4/23/12 9:56 AM, Sarah Hicks wrote:

 I am having trouble finding information on the MegaBLAST output
 columns. What is each column for? I can't seem to figure this out by
 comparing info in the columns to NCBI directly because the GI#'s don't
 match with the correct entry on NCBI. I've seen that others have
 posted about that problem, so I'm also waiting on details on that
 question, but for now, I'd just like to know what to make of the
 output...
 best,
 Sarah


 --
 Jennifer Jackson
 http://galaxyproject.org



Re: [galaxy-user] MegaBLAST output

2012-04-23 Thread Sarah Hicks
Peter, you requested an example. Here are the first five hits for my
first query sequence (OTU#0):

0   324034994   527    93.23   266   13   5   1   265   22   283   7e-102   379.0
0   56181650    513    93.26   267   10   8   1   265   25   285   7e-102   379.0
0   314913953   582    91.79   268   13   9   1   265   24   285   2e-92    347.0
0   305670062   281    92.52   254   14   5   4   256   32   281   2e-92    347.0
0   310814066   1180   91.73   266   14   7   1   265   24   282   9e-92    345.0

You will notice there are 13 columns, one more than the 12 column
titles you explained. This is because there is an extra column between
sseqid and pident. In the metagenomic tutorial the first 4 columns are
explained, and column 3 is described as length of sequence in database
(or length of the subject sequence).

This is the problem column. For only one of the subject GI numbers
above does the length match the subject length in NCBI. This has caused
me to wonder if I can trust the hit info. In all cases that I've
checked, when this happens the correct match is the listed GI value
minus 1 (i.e., in NCBI, gi|324034994 is not 527nt long, but 324034993
IS 527nt long).
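
For anyone who wants to inspect this 13-column layout programmatically, here is a minimal Python sketch. It assumes the column order described above, with the extra subject-length column in position 3. The label "slen", the function names, and the coordinate check are illustrative choices, not part of Galaxy or NCBI BLAST. The check only tests whether the subject alignment coordinates fit inside the reported subject length; it cannot tell whether the GI number itself is right, which still requires looking up the record at NCBI.

```python
# Labels for the 13-column Galaxy megablast output described in this thread.
# "slen" is a label chosen here for the extra third column (assumption);
# the other twelve follow the standard BLAST tabular names.
GALAXY_COLUMNS = [
    "qseqid", "sseqid", "slen", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
INT_COLUMNS = ("slen", "length", "mismatch", "gapopen",
               "qstart", "qend", "sstart", "send")
FLOAT_COLUMNS = ("pident", "evalue", "bitscore")


def parse_galaxy_row(line):
    """Split one whitespace-delimited hit line into a dict with typed values."""
    fields = line.split()
    if len(fields) != len(GALAXY_COLUMNS):
        raise ValueError(
            f"expected {len(GALAXY_COLUMNS)} columns, got {len(fields)}"
        )
    row = dict(zip(GALAXY_COLUMNS, fields))
    for name in INT_COLUMNS:
        row[name] = int(row[name])
    for name in FLOAT_COLUMNS:
        row[name] = float(row[name])
    return row


def alignment_fits_subject(row):
    """True if both subject alignment coordinates lie within the subject length."""
    return max(row["sstart"], row["send"]) <= row["slen"]


# First hit from the example above.
example = "0\t324034994\t527\t93.23\t266\t13\t5\t1\t265\t22\t283\t7e-102\t379.0"
row = parse_galaxy_row(example)
print(row["sseqid"], row["slen"], alignment_fits_subject(row))
```

A row that fails the column-count check (for example, one where two fields have run together) raises ValueError rather than being silently misread.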



On Mon, Apr 23, 2012 at 11:05 AM, Peter Cock p.j.a.c...@googlemail.com wrote:
 On Mon, Apr 23, 2012 at 5:56 PM, Sarah Hicks garlicsc...@gmail.com wrote:
 I am having trouble finding information on the MegaBLAST output
 columns. What is each column for? I can't seem to figure this out by
 comparing info in the columns to NCBI directly because the GI#'s don't
 match with the correct entry on NCBI. I've seen that others have
 posted about that problem, so I'm also waiting on details on that
 question, but for now, I'd just like to know what to make of the
 output...
 best,
 Sarah

 I've not tried to track down this reported possible bug in GI numbers,
 or whether it also affects BLAST+ as well as the legacy NCBI BLAST
 (which has now been discontinued). Do you have a specific example?

 As to the 12 columns, they are standard BLAST tabular output, and
 should match the defaults in BLAST+ tabular output which are:

 Column  NCBI name  Description
 1       qseqid     Query Seq-id (ID of your sequence)
 2       sseqid     Subject Seq-id (ID of the database hit)
 3       pident     Percentage of identical matches
 4       length     Alignment length
 5       mismatch   Number of mismatches
 6       gapopen    Number of gap openings
 7       qstart     Start of alignment in query
 8       qend       End of alignment in query
 9       sstart     Start of alignment in subject (database hit)
 10      send       End of alignment in subject (database hit)
 11      evalue     Expectation value (E-value)
 12      bitscore   Bit score

 Peter
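
As a rough illustration of the 12 standard columns listed above, the following Python sketch reads tab-separated BLAST tabular output into dictionaries keyed by the NCBI column names. The function name and the sample line are assumptions made for this example; the sample is the first hit from this thread with the extra Galaxy length column left out. All values are kept as strings.

```python
import csv
import io

# The 12 default BLAST+ tabular column names, in order.
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def read_blast_tabular(handle):
    """Read tab-separated 12-column BLAST output into a list of dicts."""
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    return list(reader)


# Hypothetical 12-column line built from the first hit in this thread.
sample = io.StringIO(
    "0\t324034994\t93.23\t266\t13\t5\t1\t265\t22\t283\t7e-102\t379.0\n"
)
hits = read_blast_tabular(sample)
print(hits[0]["sseqid"], hits[0]["pident"], hits[0]["bitscore"])
```

To read a real file, pass an open file handle in place of the StringIO object.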
