Re: [galaxy-user] Megablast database identity

2014-04-28 Thread Jennifer Jackson

Hello,

The Megablast htgs, nt, and wgs databases are in the process of being 
updated to the latest NCBI releases and are expected to be available by 
tomorrow morning (possibly sooner).


Should you wish to continue your analysis using the prior versions, 
these are available through our rsync server for use in a local 
production or cloud Galaxy.


https://wiki.galaxyproject.org/Admin/UseGalaxyRsync
http://getgalaxy.org
http://usegalaxy.org/cloud

Also posted to Galaxy Biostar: https://biostar.usegalaxy.org/p/7335/#7340

Best,
Jen
Galaxy team



On 4/27/14 4:25 AM, Scott W. Tighe wrote:

Jennifer

I am megablasting a simple 500,000 line dataset that is certainly in 
galaxy fasta.


For a week I have been seeing numerous errors, so I have reprocessed 
the data multiple times.


The error message is "could not find specified database directory".

Is there an alternative approach? I did try NCBI directly and that 
failed too. They say they have a database issue.



Scott Tighe


Core Laboratory Research Staff
Advanced Genome Technologies Core
Deep Sequencing (MPS) Facility
Vermont Cancer Center
149 Beaumont Ave
University of Vermont HSRF 303
Burlington Vermont  USA 05045
802-656-AGTC
802-999- (cell)





___
The Galaxy User List is being replaced by the Galaxy Biostar
User Support Forum at https://biostar.usegalaxy.org/

Posts to this list will be disabled in May 2014.  In the
meantime, you are encouraged to post all new questions to
Galaxy Biostar.

For discussion of local Galaxy instances and the Galaxy
source code, please use the Galaxy Development list:

 http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

 http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:

 http://galaxyproject.org/search/mailinglists/


--
Jennifer Hillman-Jackson
http://galaxyproject.org


[galaxy-user] Megablast database identity

2014-04-27 Thread Scott W. Tighe

Jennifer

I am megablasting a simple 500,000 line dataset that is certainly in  
galaxy fasta.


For a week I have been seeing numerous errors, so I have reprocessed 
the data multiple times.


The error message is "could not find specified database directory".

Is there an alternative approach? I did try NCBI directly and that 
failed too. They say they have a database issue.



Scott Tighe


Core Laboratory Research Staff
Advanced Genome Technologies Core
Deep Sequencing (MPS) Facility
Vermont Cancer Center
149 Beaumont Ave
University of Vermont HSRF 303
Burlington Vermont  USA 05045
802-656-AGTC
802-999- (cell)







Re: [galaxy-user] MegaBLAST output

2012-04-25 Thread Sandrine Hughes
Dear all,

Some time ago, I reported on this list the same problem with
megablast that Sarah mentioned. I finally analysed my data another
way, but my conclusion was similar to Sarah's: most of the time there
is a shift of « -1 » between the GI number in the output and the
parameters in the other columns.
One possible explanation was a change in the GI numbers between
versions of the databank. However, I'm not sure that this explanation
is the right one, and I'm not sure about the consistency of this
« GI-1 » rule.

Indeed, I’ve made a simple test by using a known and well identified
sequence taken at random in Genbank (GI 2924630 in NCBI, AB002412.1,
Elephas maximus mitochondrial DNA for cytochrome b, 1137 bp, so called
TEST-SEQ below). Using this sequence as template for megablast in
Galaxy gave the following results (3 first lines):

TEST-SEQ 2924736 1137 100.00 1137 0 0 1 1137 1 1137 0.0 2254.0
TEST-SEQ 155573765 16831 99.91 1137 1 0 1 1137 14149 15285 0.0 2246.0
TEST-SEQ 2924608 1137 99.74 1137 3 0 1 1137 1 1137 0.0 2230.0

As you can see, the parameters for the first match seem correct:
1137 bp and 100% identity. However, the GI number is not the correct
one (2924736 instead of 2924630), and the « GI-1 » rule does not hold
here. I used the Fetch Taxonomic Representation command available in
Galaxy to fetch the taxonomy for the 3 sequences above, and 2 out of 3
GI numbers correspond to very distant taxa…
I ran a megablast at NCBI (where the database is more recent) to
compare, and here are the results:

TEST-SEQ gi|2924630|dbj|AB002412.1| 100.00 1137 0 0 1 1137 1 1137 0.0 2100
TEST-SEQ gi|37496447|emb|AJ428946.1| 99.91 1137 1 0 1 1137 14149 15285 0.0 2095
TEST-SEQ gi|2924606|dbj|D50844.1| 99.74 1137 3 0 1 1137 1 1137 0.0 2084

At this stage, I would say that the output table of the megablast in
Galaxy is correct for all parameters except the GI of the database
hit (column 2)… and I am not able to say why. I'm not sure whether this
helps… I hope that you will be able to clarify this possible problem
in the megablast output.
Best,

Sandrine


2012/4/24 Sarah Hicks garlicsc...@gmail.com:
 So, Jen, I'm not sure if we're talking about the same ID change... I
 am under the impression that GenBank does not change its GI numbers
 for its entries. Plus, it's now looking like all sequence length
 output info for each hit through Galaxy's megablast does not match
 the GI number output given by Galaxy megablast, but the GI number
 before it. Because the -1 rule is so consistent, it makes this seem
 less and less like it has to do with NCBI changing its GI numbers to
 make room for new entries or something. In other words, there is a
 shift, as if a 1 was added to each NCBI GI number in Galaxy before
 Galaxy produces the output file.

 I need someone to tell me if I can trust the output. Basically I see
 it this way. Every hit row from Galaxy megablast actually has
 information for two NCBI entries: the one that shares the GI output
 and the one before it that shares the sequence length output. Which
 one is the hit I should be using?
 Because on some occasions, the NCBI entry that shares the GI output
 from Galaxy is VERY distantly related to the NCBI entry that shares
 the subject sequence length output from Galaxy, and I don't know which
 to pick. Is this problem well understood yet?


 On Tue, Apr 24, 2012 at 10:52 AM, Peter Cock p.j.a.c...@googlemail.com 
 wrote:
 On Mon, Apr 23, 2012 at 11:41 PM, Sarah Hicks garlicsc...@gmail.com wrote:
 Peter, you requested an example, here are the first five hits for my
 first query sequence (OTU#0)

 0   324034994   527    93.23   266   13   5   1   265   22   283   7e-102   379.0
 0   56181650    513    93.26   267   10   8   1   265   25   285   7e-102   379.0
 0   314913953   582    91.79   268   13   9   1   265   24   285   2e-92    347.0
 0   305670062   281    92.52   254   14   5   4   256   32   281   2e-92    347.0
 0   310814066   1180   91.73   266   14   7   1   265   24   282   9e-92    345.0

 You will notice there are 13 columns, one in addition to the 12 column
 titles you explained. This is because there is a column between sseqID
 and pident.

 I see now - the megablast_wrapper.py calls megablast (from the old legacy
 NCBI blast suite) which does indeed produce 12 column tabular output. But
 the wrapper script then edits the output:

 It appears to be splitting column 2 in two at the underscore intended to
 give the match ID and the length. This puzzles me but I haven't used
 the legacy BLAST tabular output for a while. On BLAST+ you can ask
 for the query or subject length explicitly as their own columns so we
 don't have this problem.

 The megablast_wrapper.py also re-formats the floating point score in the
 last column, apparently the 

Re: [galaxy-user] MegaBLAST output

2012-04-24 Thread Jennifer Jackson

Hi Sarah,

We appreciate all of the information you have provided and have been 
working here since yesterday to investigate the issue in more detail. 
This includes incorporating the additional data both you and Peter have 
been posting.


We don't have anything conclusive to report yet, but it would have been 
considerate to send an update this morning to let you know what we were 
doing. Please accept my apologies for not doing so - we are in fact in 
complete agreement that as the data currently presents, something odd 
appears to be going on.  Genbank updates would be unrelated as gi 
numbers do not change through time (although they can be retired, but 
again, not related to this case). The question of the mismatch in the 
wrapped Megablast output between gi and reported length is the open 
issue to be addressed.


A reply will be sent as soon as the root cause is determined. If there 
is indeed a problem, this would of course be considered a priority to 
correct. Not that we are expecting delays, but if your analysis is very 
urgent, using the BLAST+ BLASTN megablast wrapper that Peter authored, 
in a local or cloud instance, would be the best immediate remedy (this 
version has the standard 12 column output). Sequence length data could 
always be obtained from Genbank and added into these results using other 
Galaxy tools (column join, etc.).


Thank you and Peter both for the help and for your patience!

Best,

Jen
Galaxy team



On 4/24/12 1:50 PM, Sarah Hicks wrote:

So, Jen, I'm not sure if we're talking about the same ID change... I
am under the impression that GenBank does not change its GI numbers
for its entries. Plus, it's now looking like all sequence length
output info for each hit through Galaxy's megablast does not match
the GI number output given by Galaxy megablast, but the GI number
before it. Because the -1 rule is so consistent, it makes this seem
less and less like it has to do with NCBI changing its GI numbers to
make room for new entries or something. In other words, there is a
shift, as if a 1 was added to each NCBI GI number in Galaxy before
Galaxy produces the output file.

I need someone to tell me if I can trust the output. Basically I see
it this way. Every hit row from Galaxy megablast actually has
information for two NCBI entries: the one that shares the GI output
and the one before it that shares the sequence length output. Which
one is the hit I should be using?
Because on some occasions, the NCBI entry that shares the GI output
from Galaxy is VERY distantly related to the NCBI entry that shares
the subject sequence length output from Galaxy, and I don't know which
to pick. Is this problem well understood yet?


On Tue, Apr 24, 2012 at 10:52 AM, Peter Cock p.j.a.c...@googlemail.com wrote:

On Mon, Apr 23, 2012 at 11:41 PM, Sarah Hicks garlicsc...@gmail.com wrote:

Peter, you requested an example, here are the first five hits for my
first query sequence (OTU#0)

0   324034994   527    93.23   266   13   5   1   265   22   283   7e-102   379.0
0   56181650    513    93.26   267   10   8   1   265   25   285   7e-102   379.0
0   314913953   582    91.79   268   13   9   1   265   24   285   2e-92    347.0
0   305670062   281    92.52   254   14   5   4   256   32   281   2e-92    347.0
0   310814066   1180   91.73   266   14   7   1   265   24   282   9e-92    345.0

You will notice there are 13 columns, one in addition to the 12 column
titles you explained. This is because there is a column between sseqID
and pident.


I see now - the megablast_wrapper.py calls megablast (from the old legacy
NCBI blast suite) which does indeed produce 12 column tabular output. But
the wrapper script then edits the output:

It appears to be splitting column 2 in two at the underscore intended to
give the match ID and the length. This puzzles me but I haven't used
the legacy BLAST tabular output for a while. On BLAST+ you can ask
for the query or subject length explicitly as their own columns so we
don't have this problem.

The megablast_wrapper.py also re-formats the floating point score in the
last column, apparently the NCBI style could cause problems with the
Galaxy filter tool.
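
To make the column-splitting step described above concrete, here is a minimal Python sketch. It is not the actual megablast_wrapper.py code; the function name split_subject_column and the "<id>_<length>" form of column 2 are assumptions based only on Peter's description, and the sample row is built by hand from the first hit Sarah posted.

```python
def split_subject_column(fields):
    """Turn 12 fields into 13 by splitting field 2 at its last underscore.

    Assumes (hypothetically) that field 2 looks like "<id>_<length>".
    """
    subject_id, _, subject_len = fields[1].rpartition("_")
    if not subject_id or not subject_len.isdigit():
        raise ValueError("column 2 is not of the form <id>_<length>: %r" % fields[1])
    return [fields[0], subject_id, subject_len] + fields[2:]


# Hand-built 12-column row; the "324034994_527" value is an assumed input form.
raw = "0 324034994_527 93.23 266 13 5 1 265 22 283 7e-102 379.0".split()
row = split_subject_column(raw)
print(len(row), row[1], row[2])
```

If the ID and the length stored in the database entry were paired wrongly before this step, a split like this would carry the mismatch straight into the output, which is one way the symptom reported in this thread could arise. That is a guess, not a diagnosis.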


In the metagenomic tutorial the first 4 columns are
explained, and column 3 is described as length of sequence in database
(or length of the subject sequence).

This is the problem column. The length of only one of the subject GI
numbers above match the subject length in NCBI. This has caused me to
wonder if I can trust the hit info. In all cases that I've checked,
when this happens the correct match is the listed GI value minus 1
(ie, in NCBI, gi|324034994 is not 527nt long, but 324034993 IS 527nt
long).


That is strange.

Peter




--
Jennifer Jackson
http://galaxyproject.org

Re: [galaxy-user] MegaBLAST output

2012-04-24 Thread Peter Cock
On Tue, Apr 24, 2012 at 10:24 PM, Jennifer Jackson j...@bx.psu.edu wrote:

 ..., using
 the BLAST+ BLASTN megablast wrapper that Peter authored, in a local or cloud
 instance, would be the best immediate remedy (this version has the standard
 12 column output). Sequence length data could always be obtained from
 Genbank and added into these results using other Galaxy tools (column join,
 etc.).

Getting the query and match sequence lengths is even simpler than
that with the BLAST+ wrappers - just select the extended tabular output.
Of course, you'll need to adjust the downstream analysis to take into
account the different column numbers.

Peter
___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-user] MegaBLAST output

2012-04-24 Thread Jennifer Jackson

Thanks Peter,

Excellent point. From there, the Cut tool could be used to reorganize 
the output to exactly match that of the 13-column regular megablast 
output. So, no external data needed, no tool modifications needed.
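
For anyone who would rather script the reordering than use the Cut tool, the idea can be sketched in Python as below. The function name to_13_column is made up for illustration, and the position of the subject length in the extended output is an assumption: check the column order your BLAST+ wrapper actually emits and pass the right index. The sample row reuses the NCBI hit for AB002412.1 (1137 bp) with two hypothetical extra columns, qlen and slen, appended.

```python
def to_13_column(fields, slen_index):
    """Return the 12 standard columns with the subject length inserted as column 3.

    slen_index is the 0-based position of the subject length in the
    extended row; it is an assumption the caller must verify.
    """
    if slen_index < 12 or slen_index >= len(fields):
        raise ValueError("slen_index %d is outside the extra columns" % slen_index)
    return fields[:2] + [fields[slen_index]] + fields[2:12]


# Hypothetical extended row: 12 standard columns followed by qlen and slen.
extended = (
    "TEST-SEQ gi|2924630|dbj|AB002412.1| 100.00 1137 0 0 1 1137 1 1137 0.0 2100 "
    "1137 1137"
).split()
row = to_13_column(extended, slen_index=13)
print("\t".join(row))
```

The result has the same layout as the regular megablast output on the public server (query, subject ID, subject length, then the remaining ten standard columns), so downstream steps that expect 13 columns should not need changes.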


This can't be done on the main public Galaxy instance as BLAST+ is not 
available there, but for any local/cloud instance this is an alternative 
certainly worth testing.


Best,

Jen
Galaxy team

On 4/24/12 2:36 PM, Peter Cock wrote:

On Tue, Apr 24, 2012 at 10:24 PM, Jennifer Jackson j...@bx.psu.edu wrote:


..., using
the BLAST+ BLASTN megablast wrapper that Peter authored, in a local or cloud
instance, would be the best immediate remedy (this version has the standard
12 column output). Sequence length data could always be obtained from
Genbank and added into these results using other Galaxy tools (column join,
etc.).


Getting the query and match sequence lengths is even simpler than
that with the BLAST+ wrappers - just select the extended tabular output.
Of course, you'll need to adjust the downstream analysis to take into
account the different column numbers.

Peter


--
Jennifer Jackson
http://galaxyproject.org


[galaxy-user] MegaBLAST output

2012-04-23 Thread Sarah Hicks
I am having trouble finding information on the MegaBLAST output
columns. What is each column for? I can't seem to figure this out by
comparing info in the columns to NCBI directly because the GI#'s don't
match with the correct entry on NCBI. I've seen that others have
posted about that problem, so I'm also waiting on details on that
question, but for now, I'd just like to know what to make of the
output...
best,
Sarah


Re: [galaxy-user] MegaBLAST output

2012-04-23 Thread Jennifer Jackson

Hi Sarah,

Peter defined the columns (thanks) but I can provide some information 
about the GenBank identifiers. The megablast databases on the public 
server are roughly a year old and there have been updates at NCBI since 
that time. As I understand it, this manifests as occasional mismatches 
between hits at Galaxy vs Genbank when comparing certain IDs linked to 
updated records.


We are working to update these three databases, but there are some 
complicating factors around this processing specifically related to the 
public instance and the metagenomics workflow that have yet to be 
resolved. Please know that getting updated is a priority for us and we 
apologize for the inconvenience.


To use the most current databases, a local or (better) cloud instance 
with either the regular or BLAST+ version of the tool and a database 
of your choice is the recommendation. Instructions to get started are at:

getgalaxy.org
getgalaxy.org/cloud

Hopefully this explains the data mismatch. This question has come up 
before, but I think you are correct in that the final conclusion never 
was posted back to the galaxy-user list (for different reasons). So, 
thank you for asking so that we could send out a clear reply for 
everyone using the tool.


Best,

Jen
Galaxy team

On 4/23/12 9:56 AM, Sarah Hicks wrote:

I am having trouble finding information on the MegaBLAST output
columns. What is each column for? I can't seem to figure this out by
comparing info in the columns to NCBI directly because the GI#'s don't
match with the correct entry on NCBI. I've seen that others have
posted about that problem, so I'm also waiting on details on that
question, but for now, I'd just like to know what to make of the
output...
best,
Sarah


--
Jennifer Jackson
http://galaxyproject.org


Re: [galaxy-user] MegaBLAST output

2012-04-23 Thread Sarah Hicks
Thanks so much for the prompt reply. I don't mind using last year's
GenBank, as long as I am getting accurate hits. I just have a couple
more questions to confirm I am safe using the Galaxy pipeline for
this...
So if I continue to work within the 1 year old database, can I
trust the output as accurate matches? Specifics about my project: I
have environmental samples that were sequenced for fungal ITS. I have
clustered these into OTUs, and chosen a representative sequence for
each. If I retrieve hits for this representative sequence file in my
sample, can I trust the hits as being the correct hits as of last
year? I'm just worried about what that one person said who thought
there were some column arrangement problems, because I'm finding that
I'm getting hits from different phyla for the same sequence using
default parameters in megablast...
Can I also assume, then, that I should NOT identify my representative
sequence file to updated GI numbers using another pipeline, and then
bring the file of GI numbers to Galaxy to fetch taxonomic assignments?
(which I would do because of the nice neat columns for each taxonomic
level Galaxy puts out)

Sarah

On Mon, Apr 23, 2012 at 2:26 PM, Jennifer Jackson j...@bx.psu.edu wrote:
 Hi Sarah,

 Peter defined the columns (thanks) but I can provide some information about
 the GenBank identifiers. The megablast databases on the public server are
 roughly a year old and there have been updates at NCBI since that time. As I
 understand it, this manifests as occasional mismatches between hits at
 Galaxy vs Genbank when comparing certain IDs linked to updated records.

 We are working to update these three databases, but there are some
 complicating factors around this processing specifically related to the
 public instance and the metagenomics workflow that have yet to be resolved.
 Please know that getting updated is a priority for us and we apologize for
 the inconvenience.

 To use the most current databases, a local or (better) cloud instance with
 either the regular or BLAST+ version of the tool and a database of your choice
 is the recommendation. Instructions to get started are at:
 getgalaxy.org
 getgalaxy.org/cloud

 Hopefully this explains the data mismatch. This question has come up before,
 but I think you are correct in that the final conclusion never was posted
 back to the galaxy-user list (for different reasons). So, thank you for
 asking so that we could send out a clear reply for everyone using the tool.

 Best,

 Jen
 Galaxy team


 On 4/23/12 9:56 AM, Sarah Hicks wrote:

 I am having trouble finding information on the MegaBLAST output
 columns. What is each column for? I can't seem to figure this out by
 comparing info in the columns to NCBI directly because the GI#'s don't
 match with the correct entry on NCBI. I've seen that others have
 posted about that problem, so I'm also waiting on details on that
 question, but for now, I'd just like to know what to make of the
 output...
 best,
 Sarah


 --
 Jennifer Jackson
 http://galaxyproject.org



Re: [galaxy-user] MegaBLAST output

2012-04-23 Thread Sarah Hicks
Peter, you requested an example, here are the first five hits for my
first query sequence (OTU#0)

0   324034994   527    93.23   266   13   5   1   265   22   283   7e-102   379.0
0   56181650    513    93.26   267   10   8   1   265   25   285   7e-102   379.0
0   314913953   582    91.79   268   13   9   1   265   24   285   2e-92    347.0
0   305670062   281    92.52   254   14   5   4   256   32   281   2e-92    347.0
0   310814066   1180   91.73   266   14   7   1   265   24   282   9e-92    345.0

You will notice there are 13 columns, one in addition to the 12 column
titles you explained. This is because there is a column between sseqID
and pident. In the metagenomic tutorial the first 4 columns are
explained, and column 3 is described as length of sequence in database
(or length of the subject sequence).

This is the problem column. The length of only one of the subject GI
numbers above match the subject length in NCBI. This has caused me to
wonder if I can trust the hit info. In all cases that I've checked,
when this happens the correct match is the listed GI value minus 1
(ie, in NCBI, gi|324034994 is not 527nt long, but 324034993 IS 527nt
long).



On Mon, Apr 23, 2012 at 11:05 AM, Peter Cock p.j.a.c...@googlemail.com wrote:
 On Mon, Apr 23, 2012 at 5:56 PM, Sarah Hicks garlicsc...@gmail.com wrote:
 I am having trouble finding information on the MegaBLAST output
 columns. What is each column for? I can't seem to figure this out by
 comparing info in the columns to NCBI directly because the GI#'s don't
 match with the correct entry on NCBI. I've seen that others have
 posted about that problem, so I'm also waiting on details on that
 question, but for now, I'd just like to know what to make of the
 output...
 best,
 Sarah

 I've not tried to track down this reported possible bug in GI numbers,
 and whether it also affects BLAST+ as well as the legacy NCBI BLAST
 (which has now been discontinued). Do you have a specific example?

 As to the 12 columns, they are standard BLAST tabular output, and
 should match the defaults in BLAST+ tabular output which are:

 Column  NCBI name  Description
 1       qseqid     Query Seq-id (ID of your sequence)
 2       sseqid     Subject Seq-id (ID of the database hit)
 3       pident     Percentage of identical matches
 4       length     Alignment length
 5       mismatch   Number of mismatches
 6       gapopen    Number of gap openings
 7       qstart     Start of alignment in query
 8       qend       End of alignment in query
 9       sstart     Start of alignment in subject (database hit)
 10      send       End of alignment in subject (database hit)
 11      evalue     Expectation value (E-value)
 12      bitscore   Bit score

 Peter
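
As an illustration of the 12-column layout listed above, a small Python sketch that reads one tabular line into named fields might look like this. The names Hit and parse_hit are invented for the example; the column names come from the table, and the sample line is the first NCBI hit posted elsewhere in this thread. It assumes whitespace-separated columns and will reject the 13-column Galaxy megablast output.

```python
from collections import namedtuple

# Column names follow the 12-column table above (BLAST+ default tabular output).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]
Hit = namedtuple("Hit", BLAST_COLUMNS)


def parse_hit(line):
    """Parse one whitespace-separated 12-column tabular line into a Hit."""
    fields = line.split()
    if len(fields) != len(BLAST_COLUMNS):
        raise ValueError("expected 12 columns, got %d" % len(fields))
    return Hit(
        qseqid=fields[0], sseqid=fields[1],
        pident=float(fields[2]),
        length=int(fields[3]), mismatch=int(fields[4]), gapopen=int(fields[5]),
        qstart=int(fields[6]), qend=int(fields[7]),
        sstart=int(fields[8]), send=int(fields[9]),
        evalue=float(fields[10]), bitscore=float(fields[11]),
    )


hit = parse_hit(
    "TEST-SEQ gi|2924630|dbj|AB002412.1| 100.00 1137 0 0 1 1137 1 1137 0.0 2100"
)
print(hit.sseqid, hit.pident, hit.length)
```

Named fields make it harder to read the wrong column by accident when the output has an extra column, as the Galaxy megablast output does.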



Re: [galaxy-user] Megablast question

2012-04-11 Thread Jennifer Jackson

Hi Vasu,

The three primary megablast databases available on the public main 
Galaxy instance are comprised of individual fragments/sequences of 
different types from many species (not assembled genomes):

http://user.list.galaxyproject.org/Question-about-megablast-td4543260.html

If you want to use megablast to map against specific assembled genomes, 
then using a local or (better) cloud instance is recommended. In your 
own instance, the individual genomes would be set up the way that the 
'phiX174' is set up on main.

To get started, please see: http://getgalaxy.org

Does this address your question? If not, perhaps you could explain more 
what your goal is and we can try to offer help or confirm that this is 
the best path?


Regards,

Jen
Galaxy team

On 4/10/12 11:04 AM, shamsher jagat wrote:

Hi,
I am using megablast and was wondering how can I get chromosome number
and coordinates of its hits.

Thanks

Shamesher




[galaxy-user] Megablast question

2012-04-10 Thread shamsher jagat
Hi,
I am using megablast and was wondering how can I get chromosome number
and coordinates of its hits.

Thanks

Shamesher

Re: [galaxy-user] Megablast

2012-02-21 Thread Jennifer Jackson

Hello Scott,

For #1, option -p:

Here is a link to some megablast parameter documentation online:
http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/megablast.html#3
(the primary paper for the Galaxy tool is noted at the bottom of the 
tool form, but this is convenient)


Quote:

Table 3.30 Parameter -p
Function      Specifies the percentage identity cut-off
Default       0
Input format  [Real]
Example       To set percent id cutoff to 75%, use: -p 75
Note: The input value range is between 0 and 100, with 0 meaning no 
cutoff. It only works on the aligned region or individual HSPs.
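
If the search has already been run without a cutoff, a similar restriction can be applied afterwards to the tabular output. The Python sketch below is only an illustration of that idea, not a replacement for -p: the function name filter_by_identity is made up, and the default pident_index=3 assumes the 13-column Galaxy megablast layout (use 2 for standard 12-column output). The rows are the three Galaxy hits posted elsewhere in this thread.

```python
def filter_by_identity(lines, min_pident, pident_index=3):
    """Keep tabular hit lines whose percent identity is at least min_pident.

    pident_index is the 0-based column holding percent identity; 3 is an
    assumption matching the 13-column Galaxy megablast output.
    """
    kept = []
    for line in lines:
        fields = line.split()
        if float(fields[pident_index]) >= min_pident:
            kept.append(line)
    return kept


rows = [
    "TEST-SEQ 2924736 1137 100.00 1137 0 0 1 1137 1 1137 0.0 2254.0",
    "TEST-SEQ 155573765 16831 99.91 1137 1 0 1 1137 14149 15285 0.0 2246.0",
    "TEST-SEQ 2924608 1137 99.74 1137 3 0 1 1137 1 1137 0.0 2230.0",
]
perfect = filter_by_identity(rows, 100.0)
print(len(perfect))
```

As the note above says, identity is computed over the aligned region only, so a hit that passes a 100% cutoff may still cover just part of the query.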


For #2, there are a few ways to interpret "filter". If you mean whether
megablast considers the adapter part of the sequence in its calculations,
the answer is that it does for some statistics and doesn't for others.
The adapter portion of a read wouldn't align to the genome, and percent
identity is based only on HSPs (high-scoring segment pairs: one part of
the pair is the DNA query and the other is the genome target, for that
alignment region only). So, adapter sequence wouldn't be involved in
percent identity calculations (or be expected to!). But these unaligned
regions could become a problem if coverage or certain other statistics
were part of your analysis. Learning about the statistics you choose to
use, to see whether query length is part of the calculation, will let you
know if clipping is necessary. If it is important, removing adapters can
be done with tools in NGS: QC and manipulation (perform a tool search on
the keywords "trim" or "clip").
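
To make the difference concrete, here is a small worked example. It is a simplified sketch: the read length, the adapter length, and the two formulas (ungapped percent identity and query coverage) are assumptions chosen for illustration, not values or definitions taken from megablast output.

```python
def percent_identity(matches, alignment_length):
    """Percent identity over the aligned region (the HSP) only."""
    return 100.0 * matches / alignment_length


def query_coverage(alignment_length, query_length):
    """Percent of the whole query that falls inside the alignment."""
    return 100.0 * alignment_length / query_length


# Hypothetical 100-base read: 20 bases of adapter plus 80 genomic bases.
# Assume the 80 genomic bases align perfectly, with no gaps.
read_length = 100
aligned_length = 80
matches = 80

identity = percent_identity(matches, aligned_length)
coverage = query_coverage(aligned_length, read_length)
print(identity, coverage)
```

Under these assumptions the hit still counts as a 100% identity match, while only 80% of the read is covered. A statistic that divides by query length is the kind that would be affected by leaving the adapter in place.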



Best,

Jen
Galaxy team

On 2/20/12 4:59 PM, Scott Tighe wrote:

Hi Galaxy users

When megablasting:

1) What does the identity value -p mean? Is it percent identity?
I want my megablast results to be reported from only a 100% match. I do
not see a place for % alignment concordance.
2) From my Illumina HiSeq reads, are the adaptor sequences filtered
during the filter step?

Scott Tighe

--
Scott Tighe
Advanced Genome Technology Lab
Vermont Cancer Center at the University of Vermont
149 Beaumont Avenue
Health Science Research Bd RM 305
Burlington Vermont USA 05405
lab  802-656-AGTC (2482)
cell 802-999-




--
Jennifer Jackson
http://usegalaxy.org
http://galaxyproject.org/wiki/Support

Re: [galaxy-user] Megablast question

2012-02-16 Thread Noa Sher

  
  
Hi Scott
I have never used megablast, so what I am writing is true of any
FASTA file (if there is anything quirky in megablast that I
don't know about, apologies!):

  • Take your FASTA file and convert it to tabular (under "fasta
    manipulation"; this puts each record on one line).
  • Then randomly choose whatever number of reads you want using
    "select random lines from a file" under the text manipulation
    tab.
  • Then convert the tabular file back to FASTA (under the fasta
    manipulation tab).
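
Outside Galaxy, the same three steps could be sketched roughly as follows. This is a minimal illustration, assuming the whole file fits in memory; the function names read_fasta, subsample and to_fasta and the sample sequences are made up for the example and are not Galaxy tools.

```python
import random


def read_fasta(lines):
    """Yield (header, sequence) pairs, joining multi-line sequences."""
    header, seq = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line, []
        elif line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def subsample(records, n, seed=0):
    """Pick n records at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(list(records), n)


def to_fasta(records):
    """Write the records back out, one sequence line per record."""
    return "".join(f"{header}\n{seq}\n" for header, seq in records)


# Hypothetical four-read input; the first record spans two lines.
fasta_text = ">r1\nACGT\nACGT\n>r2\nGGCC\n>r3\nTTAA\n>r4\nCCGG\n"
records = list(read_fasta(fasta_text.splitlines()))
subset = subsample(records, 2)
print(to_fasta(subset), end="")
```

Putting each record on one line first is what makes line-based random selection safe: a header can never be separated from its sequence.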
  

noa
On 16/02/2012 19:31, Scott Tighe wrote:

  
  Hi all

When using Galaxy megablast, is there a simple way to reduce my
FASTA files from 23 million reads to 1/2 that size and submit to
megablast separately?

Thanks
  
  -- 
Scott Tighe
Advanced Genome Technology Lab
Vermont Cancer Center at the University of Vermont
149 Beaumont Avenue
Health Science Research Bd RM 305
Burlington Vermont USA 05405
lab  802-656-AGTC (2482)
cell 802-999-

  
  
  

Re: [galaxy-user] Megablast question

2012-02-16 Thread Dannon Baker
Noa has the right idea, but if you're asking how to split a dataset into 
two non-overlapping halves, you'll want to use Select First and Select Last 
instead of random lines.  Get an accurate line count from your file using the 
Line/Word/Character count tool and then split it right in the middle using 
Select First/Last.
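
As a rough sketch of the same idea outside Galaxy: count the records, then take the first half and the remainder. It assumes one record per line (for example, the tabular form of the FASTA file), since cutting a raw multi-line FASTA file at its middle line could separate a header from its sequence. The function name split_in_half and the sample lines are hypothetical.

```python
def split_in_half(records):
    """Return two non-overlapping halves; the first gets the extra record."""
    records = list(records)
    middle = (len(records) + 1) // 2
    return records[:middle], records[middle:]


# Hypothetical tabular lines: one read per line, id and sequence.
lines = ["r1\tACGT", "r2\tGGCC", "r3\tTTAA", "r4\tCCGG", "r5\tAAAA"]
first_half, last_half = split_in_half(lines)
print(len(first_half), len(last_half))
```

Every record lands in exactly one of the two halves, so the two megablast runs together cover the full dataset without repeats.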

-Dannon

On Feb 16, 2012, at 2:35 PM, Noa Sher wrote:

 Hi Scott
 I have never used megablast, so what I am writing is true of any FASTA file 
 (if there is anything quirky in megablast that I don't know about, 
 apologies!):
   • Take your FASTA file and convert it to tabular (under fasta 
 manipulation; this puts each record on one line).
   • Then randomly choose whatever number of reads you want using select 
 random lines from a file under the text manipulation tab.
   • Then convert the tabular file back to FASTA (under the fasta 
 manipulation tab).
 noa
 On 16/02/2012 19:31, Scott Tighe wrote:
 Hi all
 
 When using Galaxy megablast, is there a simple way to reduce my FASTA files 
 from 23 million reads to 1/2 that size and submit to megablast separately?
 
 Thanks
 -- 
 Scott Tighe
 Advanced Genome Technology Lab
 Vermont Cancer Center at the University of Vermont
 149 Beaumont Avenue
 Health Science Research Bd RM 305
 Burlington Vermont USA 05405
 lab  802-656-AGTC (2482)
 cell 802-999-
 
 
 

