[galaxy-user] Creating a galaxy tool in R - You must not use 8-bit bytestrings

2012-04-24 Thread Dan Tenenbaum
Hello,

I'm a galaxy newbie and running into several issues trying to adapt an
R script to be a galaxy tool.

I'm looking at the XY plotting tool for guidance
(tools/plot/xy_plot.xml), but I decided not to embed my script in XML,
but instead have it in a separate script file, that way I can still
run it from the command line and make sure it works as I make
incremental changes. (So my script starts with args <-
commandArgs(TRUE)). Also, if it doesn't work, this suggests to me that
there is a problem with my galaxy configuration.

First, I tried using the r_wrapper.sh script that comes with the XY
plotting tool,  but it threw away my arguments:

An error occurred running this job: ARGUMENT
'/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_4.dat'
__ignored__

ARGUMENT '/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_3.dat'
__ignored__

ARGUMENT 'Fly' __ignored__

ARGUMENT 'Tagwise' __ignored__

etc.

So then I tried just switching to Rscript:

  <command interpreter="bash">Rscript RNASeq.R $countsTsv $designTsv
$organism $dispersion $minimumCountsPerMillion
$minimumSamplesPerTranscript $out_file1 $out_file2</command>

(My script produces as output a csv file and a pdf file. The final two
arguments I'm passing are the names of those files.)

But then I get an error that Rscript can't be found.

So I wrote a little wrapper script, Rscript_wrapper.sh:

#!/bin/sh

Rscript $*

And called that:
  <command interpreter="bash">Rscript_wrapper.sh RNASeq.R $countsTsv
$designTsv $organism $dispersion $minimumCountsPerMillion
$minimumSamplesPerTranscript $out_file1 $out_file2</command>

Then I got an error that RNASeq.R could not be found.

So then I added the absolute path to my R script to the command tag.
This seemed to work (that is, it got me further, to the next error),
but I'm not sure why I had to do this; in all the other tools I'm
looking at, the directory to the script to run does not have to be
specified; I assumed that the command would run in the appropriate
directory.
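
One way to avoid hard-coding an absolute path is to have the wrapper resolve the R script relative to its own location. The following is a minimal sketch in Python, assuming the wrapper file sits in the same directory as RNASeq.R and that Rscript is on the PATH seen by the job. The helper name build_rscript_command is hypothetical and not part of Galaxy, and the sketch does not explain why the job's working directory differs from the tool directory.

```python
import os


def build_rscript_command(script_name, args, wrapper_dir=None):
    """Return the argv list for an R script that lives next to this wrapper.

    build_rscript_command is a hypothetical helper, not a Galaxy function.
    """
    if wrapper_dir is None:
        # Directory containing this wrapper file, independent of the
        # directory the job happens to be started from.
        wrapper_dir = os.path.dirname(os.path.abspath(__file__))
    script_path = os.path.join(wrapper_dir, script_name)
    # Returning a list (rather than one shell string) keeps arguments
    # that contain spaces intact when passed to subprocess.
    return ["Rscript", script_path] + list(args)
```

The returned list could be handed to subprocess.call() from a small wrapper script, which would then take the place of Rscript_wrapper.sh in the command tag.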

So now I've specified the full path to my R script:

  <command interpreter="bash">Rscript_wrapper.sh
/Users/dtenenba/dev/galaxy-dist/tools/bioc/RNASeq.R $countsTsv
$designTsv $organism $dispersion $minimumCountsPerMillion
$minimumSamplesPerTranscript $out_file1 $out_file2</command>

And I get the following long error, which includes all of the output
of my R script:

Traceback (most recent call last):
  File "/Users/dtenenba/dev/galaxy-dist/lib/galaxy/jobs/runners/local.py", line 133, in run_job
    job_wrapper.finish( stdout, stderr )
  File "/Users/dtenenba/dev/galaxy-dist/lib/galaxy/jobs/__init__.py", line 725, in finish
    self.sa_session.flush()
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/scoping.py", line 127, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/session.py", line 1356, in flush
    self._flush(objects)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/session.py", line 1434, in _flush
    flush_context.execute()
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 261, in execute
    UOWExecutor().execute(self, tasks)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 753, in execute
    self.execute_save_steps(trans, task)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 768, in execute_save_steps
    self.save_objects(trans, task)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 759, in save_objects
    task.mapper._save_obj(task.polymorphic_tosave_objects, trans)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/mapper.py", line 1413, in _save_obj
    c = connection.execute(statement.values(value_params), params)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 824, in execute
    return Connection.executors[c](self, object, multiparams, params)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 874, in _execute_clauseelement
    return self.__execute_context(context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 896, in __execute_context
    self._cursor_execute(context.cursor, context.statement, context.parameters[0], context=context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 950, in _cursor_execute
    self._handle_dbapi_exception(e, statement, parameters, cursor, context)
  File 

Re: [galaxy-user] MegaBLAST output

2012-04-24 Thread Jennifer Jackson

Hi Sarah,

We appreciate all of the information you have provided and have been 
working here since yesterday to investigate the issue in more detail. 
This includes incorporating the additional data both you and Peter have 
been posting.


We don't have anything conclusive to report yet, but it would have been 
considerate to send an update this morning to let you know what we were 
doing. Please accept my apologies for not doing so - we are in fact in 
complete agreement that as the data currently presents, something odd 
appears to be going on.  Genbank updates would be unrelated as gi 
numbers do not change through time (although they can be retired, but 
again, not related to this case). The question of the mismatch in the 
wrapped Megablast output between gi and reported length is the open 
issue to be addressed.


A reply will be sent as soon as the root cause is determined. If there 
is indeed a problem, this would of course be considered a priority to 
correct. Not that we are expecting delays, but if your analysis is very 
urgent, using the BLAST+ BLASTN megablast wrapper that Peter authored, 
in a local or cloud instance, would be the best immediate remedy (this 
version has the standard 12 column output). Sequence length data could 
always be obtained from Genbank and added into these results using other 
Galaxy tools (column join, etc.).


Thank you and Peter both for the help and for your patience!

Best,

Jen
Galaxy team



On 4/24/12 1:50 PM, Sarah Hicks wrote:

So, Jen, I'm not sure if we're talking about the same ID change... I
am under the impression that GenBank does not change its GI numbers
for its entries. Plus, it's now looking like all sequence length
output info for each hit through Galaxy's megablast does not match to
the GI number output given by Galaxy megablast, but to the GI number
before it. Because the -1 rule is so consistent, it makes this seem
less and less like it has to do with NCBI changing its GI numbers to
make room for new entries or something. In other words, there is a
shift, as if a 1 was added to each NCBI GI number in galaxy before
galaxy produces the output file.

I need someone to tell me if I can trust the output. Basically I see
it this way. Every hit row from Galaxy megablast actually has
information for two NCBI entries: the one that shares the GI output
and the one before it that shares the sequence length output. Which
one is the hit I should be using?
Because on some occasions, the NCBI entry that shares the GI output
from galaxy is VERY distantly related to the NCBI entry that shares
the subject sequence length output from galaxy, and I don't know which
to pick. Is this problem well understood, yet?


On Tue, Apr 24, 2012 at 10:52 AM, Peter Cock <p.j.a.c...@googlemail.com> wrote:

On Mon, Apr 23, 2012 at 11:41 PM, Sarah Hicks <garlicsc...@gmail.com> wrote:

Peter, you requested an example, here are the first five hits for my
first query sequence (OTU#0)

0  324034994  527   93.23  266  13  5  1  265  22  283  7e-102  379.0
0  56181650   513   93.26  267  10  8  1  265  25  285  7e-102  379.0
0  314913953  582   91.79  268  13  9  1  265  24  285  2e-92   347.0
0  305670062  281   92.52  254  14  5  4  256  32  281  2e-92   347.0
0  310814066  1180  91.73  266  14  7  1  265  24  282  9e-92   345.0

You will notice there are 13 columns, one in addition to the 12 column
titles you explained. This is because there is a column between sseqID
and pident.


I see now - the megablast_wrapper.py calls megablast (from the old legacy
NCBI blast suite) which does indeed produce 12 column tabular output. But
the wrapper script then edits the output:

It appears to be splitting column 2 in two at the underscore intended to
give the match ID and the length. This puzzles me but I haven't used
the legacy BLAST tabular output for a while. On BLAST+ you can ask
for the query or subject length explicitly as their own columns so we
don't have this problem.

The megablast_wrapper.py also re-formats the floating point score in the
last column, apparently the NCBI style could cause problems with the
Galaxy filter tool.
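
The column handling described above can be sketched as follows. This is an illustration only, not the code in megablast_wrapper.py: it assumes the second field of the legacy 12-column row has the form `<gi>_<length>` and splits it at the last underscore, and it rewrites the score in the last column as a plain decimal. The function name and the sample row are hypothetical.

```python
def expand_hit_row(line):
    """Turn a 12-column tabular hit into 13 columns (hypothetical helper)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 12:
        raise ValueError("expected 12 tab-separated columns, got %d" % len(fields))
    # Assumed subject ID layout: "<gi>_<length>"; split at the last underscore.
    subject_id, subject_length = fields[1].rsplit("_", 1)
    # Render the score as a plain decimal with one digit after the point.
    score = "%.1f" % float(fields[11])
    return [fields[0], subject_id, subject_length] + fields[2:11] + [score]


# Made-up row for illustration only.
row = "q0\t12345_527\t93.23\t266\t13\t5\t1\t265\t22\t283\t7e-102\t3.79e+02"
print("\t".join(expand_hit_row(row)))
```

If the ID and the length were paired incorrectly before this step, the split itself would still succeed, so a sketch like this cannot by itself reveal a mismatch of the kind reported.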


In the metagenomic tutorial the first 4 columns are
explained, and column 3 is described as length of sequence in database
(or length of the subject sequence).

This is the problem column. The length of only one of the subject GI
numbers above match the subject length in NCBI. This has caused me to
wonder if I can trust the hit info. In all cases that I've checked,
when this happens the correct match is the listed GI value minus 1
(ie, in NCBI, gi|324034994 is not 527nt long, but 324034993 IS 527nt
long).


That is strange.

Peter




--
Jennifer Jackson
http://galaxyproject.org

Re: [galaxy-user] MegaBLAST output

2012-04-24 Thread Peter Cock
On Tue, Apr 24, 2012 at 10:24 PM, Jennifer Jackson <j...@bx.psu.edu> wrote:

 ..., using
 the BLAST+ BLASTN megablast wrapper that Peter authored, in a local or cloud
 instance, would be the best immediate remedy (this version has the standard
 12 column output). Sequence length data could always be obtained from
 Genbank and added into these results using other Galaxy tools (column join,
 etc.).

Getting the query and match sequence lengths is even simpler than
that with the BLAST+ wrappers - just select the extended tabular output.
Of course, you'll need to adjust the downstream analysis to take into
account the different column numbers.

Peter
___
The Galaxy User list should be used for the discussion of
Galaxy analysis and other features on the public server
at usegalaxy.org.  Please keep all replies on the list by
using reply all in your mail client.  For discussion of
local Galaxy instances and the Galaxy source code, please
use the Galaxy Development list:

  http://lists.bx.psu.edu/listinfo/galaxy-dev

To manage your subscriptions to this and other Galaxy lists,
please use the interface at:

  http://lists.bx.psu.edu/


Re: [galaxy-user] How to replace ensembl gene ID with gene names in Cuffdiff output?

2012-04-24 Thread Jennifer Jackson

Hi Wei,

BioMart has tools to extract tabular data that maps Ensembl transcript 
identifiers to alternate identifiers, gene symbols, etc. See the tool 
under Get Data -> BioMart Central server. You'll likely have to map 
from Ensembl transcriptID -> HGNC transcript -> HGNC gene


http://uswest.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0223972;r=1:11872-14412;t=ENST0515242

Other sources also have this type of data:
http://www.genenames.org/data/hgnc_data.php?hgnc_id=37102

The Cuffdiff file will also need to be prepared, the Ensembl transcripts 
need to be in tabular format. 'Convert Delimiters to Tab' is the correct 
tool choice.


Then, once both files have the data you wish to use in tabular format, 
join the data on common keys using tools in 'Join, Subtract and Group -> 
Column Join'.


To finish, use tools in 'Text Manipulation', 'Filter and Sort', and 
'Join, Subtract and Group' to format the data so that it is useful for 
your purposes.
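
The same join can be pictured outside Galaxy. This is a minimal sketch, assuming a mapping from transcript ID to gene name has already been exported (for example from BioMart) and that the third Cuffdiff column holds a comma-separated list of transcript IDs. The IDs, names and function name below are placeholders, not real annotation.

```python
def annotate_gene_column(row, id_to_name, gene_col=2):
    """Replace transcript IDs in one tab-separated row with mapped gene names."""
    fields = row.rstrip("\n").split("\t")
    names = []
    for transcript_id in fields[gene_col].split(","):
        # Keep the original ID when the mapping has no entry for it.
        name = id_to_name.get(transcript_id, transcript_id)
        if name not in names:  # collapse transcripts that share one gene name
            names.append(name)
    fields[gene_col] = ",".join(names)
    return "\t".join(fields)


# Placeholder mapping and row for illustration only.
mapping = {"TX1": "GENE_A", "TX2": "GENE_A", "TX3": "GENE_B"}
row = "XLOC_1\tXLOC_1\tTX1,TX2,TX3,TX4\tchr1:1-100"
print(annotate_gene_column(row, mapping))
```

Unmapped IDs are left in place rather than dropped, so gaps in the mapping stay visible in the output.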


If you need help, we have screencasts that cover most of these text 
manipulation operations: Galaxy 101 + Tool tutorials (top 6)

http://wiki.g2.bx.psu.edu/Learn/Screencasts

Hopefully this helps,

Jen
Galaxy team

On 4/19/12 1:43 PM, Wei Liao wrote:

Hi all,
I have the following Cuffdiff output from gene differential expression
testing:
test_id  gene_id  gene  locus  sample_1  sample_2  status  value_1  value_2  log2(fold_change)  test_stat  p_value  q_value
XLOC_01  XLOC_01  ENST0450305,ENST0456328,ENST0515242,ENST0518655  chr1:11868-31109  NORMAL  1  OK  0.558797  0.84004  0.588134  -0.44598  0.655611  0.767628

There are four transcript IDs belonging to gene DDX11L1. My question is
how to replace these IDs with official gene names?
Thanks,

--
Wei Liao
Research Scientist,
Brentwood Biomedical Research Institute
16111 Plummer St.
Bldg 7, Rm D-122
North Hills, CA 91343
818-891-7711 ext 7645





--
Jennifer Jackson
http://galaxyproject.org


Re: [galaxy-user] MegaBLAST output

2012-04-24 Thread Jennifer Jackson

Thanks Peter,

Excellent point. From there, the Cut tool could be used to reorganize 
the output to exactly match that of the 13-column regular megablast 
output. So, no external data needed, no tool modifications needed.
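
The reordering that the Cut tool would perform can be sketched in Python. This assumes the subject length sits in the last column of the extended tabular output, which should be checked against the wrapper's column list for the installed version. The function name and the shortened sample row are hypothetical.

```python
def to_legacy_layout(fields, slen_index=-1):
    """Move the subject-length column to position 3, keeping the 12 standard columns."""
    standard = fields[:12]
    subject_length = fields[slen_index]
    return standard[:2] + [subject_length] + standard[2:]


# Shortened made-up row: 12 standard columns plus two extra columns.
extended = ["q0", "s1", "93.23", "266", "13", "5", "1", "265",
            "22", "283", "7e-102", "379", "265", "527"]
print("\t".join(to_legacy_layout(extended)))
```

The slen_index parameter is there because the position of the length column is an assumption here, not something stated in this thread.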


This can't be done on the main public Galaxy instance as BLAST+ is not 
available there, but for any local/cloud instance this is an alternative 
certainly worth testing.


Best,

Jen
Galaxy team

On 4/24/12 2:36 PM, Peter Cock wrote:

On Tue, Apr 24, 2012 at 10:24 PM, Jennifer Jackson <j...@bx.psu.edu> wrote:


..., using
the BLAST+ BLASTN megablast wrapper that Peter authored, in a local or cloud
instance, would be the best immediate remedy (this version has the standard
12 column output). Sequence length data could always be obtained from
Genbank and added into these results using other Galaxy tools (column join,
etc.).


Getting the query and match sequence lengths is even simpler than
that with the BLAST+ wrappers - just select the extended tabular output.
Of course, you'll need to adjust the downstream analysis to take into
account the different column numbers.

Peter


--
Jennifer Jackson
http://galaxyproject.org


Re: [galaxy-user] Cuffdiff

2012-04-24 Thread Jennifer Jackson

Hi Ateequr,

This post from today has information another member found at 
seqanswers.com, directly from the CuffLinks/Merge/Diff tool author:

http://user.list.galaxyproject.org/Re-1-cuffcompare-or-cuffmerge-td4581029.html

Best,

Jen
Galaxy team

On 4/17/12 8:00 AM, Ateequr Rehman wrote:

Dear All,
I have a simple question about cuffdiff.
Should we run cuffdiff on the merged transcript file (produced by
cuffmerge) and concatenated data sets, or directly on the files produced
by cufflinks? In the latter case, I have two transcript files resulting
from cufflinks on samples 1 and 2 respectively, and the results using
sample 1 as the transcripts are not the same as when I use sample 2 as
the transcripts.

I am a bit confused about what the correct way is.

Any help is very much welcome.

Best
ateeq
Ateequr Rehman
House No. 2 ground floor
Blauenstr. 10
79115 Freiburg im Breisgau




--
Jennifer Jackson
http://galaxyproject.org