[galaxy-user] Creating a galaxy tool in R - You must not use 8-bit bytestrings
Hello,

I'm a galaxy newbie and running into several issues trying to adapt an R script to be a galaxy tool. I'm looking at the XY plotting tool for guidance (tools/plot/xy_plot.xml), but I decided not to embed my script in XML, but instead have it in a separate script file, that way I can still run it from the command line and make sure it works as I make incremental changes. (So my script starts with args <- commandArgs(TRUE)). Also, if it doesn't work, this suggests to me that there is a problem with my galaxy configuration.

First, I tried using the r_wrapper.sh script that comes with the XY plotting tool, but it threw away my arguments:

An error occurred running this job:
ARGUMENT '/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_4.dat' __ignored__
ARGUMENT '/Users/dtenenba/dev/galaxy-dist/database/files/000/dataset_3.dat' __ignored__
ARGUMENT 'Fly' __ignored__
ARGUMENT 'Tagwise' __ignored__
etc.

So then I tried just switching to Rscript:

<command interpreter="bash">Rscript RNASeq.R $countsTsv $designTsv $organism $dispersion $minimumCountsPerMillion $minimumSamplesPerTranscript $out_file1 $out_file2</command>

(My script produces as output a csv file and a pdf file. The final two arguments I'm passing are the names of those files.)

But then I get an error that Rscript can't be found. So I wrote a little wrapper script, Rscript_wrapper.sh:

#!/bin/sh
Rscript $*

And called that:

<command interpreter="bash">Rscript_wrapper.sh RNASeq.R $countsTsv $designTsv $organism $dispersion $minimumCountsPerMillion $minimumSamplesPerTranscript $out_file1 $out_file2</command>

Then I got an error that RNASeq.R could not be found. So then I added the absolute path to my R script to the command tag. This seemed to work (that is, it got me further, to the next error), but I'm not sure why I had to do this; in all the other tools I'm looking at, the directory to the script to run does not have to be specified; I assumed that the command would run in the appropriate directory.
So now I've specified the full path to my R script:

<command interpreter="bash">Rscript_wrapper.sh /Users/dtenenba/dev/galaxy-dist/tools/bioc/RNASeq.R $countsTsv $designTsv $organism $dispersion $minimumCountsPerMillion $minimumSamplesPerTranscript $out_file1 $out_file2</command>

And I get the following long error, which includes all of the output of my R script:

Traceback (most recent call last):
  File "/Users/dtenenba/dev/galaxy-dist/lib/galaxy/jobs/runners/local.py", line 133, in run_job
    job_wrapper.finish( stdout, stderr )
  File "/Users/dtenenba/dev/galaxy-dist/lib/galaxy/jobs/__init__.py", line 725, in finish
    self.sa_session.flush()
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/scoping.py", line 127, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/session.py", line 1356, in flush
    self._flush(objects)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/session.py", line 1434, in _flush
    flush_context.execute()
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 261, in execute
    UOWExecutor().execute(self, tasks)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 753, in execute
    self.execute_save_steps(trans, task)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 768, in execute_save_steps
    self.save_objects(trans, task)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/unitofwork.py", line 759, in save_objects
    task.mapper._save_obj(task.polymorphic_tosave_objects, trans)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/orm/mapper.py", line 1413, in _save_obj
    c = connection.execute(statement.values(value_params), params)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 824, in execute
    return Connection.executors[c](self, object, multiparams, params)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 874, in _execute_clauseelement
    return self.__execute_context(context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 896, in __execute_context
    self._cursor_execute(context.cursor, context.statement, context.parameters[0], context=context)
  File "/Users/dtenenba/dev/galaxy-dist/eggs/SQLAlchemy-0.5.6_dev_r6498-py2.7.egg/sqlalchemy/engine/base.py", line 950, in _cursor_execute
    self._handle_dbapi_exception(e, statement, parameters, cursor, context)
  File
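The path problem described in this message (the wrapper finding neither Rscript nor RNASeq.R unless an absolute path is given) can be sidestepped by having the wrapper resolve the script relative to its own location. The following is only a sketch of that idea in Python, not how Galaxy itself resolves tool paths: the helper name build_rscript_command and its tool_dir parameter are hypothetical, and it assumes Rscript is reachable on the PATH of the user running Galaxy. Passing the arguments as a list also avoids the word splitting that an unquoted `$*` in a shell wrapper performs on arguments containing spaces.

```python
import os


def build_rscript_command(script_name, args, tool_dir=None, rscript="Rscript"):
    """Build the argument list for running an R script stored next to this wrapper.

    tool_dir is a hypothetical parameter: when omitted, the directory that
    contains this wrapper file is used, so the R script is found no matter
    what the current working directory of the job is.
    """
    if tool_dir is None:
        tool_dir = os.path.dirname(os.path.abspath(__file__))
    script_path = os.path.join(tool_dir, script_name)
    # Each argument stays a separate list element, so values with spaces survive intact.
    return [rscript, script_path] + list(args)


# A wrapper's entry point could then run, for example:
#   import subprocess, sys
#   sys.exit(subprocess.call(build_rscript_command("RNASeq.R", sys.argv[1:])))
```

Inside the R script, commandArgs(TRUE) would receive the same arguments in the same order as before.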
Re: [galaxy-user] MegaBLAST output
Hi Sarah,

We appreciate all of the information you have provided and have been working here since yesterday to investigate the issue in more detail. This includes incorporating the additional data both you and Peter have been posting. We don't have anything conclusive to report yet, but it would have been considerate to send an update this morning to let you know what we were doing. Please accept my apologies for not doing so - we are in fact in complete agreement that, as the data currently presents, something odd appears to be going on.

Genbank updates would be unrelated, as gi numbers do not change through time (they can be retired, but again, that is not related to this case). The mismatch between gi and reported length in the wrapped Megablast output is the open issue to be addressed. A reply will be sent as soon as the root cause is determined. If there is indeed a problem, correcting it would of course be considered a priority.

Not that we are expecting delays, but if your analysis is very urgent, using the BLAST+ BLASTN megablast wrapper that Peter authored, in a local or cloud instance, would be the best immediate remedy (this version has the standard 12 column output). Sequence length data could always be obtained from Genbank and added into these results using other Galaxy tools (column join, etc.).

Thank you and Peter both for the help and for your patience!

Best,
Jen
Galaxy team

On 4/24/12 1:50 PM, Sarah Hicks wrote:

So, Jen, I'm not sure if we're talking about the same ID change... I am under the impression that GenBank does not change its GI numbers for its entries. Plus, it's now looking like all sequence length output info for each hit through Galaxy's megablast does not match the GI number output given by Galaxy megablast, but the GI number before it. Because the -1 rule is so consistent, it makes this seem less and less like it has to do with NCBI changing its GI numbers to make room for new entries or something.
In other words, there is a shift, as if a 1 was added to each NCBI GI number in galaxy before galaxy produces the output file.

I need someone to tell me if I can trust the output. Basically I see it this way. Every hit row from Galaxy megablast actually has information for two NCBI entries: the one that shares the GI output and the one before it that shares the sequence length output. Which one is the hit I should be using? Because on some occasions, the NCBI entry that shares the GI output from galaxy is VERY distantly related to the NCBI entry that shares the subject sequence length output from galaxy, and I don't know which to pick.

Is this problem well understood yet?

On Tue, Apr 24, 2012 at 10:52 AM, Peter Cock p.j.a.c...@googlemail.com wrote:

On Mon, Apr 23, 2012 at 11:41 PM, Sarah Hicks garlicsc...@gmail.com wrote:

Peter, you requested an example, here are the first five hits for my first query sequence (OTU#0)

0 324034994 527 93.23 266 13 5 1 265 22 283 7e-102 379.0
0 56181650 513 93.26 267 10 8 1 265 25 285 7e-102 379.0
0 314913953 582 91.79 268 13 9 1 265 24 285 2e-92 347.0
0 305670062 281 92.52 254 14 5 4 256 32 281 2e-92 347.0
0 310814066 1180 91.73 266 14 7 1 265 24 282 9e-92 345.0

You will notice there are 13 columns, one in addition to the 12 column titles you explained. This is because there is a column between sseqID and pident.

I see now - the megablast_wrapper.py calls megablast (from the old legacy NCBI blast suite), which does indeed produce 12 column tabular output. But the wrapper script then edits the output: it appears to be splitting column 2 in two at the underscore, intended to give the match ID and the length. This puzzles me, but I haven't used the legacy BLAST tabular output for a while. On BLAST+ you can ask for the query or subject length explicitly as their own columns, so we don't have this problem.
The megablast_wrapper.py also re-formats the floating point score in the last column; apparently the NCBI style could cause problems with the Galaxy filter tool.

In the metagenomic tutorial the first 4 columns are explained, and column 3 is described as length of sequence in database (or length of the subject sequence). This is the problem column. The length of only one of the subject GI numbers above matches the subject length in NCBI. This has caused me to wonder if I can trust the hit info. In all cases that I've checked, when this happens the correct match is the listed GI value minus 1 (i.e., in NCBI, gi|324034994 is not 527nt long, but 324034993 IS 527nt long).

That is strange.

Peter

--
Jennifer Jackson
http://galaxyproject.org
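The column split that Peter describes (a 12-column hit row becoming 13 columns because the second field is cut at an underscore into match ID and length) can be illustrated roughly as follows. This is a sketch of the described behaviour only, not the actual megablast_wrapper.py code: the function name split_subject_column is hypothetical, and the joined form `324034994_527` is an assumption about how the database entry was labelled.

```python
def split_subject_column(fields):
    """Turn a 12-column hit into 13 columns by splitting field 2 at its last underscore.

    Assumes (hypothetically) that the subject field looks like '<gi>_<length>'.
    """
    subject_id, sep, subject_len = fields[1].rpartition("_")
    if not sep:
        raise ValueError("no underscore in subject field: %r" % fields[1])
    # Query ID, then the two halves of the subject field, then the remaining 10 columns.
    return [fields[0], subject_id, subject_len] + fields[2:]


# First hit from the thread, with the assumed joined subject field.
row = ["0", "324034994_527", "93.23", "266", "13", "5",
       "1", "265", "22", "283", "7e-102", "379.0"]
print("\t".join(split_subject_column(row)))
```

With a split like this, the ID and the length in each output row can only disagree if they were already paired incorrectly in the label being split, which is consistent with the wrapper's input being the place to look.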
Re: [galaxy-user] MegaBLAST output
On Tue, Apr 24, 2012 at 10:24 PM, Jennifer Jackson j...@bx.psu.edu wrote:

..., using the BLAST+ BLASTN megablast wrapper that Peter authored, in a local or cloud instance, would be the best immediate remedy (this version has the standard 12 column output). Sequence length data could always be obtained from Genbank and added into these results using other Galaxy tools (column join, etc.).

Getting the query and match sequence lengths is even simpler than that with the BLAST+ wrappers - just select the extended tabular output. Of course, you'll need to adjust the downstream analysis to take into account the different column numbers.

Peter

___
The Galaxy User list should be used for the discussion of Galaxy analysis and other features on the public server at usegalaxy.org. Please keep all replies on the list by using "reply all" in your mail client. For discussion of local Galaxy instances and the Galaxy source code, please use the Galaxy Development list:
http://lists.bx.psu.edu/listinfo/galaxy-dev
To manage your subscriptions to this and other Galaxy lists, please use the interface at:
http://lists.bx.psu.edu/
Re: [galaxy-user] How to replace ensembl gene ID with gene names in Cuffdiff output?
Hi Wei,

BioMart has tools to extract tabular data that maps Ensembl transcript identifiers to alternate identifiers, gene symbols, etc. See the tool under Get Data -> BioMart Central server. You'll likely have to map from Ensembl transcriptID -> HGNC transcript -> HGNC gene:
http://uswest.ensembl.org/Homo_sapiens/Gene/Summary?db=core;g=ENSG0223972;r=1:11872-14412;t=ENST0515242

Other sources also have this type of data:
http://www.genenames.org/data/hgnc_data.php?hgnc_id=37102

The Cuffdiff file will also need to be prepared: the Ensembl transcripts, which are comma-separated in a single column, need to be in tabular format. 'Convert Delimiters to Tab' is the correct tool choice.

Then, once both files have the data you wish to use in tabular format, join the data on common keys using tools in 'Join, Subtract and Group -> Column Join'.

To finish, use tools in 'Text Manipulation', 'Filter and Sort', and 'Join, Subtract and Group' to format the data so that it is useful for your purposes.

If you need help, we have screencasts that cover most of these text manipulation operations: Galaxy 101 + Tool tutorials (top 6)
http://wiki.g2.bx.psu.edu/Learn/Screencasts

Hopefully this helps,
Jen
Galaxy team

On 4/19/12 1:43 PM, Wei Liao wrote:

Hi all,

I had the following Cuffdiff output from gene differential expression testing:

test_id gene_id gene locus sample_1 sample_2 status value_1 value_2 log2(fold_change) test_stat p_value q_value
XLOC_01 XLOC_01 ENST0450305,ENST0456328,ENST0515242,ENST0518655 chr1:11868-31109 NORMAL 1 OK 0.558797 0.84004 0.588134 -0.44598 0.655611 0.767628

There are four transcript IDs that belong to gene DDXL11L1. My question is how to replace these IDs with official gene names?

Thanks,

--
Wei Liao
Research Scientist, Brentwood Biomedical Research Institute
16111 Plummer St. Bldg 7, Rm D-122
North Hills, CA 91343
818-891-7711 ext 7645

--
Jennifer Jackson
http://galaxyproject.org
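Outside of Galaxy, the join step described above amounts to a lookup from transcript ID to gene name. The following Python sketch shows that logic under stated assumptions: the mapping dictionary stands in for a table exported from BioMart (its contents here are made up from the IDs and gene name quoted in the question), and gene_names_for is a hypothetical helper, not a Galaxy or Cufflinks function.

```python
def gene_names_for(transcript_field, transcript_to_gene):
    """Replace a comma-separated list of transcript IDs with their unique gene names.

    IDs missing from the mapping are kept unchanged, so nothing is silently dropped.
    """
    names = []
    for transcript_id in transcript_field.split(","):
        if not transcript_id:
            continue
        name = transcript_to_gene.get(transcript_id, transcript_id)
        if name not in names:  # keep first-seen order, skip repeats
            names.append(name)
    return ",".join(names)


# Hypothetical mapping, as it might be loaded from a two-column BioMart export.
mapping = {
    "ENST0450305": "DDXL11L1",
    "ENST0456328": "DDXL11L1",
    "ENST0515242": "DDXL11L1",
    "ENST0518655": "DDXL11L1",
}
print(gene_names_for("ENST0450305,ENST0456328,ENST0515242,ENST0518655", mapping))
```

Collapsing repeats means a row whose transcripts all belong to one gene ends up with a single name, which is usually what is wanted in the gene column.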
Re: [galaxy-user] MegaBLAST output
Thanks Peter,

Excellent point. From there, the Cut tool could be used to reorganize the output to exactly match that of the 13-column regular megablast output. So, no external data needed, no tool modifications needed. This can't be done on the main public Galaxy instance, as BLAST+ is not available there, but for any local/cloud instance this is an alternative certainly worth testing.

Best,
Jen
Galaxy team

On 4/24/12 2:36 PM, Peter Cock wrote:

On Tue, Apr 24, 2012 at 10:24 PM, Jennifer Jackson j...@bx.psu.edu wrote:

..., using the BLAST+ BLASTN megablast wrapper that Peter authored, in a local or cloud instance, would be the best immediate remedy (this version has the standard 12 column output). Sequence length data could always be obtained from Genbank and added into these results using other Galaxy tools (column join, etc.).

Getting the query and match sequence lengths is even simpler than that with the BLAST+ wrappers - just select the extended tabular output. Of course, you'll need to adjust the downstream analysis to take into account the different column numbers.

Peter

--
Jennifer Jackson
http://galaxyproject.org
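The Cut-tool rearrangement suggested here boils down to picking the subject length out of the extended tabular row and placing it after the subject ID. Below is a small Python sketch of that reordering. The function name to_13_columns is hypothetical, and the position of the subject-length column is deliberately left as a parameter: the example uses index 23 (the 24th column) as an assumption about the extended layout, which should be checked against the column list of the BLAST+ wrapper version actually installed.

```python
def to_13_columns(fields, slen_index):
    """Build a 13-column row: query ID, subject ID, subject length, then standard columns 3-12.

    slen_index is the zero-based position of the subject length in the extended row.
    """
    if len(fields) <= slen_index:
        raise ValueError("row has %d columns, no column at index %d"
                         % (len(fields), slen_index))
    return fields[:2] + [fields[slen_index]] + fields[2:12]


# Placeholder row with 24 labelled columns, c1..c24, standing in for real output.
extended_row = ["c%d" % i for i in range(1, 25)]
print("\t".join(to_13_columns(extended_row, slen_index=23)))
```

In Galaxy's Cut tool the equivalent selection would list the same columns in the same order, using one-based column names.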
Re: [galaxy-user] Cuffdiff
Hi Ateequr,

This post from today has information another member found at seqanswers.com, directly from the CuffLinks/Merge/Diff tool author:
http://user.list.galaxyproject.org/Re-1-cuffcompare-or-cuffmerge-td4581029.html

Best,
Jen
Galaxy team

On 4/17/12 8:00 AM, Ateequr Rehman wrote:

Dear All,

I have a simple question about cuffdiff: should we run cuffdiff on the merged transcript file (produced by cuffmerge) and concatenated data sets, or directly on the files produced by cufflinks? In the latter case, I have two transcript files resulting from cufflinks on samples 1 and 2 respectively, and the results when using sample 1 as transcripts are not the same as when I am using sample 2 as transcripts.

I am a bit confused about what the correct way should be. Any help is very much welcome.

Best,
ateeq

Ateequr Rehman
House No. 2 ground floor
Blauenstr. 10
79115 Freiburg im Breisgau

--
Jennifer Jackson
http://galaxyproject.org