[galaxy-dev] Uploading and outputting multiple files
Hi List, Our tools require an array of input files and also produce an array of files. This seems like a pretty standard task, but unfortunately Galaxy doesn't support it yet. I wonder if somebody has already implemented this kind of 'file array' data type? I.e. the user selects a local directory and uploads all its files as a single dataset. Even better if the user could specify a regular expression for file selection. I know that there are other options, i.e. tarring the files or using a composite data type, but they don't seem natural. Any hints? ___ Please keep all replies on the list by using reply all in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
Re: [galaxy-dev] pass more information on a dataset merge
Hi John, I tried your galaxy-central-homogeneous-composite-datatypes implementation, works great, thank you (and Jorrit). A couple of fixes: 1. Add multi_upload.xml to tool_conf.xml. 2. lib/galaxy/tools/parameters/grouping.py line 322 (in get_filenames( context )) - if ftp_files is not None: Remove the is not None check: ftp_files can be an empty list [], which is not None, so execution reaches line 331, user_ftp_dir = os.path.join( trans.app.config.ftp_upload_dir, trans.user.email ), which throws an exception if ftp_upload_dir isn't set. Alex -Original Message- From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of John Chilton Sent: Tuesday, 16 October 2012 1:07 AM To: Jorrit Boekel Cc: galaxy-dev@lists.bx.psu.edu Subject: Re: [galaxy-dev] pass more information on a dataset merge Here is an implementation of the implicit multi-file composite datatypes piece of that idea. I think the implicit parallelism may be harder. https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/compare Jorrit, do you have any objection to me trying to get this included in galaxy-central (this is 95% code I stole from you)? I made the changes against a clean galaxy-central fork and included nothing proteomics-specific in anticipation of trying to do that. I have talked with Jim Johnson about the idea and he believes it would be useful for his mothur metagenomics tools, so the idea is valuable outside of proteomics. Galaxy team, would you be okay with including this, and if so, is there anything you would like to see, either at a high level or at the level of the actual implementation?
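The grouping.py fix Alex describes boils down to a truthiness check. A minimal sketch of the behaviour (a hypothetical reduction with made-up function names, not the actual Galaxy code):

```python
import os

# Hypothetical reduction of the grouping.py logic, for illustration only:
# ftp_files can legitimately be an empty list, which is not None, so an
# `is not None` guard falls through to the os.path.join call and blows
# up when ftp_upload_dir is unset.
def get_user_ftp_paths(ftp_files, ftp_upload_dir, user_email):
    # Fixed guard: `if ftp_files:` skips both None and the empty list.
    if ftp_files:
        if ftp_upload_dir is None:
            raise ValueError("ftp_upload_dir is not configured")
        user_ftp_dir = os.path.join(ftp_upload_dir, user_email)
        return [os.path.join(user_ftp_dir, name) for name in ftp_files]
    return []
```

With the truthiness guard, an empty upload list no longer touches the FTP configuration at all.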
-John John Chilton Senior Software Developer University of Minnesota Supercomputing Institute Office: 612-625-0917 Cell: 612-226-9223 Bitbucket: https://bitbucket.org/jmchilton Github: https://github.com/jmchilton Web: http://jmchilton.net On Mon, Oct 8, 2012 at 9:24 AM, John Chilton chil...@msi.umn.edu wrote: Jim Johnson and I have been discussing that approach to handling fractionated proteomics samples as well (composite datatypes, not the specifics of the interface for parallelizing). My perspective has been that Galaxy should be augmented with better native mechanisms for grouping objects in histories, operating over those groups, building workflows that involve arbitrary numbers of inputs, etc... Composite data types are kind of a kludge; I think they are more useful for grouping HTML files together when you don't care about operating on the constituent parts and just want to view the pages as a report or something. With this proteomic data we are working with, the individual pieces are really interesting, right? You want to operate on the individual pieces with the full array of tools (not just these special tools that have the logic for dealing with the composite datatypes), you want to visualize the files, etc... Putting these component pieces in the composite data type's extra_files_path really limits what you can do with the pieces in Galaxy. I have a vague idea of something that I think could bridge some of the gaps between the approaches (though I have no clue on the feasibility). Looking through your implementation on bitbucket, it looks like you are defining your core datatypes (MS2, CruxSequest) as subclasses of this composite data type (CompositeMultifile). My recommendation would be to try to define plain datatypes for these core datatypes (MS2, CruxSequest) and then have the separate composite datatype sort of delegate to the plain datatypes.
You could then continue to explicitly declare subclasses of the composite datatype (maybe MS2Set, CruxSequestSet), but also maybe augment the tool XML so you can do implicit data type instances the way you can with tabular data, for instance (instead of defining columns you would define the datatype to delegate to). The next step would be to make the parallelism implicit (i.e. pull it out of the tool wrapper). Your tool wrappers wouldn't reference the composite datatypes, they would reference the simple datatypes, but you could add a little icon next to any input that lets you replace a single input with a composite input for that type. It would be kind of like the run workflow page, where you can replace an input with multiple inputs. If a composite input (or inputs) is selected, the tool would then produce composite outputs. For the steps that actually combine multiple inputs - I think in your case this is percolator maybe (a tool like interprophet or Scaffold that merges peptide probabilities across runs and groups proteins) - then you could have the same sort of implicit replacement, but instead of for single inputs it could do that for multi-inputs (assuming the Galaxy powers that be accept my fixes for multi-input tool parameters:
Re: [galaxy-dev] Number of outputs = number of inputs
I tried the galaxy-central-homogeneous-composite-datatypes fork, works great. I have a similar problem where the number of output files varies; it seems that your approach might work for output files as well (not only input). Currently I'm trying to work out how to implement it; any help is appreciated. Alex -Original Message- From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of John Chilton Sent: Wednesday, 17 October 2012 12:49 AM To: Sascha Kastens Cc: galaxy-dev@lists.bx.psu.edu Subject: Re: [galaxy-dev] Number of outputs = number of inputs I don't believe this is possible in Galaxy right now. Are the outputs independent, or is information from all inputs used to produce all outputs? If they are independent, you can create a workflow containing just your tool with 1 input and 1 output and use the batch workflow mode to run it on multiple files and get multiple outputs. This is not a beautiful solution, but it gets the job done in some cases. Another thing to look at might be the discussion we are having on the thread 'pass more information on a dataset merge'. We have a fork (it's all work from Jorrit Boekel) of Galaxy that creates composite datatypes for each explicitly defined type that can hold collections of a single type. https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/compare This would hopefully let you declare that you can accept a collection of whatever your input type is and produce a collection of whatever your output is. Lots of downsides to this approach: it's not fully implemented and not included in Galaxy proper, and your outputs would be wrapped up in a composite datatype so they wouldn't be easily processable by downstream tools.
It would be good to have additional people hacking on it though :) -John On Tue, Oct 16, 2012 at 7:13 AM, Sascha Kastens s.kast...@gatc-biotech.com wrote: Hi all! I have a tool which takes one or more input files. For each input file one output is created, i.e. 1 input file - 1 output file, 2 input files - 2 output files, etc. What is the best way to handle this? I used the directions for handling multiple output files where the 'Number of Output datasets cannot be determined until tool run', which in my opinion is a bit inappropriate. BTW: The input files are added via the repeat tag, so maybe there is a similar thing for outputs? Thanks in advance! Cheers, Sascha
Re: [galaxy-dev] Retreiving a library data type in command
$filename.ext From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Simon Gladman Sent: Friday, 19 October 2012 4:38 PM To: galaxy-dev@lists.bx.psu.edu Subject: Re: [galaxy-dev] Retreiving a library data type in command oops, clicked send too early.. On 19 October 2012 16:33, Simon Gladman simon.glad...@monash.edu wrote: Hi all, Have searched the mailing list to no avail. Therefore: How can I get a data library's data_type in the command tagset? Example:

    <command>dosomething.pl filename.data_type</command>
    <inputs>
        <param type='data' format='fasta,fastq,sam,bam' name='filename' label='a data set from imported datasets...'/>
    </inputs>

So is the file of type fasta, fastq, sam or bam, and how can I get that into the command tag? Thanks in advance, Simon Gladman.
Re: [galaxy-dev] pass more information on a dataset merge
Hi John, what I don't get: I specify the output format m:grd, my tool generates multiple output files in the dataset_id_files folder, but the dataset_id.dat file is empty. I need to call this regenerate_primary_file() to add the HTML with the file list to the .dat file, but I'm not sure where? -Alex -Original Message- From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton Sent: Friday, 19 October 2012 6:16 AM To: Khassapov, Alex (CSIRO IMT, Clayton) Subject: Re: [galaxy-dev] pass more information on a dataset merge On Tue, Oct 16, 2012 at 11:11 PM, alex.khassa...@csiro.au wrote: Hi John, I am definitely interested in this idea, and not only me - we are currently working on moving a few scientific tools (not related to genomics) into the cloud using Galaxy. Great. My interests in Galaxy are mostly outside of genomics as well; it is good to have more people utilizing Galaxy in this way, because it will force the platform to become more generic and address broader use cases. We will try it further and see if we need any changes. For now one improvement would be nice: make dataset_id.dat contain a list of paths to the locations of the uploaded files, so when displaying the HTML page the user could just click on a link and download the file. Code that attempted to do this was in there, but obviously didn't work. I have now fixed it up. Thanks for beta testing. -John We are pretty new to Galaxy, so our understanding of Galaxy is pretty limited. Thanks again, Alex -Original Message- From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton Sent: Wednesday, 17 October 2012 3:21 AM To: Khassapov, Alex (CSIRO IMT, Clayton) Subject: Re: [galaxy-dev] pass more information on a dataset merge Wow, thanks for the rapid feedback! I have made the changes you have suggested. It seems you must be interested in this idea/implementation.
Let me know if you have specific use cases/requirements in mind and/or if you would be interested in write access to the repository. -John
Re: [galaxy-dev] pass more information on a dataset merge
1) One more question. My colleague likes the idea, but his composite data set's dataset_id.dat file contains only a plain list of uploaded files, not HTML like yours. I was wondering if it is possible to somehow pass a parameter to CompositeMultifile.regenerate_primary_file(dataset) to switch between HTML and plain list formats. I mean adding a 'hidden' parameter in the tool.xml file, but I'm not sure how to get at these tool parameters in the Galaxy source? 2) And one more question. I use your m:xxx format for the tool output; all files are generated in the dataset_id_files folder, but the dataset_id.dat file is empty. To force creation of the dataset_id.dat file I use the exec_after_process hook (code file=.py/ tag):

    for key, val in out_data.items():
        try:
            if not hasattr(val.dataset, 'name'):
                val.dataset.name = val.dataset.file_name
            val.datatype.regenerate_primary_file(val.dataset)
        except Exception as e:
            print 'ERROR: ' + str(e)

But it doesn't feel right; I wonder what is the proper way to use the m:xxx format for the output? -Alex -Original Message- From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton Sent: Saturday, 20 October 2012 1:40 AM To: Khassapov, Alex (CSIRO IMT, Clayton) Subject: Re: [galaxy-dev] pass more information on a dataset merge Hey Alex, I think the idea here is that your initially uploaded files would have different names, but after Jorrit's tool split/merge step they will all just be named after the dataset id (see screenshot), so you need the task_X at the end so they don't all just have the same name. I have not thought a whole lot about the naming thing; in general it seems like a tough problem and one that Galaxy itself doesn't do a particularly good job at. Jorrit, have you given any thought to this? I wonder if it would be feasible to use the initial uploaded name as a sort of prefix going forward.
So if I upload, say, fraction1.RAW fraction2.RAW fraction3.RAW and run a conversion step, maybe I could get: fraction1_dataset567.ms2 fraction2_dataset567.ms2 fraction3_dataset567.ms2 instead of dataset567.dat_task_0 dataset567.dat_task_1 dataset567.dat_task_2 Jorrit, do you mind if I give implementing that a shot? It seems like it would be a win to me. Am I going to hit some problem I don't see now (presumably we have to send some data from the split to the merge, and that might be tricky)? -John On Thu, Oct 18, 2012 at 7:00 PM, alex.khassa...@csiro.au wrote: Thanks John, I wonder what's the reason for appending _task_XX to the file names, why can't we just keep the original file names? Alex
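The prefix-naming scheme John floats above is simple enough to sketch. This helper is purely illustrative (not part of the fork), assuming the original upload name survives to the conversion step:

```python
import os

def prefixed_output_name(uploaded_name, dataset_id, ext):
    # Keep the uploaded file's basename as a prefix, so a converted
    # fraction stays recognisable instead of collapsing to bare
    # dataset ids with task suffixes.
    base, _ = os.path.splitext(os.path.basename(uploaded_name))
    return "%s_dataset%d.%s" % (base, dataset_id, ext)
```

For the example in the thread, fraction1.RAW converted under dataset 567 would come out as fraction1_dataset567.ms2.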
Re: [galaxy-dev] determination of errors
By default Galaxy checks stderr; if it's not empty, it returns an error. So if your tool doesn't fail (returns 0) but prints something to stderr, your tool will still fail in Galaxy. There's the stderr_wrapper.py workaround for that. On the other hand, if your tool returns non-zero but doesn't use stderr, Galaxy ignores the tool's return value. There are two ways around that: 1. Galaxy has the exit_code tag to specify which exit codes to handle: http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax#A.3Cexit_code.3E_tag_set So in my tool.xml I have:

    <stdio>
        <exit_code range="1:255" level="fatal" description="XLICTRecon.exe Exception" />
    </stdio>

2. A simple workaround in the Python wrapper: print something to stderr if the tool returns an error:

    returncode = subprocess.call(cmd)
    if returncode:
        sys.stderr.write('Error: returned ' + str(returncode))

-Alex -Original Message- From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Peter Cock Sent: Tuesday, 23 October 2012 2:30 AM To: David Hoover Cc: Galaxy Dev Subject: Re: [galaxy-dev] determination of errors On Mon, Oct 22, 2012 at 4:23 PM, David Hoover hoove...@helix.nih.gov wrote: How does Galaxy determine that a job has failed? It now depends on the individual tool's XML file. Does it simply see if the STDERR is empty? By default, yes. The tool's XML can specify particular regexes to look for, or decide based on the return code - but for the time being most of the tools still just look at stderr. See: http://wiki.g2.bx.psu.edu/Admin/Tools/Tool%20Config%20Syntax What happens if an application normally outputs to STDERR? Either use the new functionality in the XML definition, or do what older Galaxy tools did: use a wrapper script to hide/redirect stderr to avoid false positives. This is a problem for our local installation, as I have enabled it to run as the local user on the backend cluster.
If a user has an error in the .bashrc file, it will automatically write to STDERR, and all jobs, no matter what, are labelled as failing. In which case the user should see those errors and be able to do something about it, right? Peter
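The wrapper-script workaround both replies mention can be sketched like this (an illustrative stand-in for the stderr_wrapper.py idea, not Galaxy's actual script): capture both streams and forward stderr only when the exit code is non-zero, so a tool that merely chats on stderr still succeeds under Galaxy's default check.

```python
import subprocess
import sys

def run_quiet(cmd):
    # Capture both streams; a default-configured Galaxy fails any job
    # whose stderr is non-empty, so we only release stderr on a real
    # (non-zero exit code) failure.
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    stdout, stderr = proc.communicate()
    sys.stdout.write(stdout.decode())
    if proc.returncode != 0:
        sys.stderr.write(stderr.decode())
    return proc.returncode
```

A .bashrc that prints warnings would still be a problem under this scheme only when the job also exits non-zero, which is arguably the behaviour you want.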
Re: [galaxy-dev] Source code documentation
The API seems a bit of an overkill; as I understand it, it's useful for 'external' access via HTTP. My tools run inside Galaxy, and I should be able to use the Python code directly. From: Anthonius deBoer [mailto:thondeb...@me.com] Sent: Tuesday, 23 October 2012 12:16 PM To: Khassapov, Alex (CSIRO IMT, Clayton) Cc: galaxy-dev@lists.bx.psu.edu Subject: Re: [galaxy-dev] Source code documentation The API allows you to do some of that... If you pass it the ID of the object (input.id) you can do all kinds of requests with the API. Look in the scripts/api folder of your local Galaxy instance... NOTE: The API seems to be a bit of a stepchild, since there is no good documentation and it seems to be undeveloped to some extent. For instance, the biggest issue is that you cannot pass a workflow any parameters, only inputs and outputs... So caveat emptor! Regards, Thon de Boer, Ph.D. Bioinformatics Guru +1-650-799-6839 thondeb...@me.com LinkedIn Profile: http://www.linkedin.com/pub/thon-de-boer/1/1ba/a5b On Oct 22, 2012, at 5:26 PM, alex.khassa...@csiro.au wrote: Hi, I wonder if there's some kind of documentation (reference) for the Galaxy source? At the moment I have a couple of questions, for example: 1. How can I get the dataset object (in my Python wrapper) given the dataset name? 2. How can I access the job parameters (entered in the UI or 'hidden') in the Python code? In general, when I have this kind of question, where do I look? -Alex
Re: [galaxy-dev] Validation / validator questions
This can be done in validate(), see http://wiki.g2.bx.psu.edu/Admin/Tools/Custom%20Code Alex From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Lukasse, Pieter Sent: Tuesday, 23 October 2012 9:05 PM To: galaxy-dev@lists.bx.psu.edu Subject: [galaxy-dev] Validation / validator questions Hi, Is it possible to do validation of input fields depending on the value of OTHER input fields? Example 1: If field A is filled, then field B also should be filled. Example 2: Field B should be field A, etc. Pieter Lukasse Wageningen UR, Plant Research International Departments of Bioscience and Bioinformatics Wageningen Campus, Building 107, Droevendaalsesteeg 1, 6708 PB, Wageningen, the Netherlands +31-317480891; skype: pieter.lukasse.wur http://www.pri.wur.nl/
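The cross-field check Pieter asks about is plain Python once the incoming values are in hand. This is an illustrative helper with hypothetical names (not the Custom Code hook signature itself), showing the kind of logic a validate() hook described on that wiki page could apply:

```python
def check_dependent_fields(params):
    # Example 1 from the question: if field A is filled,
    # then field B must be filled too.
    errors = {}
    if params.get("field_a") and not params.get("field_b"):
        errors["field_b"] = "Field B is required when field A is filled."
    return errors
```

The hook would raise or report a tool error whenever the returned dict is non-empty.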
Re: [galaxy-dev] Fwd: pass more information on a dataset merge
Hi John, Do you think it's possible to create a test for your 'm:' format? I couldn't find how to specify multiple input files for the test. -Alex -Original Message- From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton Sent: Tuesday, 23 October 2012 7:59 AM To: Jorrit Boekel Cc: Khassapov, Alex (CSIRO IMT, Clayton) Subject: Re: Fwd: [galaxy-dev] pass more information on a dataset merge Hello again Jorrit, Great, I am glad we are largely on the same page here. I don't know when I will get a chance to look at this particular aspect; if you get there first that will be great, if not I will get there eventually. -John On Mon, Oct 22, 2012 at 2:51 AM, Jorrit Boekel jorrit.boe...@scilifelab.se wrote: IIRC, I implemented the task_X suffix (galaxy does so as well, but to the split subdirectories) to ensure jobs that contained multiple split datasets would be run in sync. Files from two datasets that belong together then get analysed together in subsequent steps. It would however be much nicer to retain original file names through a pipeline, or at least the possibility to retrieve them. Since the split/merge runs now actively look for and match files with identical 'task_x', it may be an option to do:

    fraction1.raw -> fraction1.raw_dataset_43.dat_task_0 -> fraction1.raw_dataset_44.dat_task_0
    fraction2.raw -> fraction2.raw_dataset_43.dat_task_1 -> fraction2.raw_dataset_44.dat_task_1

(Note that python starts counting at 0, while most researchers number their first fraction 1.) I wouldn't mind looking more into that as well, since it would be a big improvement UI-wise. cheers, jorrit On 10/19/2012 04:40 PM, John Chilton wrote: Jorrit, I meant to cc you on this response to Alex.
-- Forwarded message -- From: John Chilton chil0...@umn.edu Date: Fri, Oct 19, 2012 at 9:40 AM Subject: Re: [galaxy-dev] pass more information on a dataset merge To: alex.khassa...@csiro.au
Re: [galaxy-dev] pass more information on a dataset merge
Hi John, My colleague (Neil) has a bit of a problem with the multi-file support: When I try and use the 'Upload Directory of files' option I get the error below. Error Traceback:

    URL: http://140.253.78.218/library_common/upload_library_dataset
    Module weberror.evalexception.middleware:364 in respond
        app_iter = self.application(environ, detect_start_response)
    Module paste.debug.prints:98 in __call__
        environ, self.app)
    Module paste.wsgilib:539 in intercept_output
        app_iter = application(environ, replacement_start_response)
    Module paste.recursive:80 in __call__
        return self.application(environ, start_response)
    Module paste.httpexceptions:632 in __call__
        return self.application(environ, start_response)
    Module galaxy.web.framework.base:160 in __call__
        body = method( trans, **kwargs )
    Module galaxy.web.controllers.library_common:855 in upload_library_dataset
        **kwd )
    Module galaxy.web.controllers.library_common:1055 in upload_dataset
        json_file_path = upload_common.create_paramfile( trans, uploaded_datasets )
    Module galaxy.tools.actions.upload_common:342 in create_paramfile
        multifiles = uploaded_dataset.multifiles,
    AttributeError: 'Bunch' object has no attribute 'multifiles'

Any ideas? Should we check if the 'multifiles' attribute is set? Or is some other call missing which should set it to None if it's missing? -Alex -Original Message- From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton Sent: Wednesday, 17 October 2012 3:21 AM To: Khassapov, Alex (CSIRO IMT, Clayton) Subject: Re: [galaxy-dev] pass more information on a dataset merge Wow, thanks for the rapid feedback! I have made the changes you have suggested. It seems you must be interested in this idea/implementation. Let me know if you have specific use cases/requirements in mind and/or if you would be interested in write access to the repository.
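One defensive fix along the lines Alex suggests (a sketch against a minimal stand-in Bunch, not the actual upload_common.py change) is to read the attribute with a default, so library uploads that never set multifiles don't raise:

```python
class Bunch(object):
    """Minimal stand-in for galaxy.util.bunch.Bunch (attrs from kwargs)."""
    def __init__(self, **kwargs):
        self.__dict__.update(kwargs)

def get_multifiles(uploaded_dataset):
    # getattr with a default avoids the AttributeError when a library
    # upload builds a Bunch that never sets `multifiles`.
    return getattr(uploaded_dataset, "multifiles", None)
```

The alternative, as Alex hints, is for whichever code path builds the Bunch for library uploads to set multifiles = None itself.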
-John

On Mon, Oct 15, 2012 at 11:51 PM, alex.khassa...@csiro.au wrote:

Hi John, I tried your galaxy-central-homogeneous-composite-datatypes implementation; it works great, thank you (and Jorrit). A couple of fixes:

1. Add multi_upload.xml to tool_conf.xml.
2. In lib/galaxy/tools/parameters/grouping.py, line 322 (in get_filenames( context )): change "if ftp_files is not None:" to "if ftp_files:". When no FTP files are selected, ftp_files is an empty list [], which is not None, so line 331, user_ftp_dir = os.path.join( trans.app.config.ftp_upload_dir, trans.user.email ), throws an exception if ftp_upload_dir isn't set.

Alex

-Original Message-
From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of John Chilton
Sent: Tuesday, 16 October 2012 1:07 AM
To: Jorrit Boekel
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] pass more information on a dataset merge

Here is an implementation of the implicit multi-file composite datatypes piece of that idea. I think the implicit parallelism may be harder.

https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/compare

Jorrit, do you have any objection to me trying to get this included in galaxy-central (this is 95% code I stole from you)? I made the changes against a clean galaxy-central fork and included nothing proteomics-specific in anticipation of trying to do that. I have talked with Jim Johnson about the idea and he believes it would be useful for his mothur metagenomics tools, so the idea is valuable outside of proteomics. Galaxy team, would you be okay with including this, and if so, is there anything you would like to see, either at a high level or at the level of the actual implementation?
-John

John Chilton
Senior Software Developer
University of Minnesota Supercomputing Institute
Office: 612-625-0917 Cell: 612-226-9223
Bitbucket: https://bitbucket.org/jmchilton Github: https://github.com/jmchilton Web: http://jmchilton.net

On Mon, Oct 8, 2012 at 9:24 AM, John Chilton chil...@msi.umn.edu wrote:

Jim Johnson and I have been discussing that approach to handling fractionated proteomics samples as well (composite datatypes, not the specifics of the interface for parallelizing). My perspective has been that Galaxy should be augmented with better native mechanisms for grouping objects in histories, operating over those groups, building workflows that involve arbitrary numbers of inputs, etc. Composite datatypes are kind of a kludge; I think they are more useful for grouping HTML files together when you don't care about operating on the constituent parts and just want to view the pages as a report or something. With this proteomic data we are working with, the individual pieces are really interesting, right? You want to operate on the
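The grouping.py fix Alex describes above (replacing the is-not-None test with a plain truthiness test) can be sketched in isolation. The function below is a hypothetical distillation, not the real get_filenames( context ); ftp_upload_dir and user_email stand in for trans.app.config.ftp_upload_dir and trans.user.email:

```python
import os

def get_filenames(ftp_files, ftp_upload_dir, user_email):
    # Hypothetical distillation of the logic around grouping.py line 322.
    # Testing truthiness instead of "is not None" skips the FTP branch when
    # ftp_files is an empty list, so ftp_upload_dir (which may be unset,
    # i.e. None) is never passed to os.path.join.
    filenames = []
    if ftp_files:  # was: if ftp_files is not None:
        user_ftp_dir = os.path.join(ftp_upload_dir, user_email)
        filenames = [os.path.join(user_ftp_dir, f) for f in ftp_files]
    return filenames
```

With the original "is not None" test, an empty list and an unconfigured ftp_upload_dir would reach os.path.join(None, ...) and raise the exception Alex saw.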
[galaxy-dev] FW: pass more information on a dataset merge
Thanks John, works fine. -Alex

-Original Message-
From: Burdett, Neil (ICT Centre, Herston - RBWH)
Sent: Tuesday, 4 December 2012 9:57 AM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Cc: Szul, Piotr (ICT Centre, Marsfield)
Subject: RE: [galaxy-dev] pass more information on a dataset merge

Thanks Alex, seems to work now, so I checked the code in to our repository. Neil

From: jmchil...@gmail.com [jmchil...@gmail.com] On Behalf Of John Chilton [chil0...@umn.edu]
Sent: Tuesday, December 04, 2012 4:26 AM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Cc: Burdett, Neil (ICT Centre, Herston - RBWH); Szul, Piotr (ICT Centre, Marsfield); galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] pass more information on a dataset merge

Hey Alex, Until I have bullied this stuff into galaxy-central, you should probably e-mail me directly and not the dev list. That said, thanks for the heads up; there was definitely a bug. I pushed out this changeset to the bitbucket repository:

https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/commits/d501e8a2e3fafca139f1187ee947ae425a75eb2c/raw/

I should mention that I have sort of abandoned the bitbucket repository for this work in favor of github, so that I can rebase as Galaxy changes and keep clean changesets.

https://github.com/jmchilton/galaxy-central/tree/multifiles

Since I am posting this on the mailing list I might as well post a little summary of what has been done:

- For each datatype, an implicit multiple-file version of that datatype is created. A new multiple upload tool/ftp directory tool has been implemented to create these.
- For any simple tool input you can choose a multiple-file version of that input instead, and then all outputs will become multiple-file versions of the outputs. Uses the task splitting stuff to distribute jobs across files.
- For multiple input tools, you can choose either multiple individual inputs (no change there) or a single composite version.
Consistent interface for file path, display name, extension, etc... in tool wrapper. - It should work with most existing tools and datatypes without change. - Everything enabled with a single option in universe.ini Upshots: - Makes workflows with arbitrary merging (and to a lesser extent branching) and arbitrary number of input files possible. - Original base name is saved throughout analysis (when possible), so sample/replicate/fraction/lane/etc tracking is easier. I started working on the metadata piece last night, once that is done I was planning on making a little demo video to post to this list to try to sell the 3 outstanding small pull requests related to this work and the massive one that would follow those up :). -John On Sun, Dec 2, 2012 at 8:52 PM, alex.khassa...@csiro.au wrote: Hi John, My colleague (Neil) has a bit of a problem with the multi file support: When I try and use the option Upload Directory of files I get the error below Error Traceback: View as: Interactive | Text | XML (full) ⇝ AttributeError: 'Bunch' object has no attribute 'multifiles' URL: http://140.253.78.218/library_common/upload_library_dataset Module weberror.evalexception.middleware:364 in respond view app_iter = self.application(environ, detect_start_response) Module paste.debug.prints:98 in __call__ view environ, self.app) Module paste.wsgilib:539 in intercept_output view app_iter = application(environ, replacement_start_response) Module paste.recursive:80 in __call__ view return self.application(environ, start_response) Module paste.httpexceptions:632 in __call__ view return self.application(environ, start_response) Module galaxy.web.framework.base:160 in __call__ view body = method( trans, **kwargs ) Module galaxy.web.controllers.library_common:855 in upload_library_dataset view **kwd ) Module galaxy.web.controllers.library_common:1055 in upload_dataset view json_file_path = upload_common.create_paramfile( trans, uploaded_datasets ) Module 
galaxy.tools.actions.upload_common:342 in create_paramfile view multifiles = uploaded_dataset.multifiles, AttributeError: 'Bunch' object has no attribute 'multifiles' Any ideas? Should we check if 'multifiles' attribute is set? Or some other call is missing which should set it to NULL if it's missing? -Alex -Original Message- From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton Sent: Wednesday, 17 October 2012 3:21 AM To: Khassapov, Alex (CSIRO IMT, Clayton) Subject: Re: [galaxy-dev] pass more information on a dataset merge Wow, thanks for the rapid feedback! I have made the changes you have suggested. It seems you must be interested in this idea/implementation. Let me know if you have specific use cases/requirements in mind and/or if you would be interested in write access to the
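The traceback quoted above fails on a missing attribute of a Bunch. A defensive sketch of the kind of fix Alex suggests, falling back to None when 'multifiles' was never set, using a minimal stand-in for Galaxy's Bunch class (the real one lives in galaxy.util):

```python
class Bunch(dict):
    """Minimal stand-in for Galaxy's Bunch: a dict with attribute access.

    Attribute lookup falls through to the dict keys; a missing key raises
    AttributeError, which is exactly what the traceback above shows."""
    def __getattr__(self, key):
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key)

# A library-upload dataset built by a code path that never set 'multifiles'.
uploaded_dataset = Bunch(name="example.dat")

# Defensive access: default to None instead of raising AttributeError.
multifiles = getattr(uploaded_dataset, "multifiles", None)
```

Using getattr with a default in create_paramfile would make the library-upload path tolerate Bunch objects that the multi-file code never touched.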
Re: [galaxy-dev] card 79: Split large jobs over multiple nodes for processing
Hi All, Can anybody please add a few words on how we can use the "initial implementation" which "exists in the tasks framework"? -Alex

From: Trello [mailto:do-not-re...@trello.com]
Sent: Wednesday, 6 February 2013 10:58 AM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Subject: 4 new notifications on the board Galaxy: Development since 5:56 PM (Tuesday)

Notifications on Galaxy: Development (https://trello.com/board/galaxy-development/506338ce32ae458f6d15e4b3):

- James Taylor added Dannon Baker to the card "79: Split large jobs over multiple nodes for processing" (https://trello.com/card/79-split-large-jobs-over-multiple-nodes-for-processing/506338ce32ae458f6d15e4b3/411).
- James Taylor commented on the card "79: Split large jobs over multiple nodes for processing": "An initial implementation exists in the tasks framework."
- James Taylor moved the card "79: Split large jobs over multiple nodes for processing" to Complete.
- James Taylor moved the card "137: allow multiple=true in input param fields of type data" (https://trello.com/card/137-allow-multiple-true-in-input-param-fields-of-type-data/506338ce32ae458f6d15e4b3/292) to Pull Requests / Patches.

___ Please keep all replies on the list by using reply all in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: http://lists.bx.psu.edu/
Re: [galaxy-dev] card 79: Split large jobs over multiple nodes for processing
Thanks Peter. I see: parallelism works on a single large file by splitting it and using multiple instances to process the pieces in parallel. In our case we use a 'composite' data type, essentially an array of input files, and we would like to process those files in parallel instead of having a 'foreach' loop in the tool wrapper. Is that possible? We are looking at CloudMan for creating a cluster in Galaxy now. -Alex

-Original Message-
From: Peter Cock [mailto:p.j.a.c...@googlemail.com]
Sent: Thursday, 7 February 2013 9:09 PM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] card 79: Split large jobs over multiple nodes for processing

On Wed, Feb 6, 2013 at 11:43 PM, alex.khassa...@csiro.au wrote: Hi All, Can anybody please add a few words on how we can use the "initial implementation" which "exists in the tasks framework"? -Alex

To enable this, set use_tasked_jobs = True in your universe_wsgi.ini file. The tools must also be configured to allow this via the parallelism tag. Many of my tools do this; for example, see the NCBI BLAST+ wrappers in the tool shed. Additionally, the data file formats must support being split, or being merged, which is done via Python code in the Galaxy datatype definition (see the split and merge methods in lib/galaxy/datatypes/*.py). Some other relevant Python code is in lib/galaxy/jobs/splitters/*.py. Peter
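Peter's recipe has two ingredients: set use_tasked_jobs = True in universe_wsgi.ini, and declare a parallelism tag in the tool XML. The attribute names below are a sketch modelled on the BLAST+ wrappers of that era; treat them as illustrative rather than authoritative:

```xml
<!-- Inside the <tool> element of a wrapper that supports splitting.
     Splits the "query" input into chunks of up to 1000 records and
     merges the parts of "output1" back into a single dataset. -->
<parallelism method="multi" split_inputs="query"
             split_mode="to_size" split_size="1000"
             merge_outputs="output1"></parallelism>
```

With both pieces in place, Galaxy's tasks framework runs one task per chunk and merges the results using the datatype's split/merge code.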
Re: [galaxy-dev] Composite datatype output for Cuffdiff
Hi Dannon, I understand that instead of having one dataset with multiple files you are planning to use existing datasets and combine them into a 'collection'. My concerns are:

1. Our data consists of 200-8000 files; can you imagine how many datasets we'll end up with? It will be a mess.
2. All these files in a dataset belong together, and it doesn't make much sense to keep them separately.
3. For performance reasons, all these files are located in a single directory, which makes it easier to iterate over them.
4. From my point of view, it makes perfect sense to have the concept of a dataset with multiple files; you already have a dataset_xxx_files folder anyway, and it's not a big change compared to the new concept of collections.
5. We are already using the m:xxx type datasets (thanks John) in our project. I guess you don't even have a timeframe for implementing the collection concept? I'm sure that for many projects using multi-file datasets is a requirement now, not in years' time.
6. Collections are also a good idea, and I guess the two can exist together, but only in the future; for now, multi-file datasets give current users an opportunity to use Galaxy for their needs. Otherwise we simply have to look at other frameworks which already support multi-file datasets.

-Alex

From: Dannon Baker [mailto:dannon.ba...@gmail.com]
Sent: Tuesday, 5 March 2013 1:09 AM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Cc: chil...@msi.umn.edu; galaxy-...@bx.psu.edu; NeCTAR Cloud Imaging Project Team
Subject: Re: [galaxy-dev] Composite datatype output for Cuffdiff

Alex, To reiterate what Jeremy has already said on the mailing list, this is definitely something we want, and need, for Galaxy. While this particular implementation has a lot of good parts, creating these collections as first-class composite datasets isn't ideal and we'd be stuck supporting them going forward, forever. There's a clear plan for implementing this in Trello (https://trello.com/c/325AXIEr), most of which is straightforward to implement.
The 'hard' part is really going to be implementing an ideal UI for dealing with these collections, something which we could do in phases. What exactly are your concerns with the implementation as set out in the Trello card? -Dannon

On Mon, Mar 4, 2013 at 1:32 AM, alex.khassa...@csiro.au wrote:

Yeah John, this is sad; I don't understand why it is such a problem. If it's already implemented and used in real projects like ours, then it is needed by the community. I don't think we have other options for our requirements; your multiple file datasets implementation was a real saviour for us. -Alex

-Original Message-
From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton
Sent: Monday, 4 March 2013 4:42 PM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Cc: galaxy-...@bx.psu.edu
Subject: Re: [galaxy-dev] Composite datatype output for Cuffdiff

Hi Alex, Thanks for the comments. The Galaxy team has made it clear, here and to me privately, that this will NOT be included in the Galaxy main code base. I hope and am confident that they will make grouping datasets work, hopefully even up to thousands of files. I do not believe the two ideas are mutually exclusive, and I will be maintaining a fork of galaxy-central with these additions; I will set this up this week, hopefully. I will do my best to respond to support requests and make multiple file datasets and composite types in general as robust as possible, keep up with Galaxy updates, etc. Obviously, it is risky to let a code base drift so far from galaxy main's, however, and you, me, and others who might want to use them will have to carefully weigh the risks when determining if multiple file datasets are worth the headache. Thanks for all your help and input. I am sorry this did not turn out differently; I feel I have really failed here.
-John

On Sun, Mar 3, 2013 at 10:08 PM, alex.khassa...@csiro.au wrote:

Hi John, Are you saying that a composite multiple-file dataset isn't required and won't be implemented? We are using your implementation of the multi-file dataset (m:xxx type) and hope that eventually it will be pushed into the main Galaxy implementation. As we are using Galaxy for CT reconstruction tools, where input and output can consist of a couple of thousand files, other options, i.e. grouping datasets, are not feasible. -Alex

-Original Message-
From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of John Chilton
Sent: Thursday, 28 February 2013 2:06 AM
To: Jeremy Goecks
Cc: Jim Johnson; galaxy-...@bx.psu.edu
Subject: Re: [galaxy-dev] Composite datatype output for Cuffdiff

Hey Jeremy, I am
Re: [galaxy-dev] Multi File upload api
Hi John, Can you please have a look at Neil's question? Thank you, -Alex

From: Burdett, Neil (ICT Centre, Herston - RBWH)
Sent: Thursday, 30 May 2013 4:30 PM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Subject: Multi File upload api

Hi Alex, The file galaxy-dist/scripts/api/example_watch_folder.py allows us to watch a folder and then upload files that arrive in a specified input folder into the database and history. Can you ask your friend (who implemented the multi file upload tool) what changes are necessary to this file so we can upload multiple files as we do from the GUI? I assume he would know quite quickly what to do, and hopefully it is quite simple. Thanks, Neil
[galaxy-dev] Creating workflow which includes Multifile upload
Hi John, One more problem with multifile upload: when I display a workflow which includes the multi upload tool, I get:

Module workflow_run_mako:476 in render_row_for_param
http://140.79.7.98/workflow/run?id=f597429621d6eb2b
__M_writer(unicode(param.get_label()))
AttributeError: 'UploadDataset' object has no attribute 'get_label'

OK, I see that the UploadDataset class is derived from Group, not ToolParameter. So I tried to add get_label() to the Group class, returning some string. But then I get:

Module workflow_run_mako:476 in render_row_for_param
http://140.79.7.98/workflow/run?id=f597429621d6eb2b
__M_writer(unicode(param.get_label()))
TypeError: 'str' object is not callable

Here my knowledge of Galaxy ends and I need some help, please. -Alex
Re: [galaxy-dev] Creating workflow which includes Multifile upload
Hi Peter, Of course I added def get_label(self); as a matter of fact, I copied get_label() from the ToolParameter class. That's why I'm a bit confused. The get_label function returns a string which is supposed to be displayed, but instead something is trying to call it?

Best Regards,
Alex Khassapov
Software Engineer
CSIRO IMT

From: Peter Cock [p.j.a.c...@googlemail.com]
Sent: Wednesday, 5 June 2013 7:36 PM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Cc: chil...@msi.umn.edu; galaxy-...@bx.psu.edu; NeCTAR Cloud Imaging Project Team
Subject: Re: [galaxy-dev] Creating workflow which includes Multifile upload

On Wed, Jun 5, 2013 at 8:56 AM, alex.khassa...@csiro.au wrote:

Hi John, One more problem with multifile upload: when I display a workflow which includes the multi upload tool, I get:

Module workflow_run_mako:476 in render_row_for_param
__M_writer(unicode(param.get_label()))
AttributeError: 'UploadDataset' object has no attribute 'get_label'

Ok, I see that the UploadDataset class is derived from Group, not ToolParameter. So I tried to add get_label() to the Group class, which returns some string. But then I get:

Module workflow_run_mako:476 in render_row_for_param
__M_writer(unicode(param.get_label()))
TypeError: 'str' object is not callable

Here my knowledge of Galaxy ends and I need some help please.

Hi Alex, I guess from the Python exception that you didn't create a method called get_label, but a property or attribute perhaps? Try this at the Python prompt (assigning a string first) and you'll get the same TypeError:

>>> hello = "Hello world"
>>> hello()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'str' object is not callable

I would have added a get_label method to the class using something like this:

class UploadDataset(...):
    def get_label(self):
        return "Uploaded stuff"

Peter
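Peter's diagnosis can be reproduced in isolation: a real method returns its string when called, while a string attribute shadowing the same name raises exactly the TypeError from Alex's traceback. A minimal sketch (class names hypothetical, the string borrowed from Peter's example):

```python
class WithMethod:
    # The fix Peter suggests: a real method, so param.get_label() works.
    def get_label(self):
        return "Uploaded stuff"

class WithAttribute:
    # The likely mistake: a plain string attribute under the same name.
    # Calling it fails, because a str instance is not callable.
    def __init__(self):
        self.get_label = "Uploaded stuff"

assert WithMethod().get_label() == "Uploaded stuff"

try:
    WithAttribute().get_label()
except TypeError as err:
    message = str(err)  # the message from Alex's traceback
```

So the thing to check is whether, somewhere in the Group subclass, get_label was bound as data rather than defined with def.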
[galaxy-dev] Appending _task_%d suffix to multi files
Hi guys, We've been using Galaxy for a year now; we created our own Galaxy fork where we were making changes to adapt Galaxy to our requirements. As we need multiple file datasets, we were initially using John's fork for that. Now we are trying to use the most updated version of the multiple file dataset stuff, https://bitbucket.org/msiappdev/galaxy-extras/, directly, as we don't want to maintain our own version. One of the problems we have: when we upload multiple files, their file names are changed (a _task_%d suffix is added to their names). On our branch we simply removed the code which does it, but now we wonder if it is possible to avoid this renaming somehow, i.e. make it configurable? Is it really necessary to change the file names? -Alex

-Original Message-
From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Jorrit Boekel
Sent: Thursday, 25 October 2012 8:35 PM
To: Peter Cock
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] the multi job splitter

I keep the files matched by keeping a _task_%d suffix to their names. So each task is matched with its correct counterpart with the same number. cheers, jorrit
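Jorrit's matching scheme pairs files by their numeric _task_%d suffix, so each task picks the counterpart file with the same number. A hypothetical sketch of that pairing (file names invented for illustration; the real fork's code may differ):

```python
import re

def task_number(filename):
    """Extract the number from a trailing _task_%d suffix, or None."""
    m = re.search(r"_task_(\d+)(?:\.[^.]+)?$", filename)
    return int(m.group(1)) if m else None

# Hypothetical target/decoy match files produced for two fractions.
targets = ["sample_task_0.xml", "sample_task_1.xml"]
decoys = ["decoy_task_1.xml", "decoy_task_0.xml"]

# Pair each target with the decoy carrying the same task number,
# regardless of the order the files arrive in.
by_number = {task_number(d): d for d in decoys}
pairs = [(t, by_number[task_number(t)]) for t in targets]
```

This is why the rename matters in Jorrit's fork: without a standardized suffix there is no reliable key to pair files from the same fraction, which is also why simply making it configurable needs some care.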
Re: [galaxy-dev] Appending _task_%d suffix to multi files
Hi Piotr, Regarding data parallelism: Galaxy can split a single large file into small parts and process them in parallel, then merge the outputs into a single file. That's not quite what we need, as we already have multiple input files. But as I understand it, there's a possibility to write our own splitters/mergers to fit our requirements. And yeah, Jorrit, enjoy your holidays! -Alex

From: Jorrit Boekel [mailto:jorrit.boe...@scilifelab.se]
Sent: Thursday, 1 August 2013 7:45 PM
To: Szul, Piotr (ICT Centre, Marsfield)
Cc: Khassapov, Alex (CSIRO IMT, Clayton); p.j.a.c...@googlemail.com; jmchil...@gmail.com; galaxy-dev@lists.bx.psu.edu; Burdett, Neil (ICT Centre, Herston - RBWH)
Subject: Re: Appending _task_%d suffix to multi files

Hi Piotr, In our proteomics lab, a protein sample is fractionated (by e.g. pH) before analysis into a number of sample fractions. The fractions are then run through the mass spectrometer one at a time. Each fraction yields a data file. The mass spec data is then matched to peptides by searching a FASTA file with protein sequences, termed the target. Afterwards the matches are statistically scored by machine learning. To do this, the data is also matched against a scrambled FASTA file, termed the decoy. Each fraction is matched to a target and a decoy file, which yields two match-files per fraction. The machine learning tool thus picks a target and a decoy matchfile and puts statistical significances on the matches. In order for this to be correct, it needs to pick matchfiles that correspond, i.e. that are derived from the same fraction. In our lab, we have not yet looked at John Chilton's (I think) work with the m: datasets, and our parallel processing is done inside Galaxy, using its split and merge functions to divide a job into tasks. Each task is sent as a separate job to SGE, I think, but others may know more about this than I.
I really have to get back to my holiday now, cheers, jorrit

On 08/01/2013 04:17 AM, piotr.s...@csiro.au wrote:

Hi Jorrit, Thank you for your explanation. Would you be able to give us an example of what you mean by fractions and when the _task_%d suffixes are being used to pick files? Just want to make sure we have a good understanding of the problem that you solved. Also, I vaguely remember seeing 'data parallelism' mentioned somewhere in relation to the m: datasets. Do you currently support in any way automatic distribution of the processing of such datasets to parallel environments (e.g. array jobs in SGE or such)? Cheers, - Piotr

From: Jorrit Boekel [mailto:jorrit.boe...@scilifelab.se]
Sent: Wednesday, July 31, 2013 8:18 PM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Cc: p.j.a.c...@googlemail.com; jmchil...@gmail.com; galaxy-dev@lists.bx.psu.edu; Szul, Piotr (ICT Centre, Marsfield); Burdett, Neil (ICT Centre, Herston - RBWH)
Subject: Re: Appending _task_%d suffix to multi files

Hi Alex, In our lab, files are often fractions of an experiment, but they are named by their creators in whatever way they like. I put that code in to standardize fraction naming, in case a tool needs input from two files that originate from the same fraction (but have been treated in different ways). In those cases, in my fork, Galaxy always picks the files with the same _task_%d numbers. I can't help you very much right now, as I'm currently away from work until October, but I hope this explains why it's in there. cheers, jorrit

On 07/31/2013 04:15 AM, alex.khassa...@csiro.au wrote:

Hi guys, We've been using Galaxy for a year now, we created our own Galaxy fork where we were making changes to adapt Galaxy to our requirements. As we need multiple file datasets, we were initially using John's fork for that.
Now we are trying to use the most updated version of the multiple file dataset stuff, https://bitbucket.org/msiappdev/galaxy-extras/, directly, as we don't want to maintain our own version. One of the problems we have: when we upload multiple files, their file names are changed (a _task_%d suffix is added to their names). On our branch we simply removed the code which does it, but now we wonder if it is possible to avoid this renaming somehow, i.e. make it configurable? Is it really necessary to change the file names? -Alex

-Original Message-
From: galaxy-dev-boun...@lists.bx.psu.edu [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Jorrit Boekel
Sent: Thursday, 25 October 2012 8:35 PM
To: Peter Cock
Cc: galaxy-dev@lists.bx.psu.edu
Subject: Re: [galaxy-dev] the multi job splitter

I keep the files matched by keeping a _task_%d suffix to their names. So each task is matched with its correct counterpart with the same number. cheers, jorrit