Re: [galaxy-dev] Dataset Collections status
On Fri, Aug 7, 2015 at 8:01 PM, Keith Suderman <suder...@cs.vassar.edu> wrote:
> Greetings,
>
> I started pulling Galaxy code from the dev branch a few months ago to take advantage of the (then just emerging) dataset collections feature. However, it is not clear to me from the latest release notes whether dataset collections are now fully merged into master, or if I should continue to use the dev branch to get the bleeding-edge code. I would like to move back to the master branch as soon as feasible.

It is an ongoing effort - but the master branch as of now contains essentially everything in the dev branch (https://github.com/galaxyproject/galaxy/tree/master). I need to put together some release notes for 15.07 before there can be an announcement of that, but there are a few new collection-related things in the release. In some senses collections have been fully usable for over a year - and in some senses there is a lot of work left to do. It kind of depends on what you are doing.

> When running workflows over dataset collections I will frequently see errors like:
>
>     /bin/sh: 1: /home/galaxy/galaxy_old/database/job_working_directory/001/1216/galaxy_1216.sh: Text file busy

I don't think this is related to collections per se - I think it is probably more a file system problem. Are you using a local job runner or a cluster manager? Is the file system mounted over a slow NFS connection?

> Which, from what I can tell, occurs when one process is trying to modify or delete a file that is open in another process. While the error seems to be repeatable, it also seems random, as the errors do not occur in the same places if I run the workflow multiple times. Given that I am working from the dev branch I don't want to open issues on features still in development. But if this is unexpected then I can do some more investigating and file a proper bug report.
I would say report bugs against dev always - maybe check the existing ones on GitHub and Trello first - but ideally we would like to catch bugs as early as possible, and we don't usually commit half-baked code to dev: it should be bug free (though maybe missing features).

Hope this helps,
-John

> Cheers,
> Keith
>
> --
> Research Associate
> Department of Computer Science
> Vassar College
> Poughkeepsie, NY

___
Please keep all replies on the list by using reply all in your mail client. To manage your subscriptions to this and other Galaxy lists, please use the interface at: https://lists.galaxyproject.org/ To search Galaxy mailing lists use the unified search at: http://galaxyproject.org/search/mailinglists/
Re: [galaxy-dev] Dataset Collections status
Hi John,

I will try what we have against master. I just went through my old emails, and it looks like the dev branch was recommended in response to a UI issue I experienced with large dataset collections, not with the collections themselves. For reference, the UI issue occurred when I inadvertently created 4K history items and jQuery kept timing out trying to update all the checkboxes being created.

The "Text file busy" error occurred on my development machine (OS X 10.9.5, Python 2.7.9) with no job runner, cluster manager, or NFS. I will run more tests and file proper bug reports for both issues if I can still recreate them.

Cheers,
Keith

On Aug 19, 2015, at 9:48 AM, John Chilton <jmchil...@gmail.com> wrote:
> [...]
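The race behind "Text file busy" (errno ETXTBSY on POSIX systems) is executing a script while some descriptor still has it open for writing. The following is only an illustrative sketch of the safe pattern, not Galaxy code (the file contents and names are made up): flush and fully close the script before executing it.

```python
import os
import subprocess
import tempfile

# ETXTBSY ("Text file busy") can occur when a file is exec()ed while
# another descriptor still has it open for writing. The safe pattern is
# to close the script completely before handing it to a job runner.
fd, path = tempfile.mkstemp(suffix=".sh")
try:
    with os.fdopen(fd, "w") as f:
        f.write("#!/bin/sh\necho ok\n")
    # The file is now flushed and closed; marking it executable and
    # running it cannot race against our own write handle.
    os.chmod(path, 0o755)
    out = subprocess.check_output([path]).decode().strip()
    print(out)
finally:
    os.unlink(path)
```

Over NFS the same error can appear even after a clean close, because of client-side caching and delayed writes, which is why John asks about the mount.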
[galaxy-dev] dataset collections
Hi all,

I'm toying around a little in galaxy-dist with the dataset collections feature. Since I know this is work in progress, I was wondering about some things I haven't really found online.

It seems to work really well to run a tool on a list of datasets, with a new job run for each list item. But when I want to reduce to a smaller number of list items, I understand I need to write some sort of merge tool myself, dependent on the data (all proteomics data here currently). This works well for reducing a collection to a single file, but I am not sure how to reduce to a new, smaller collection. In the tool I'm writing, I let the user choose the size of the output collection. Is there some way to tell Galaxy dynamically how many outputs to expect AND put them in a collection? Something like:

    <outputs>
        <output type="data_collection" amount_of_files="3" />
    </outputs>

where 3 is also set by the user in a param.

Also, when running with two or more lists as input, is there some sort of correlation between the lists? It seems like it takes the files in dataset-number order, so just checking.

By the way, thanks very much John and everyone else involved in collections for doing and pushing this stuff. If there are smaller issues I can help with, I'd be thrilled. I can't stress enough how much this feature means for Galaxy adoption in our lab and possibly our field.

cheers,
--
Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
SciLifeLab Stockholm, Sweden
Re: [galaxy-dev] dataset collections
On Wed, Jan 21, 2015 at 11:02 AM, Jorrit Boekel <jorrit.boe...@scilifelab.se> wrote:
> [...]
>
> This works well for reducing a collection to a single file, but I am not sure how to reduce to a new, smaller collection. In the tool I'm writing, I let the user choose the size of the output collection. Is there some way to tell Galaxy dynamically how many outputs to expect AND put them in a collection? Something like:
>
>     <outputs>
>         <output type="data_collection" amount_of_files="3" />
>     </outputs>
>
> where 3 is also set by the user in a param.

Through the January 2015 release this is not possible. It is now possible in central for tools to explicitly produce collections, and this functionality will be in the next release (I think it is still an open question whether the team is aiming for February or March 2015 for that). There are a lot of details and examples linked to in the following pull request, which I merged last week:

https://bitbucket.org/galaxy/galaxy-central/pull-request/634/allow-tools-to-explicitly-produce-dataset

There are three styles of output of increasing complexity: tools that produce static collections (pairs or lists of fixed size), N-to-N operations (like normalization), and finally fully dynamic collections. Since one can pre-determine the size, this would ideally fall under the first type, but I had not foreseen this use case, so there is currently no syntax like the one you described and you have to use the third, most complicated style. I am going to assume this is some sort of binning operation?
Let's imagine the user selects 3 bins and your tool creates a directory and populates it with mzML files:

    outputs/bin1.mzml
    outputs/bin2.mzml
    outputs/bin3.mzml

Then you could create an output collection like this:

    <outputs>
        <collection name="binned_output" type="list" label="Binned Outputs">
            <discover_datasets pattern="__name_and_ext__" directory="outputs" />
        </collection>
    </outputs>

After the job is complete, a dataset collection will be populated with three elements of type mzml and element identifiers bin1, bin2, and bin3 (inferred from the file names). The syntax of discover_datasets can be quite complex: you can variously hard-code properties like datatype, name, etc., or infer them from the names on the file system.

> Also, when running with two or more lists as input, is there some sort of correlation between the lists? It seems like it takes the files in dataset-number order, so just checking.

Yes, definitely. The UI for creating lists is pretty limited and needs to be updated to look a lot more like the UI for creating lists of paired datasets; I think this would then become a lot clearer. Lists are ordered data structures, and element identifiers are preserved across executions. So if you start with a list of raw files with identifiers sample1, sample2, and sample3 and map an operation like peak picking over them, you get a new list with the same order and identifiers (sample1, sample2, and sample3, in that order); if you then map an operation like peptide identification over the picked files, you again get a list with identifiers sample1, sample2, and sample3. Then if you have some sort of summary operation that takes in a raw file and an identification, and you pass it the original list and the result of the identifications, Galaxy should match everything up and create a resulting list with sample1, sample2, and sample3. (The API lets you do more complicated things like cross-products over subsets of inputs, but this isn't exposed in the GUI yet.)
If you have two ordered lists and the identifiers don't match up, Galaxy's behavior should be considered undefined, but it will assign the resulting list the identifiers from one or the other of the inputs.

> By the way, thanks very much John and everyone else involved in collections for doing and pushing this stuff.

Thanks - and it would be just a playground for me to write API tests against without all the excellent work Carl has put into the UI to make everything actually usable and useful.

> If there are smaller issues I can help with, I'd be thrilled.

The number one thing I encourage everyone to do to help is to build awesome tools and workflows and put them in the Tool Shed.

> Can't stress enough how much this feature means for Galaxy adoption in our lab and possibly field.

Shhh... don't tell them I am secretly wasting money they would like to put into building a platform for sequencing data analysis to address mass spec use cases - they will
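The matching behavior John describes can be pictured with a toy model. This is only a conceptual sketch, not Galaxy's implementation, and the sample and file names are invented: elements of two lists are paired by element identifier, and the result keeps those identifiers.

```python
# Conceptual sketch of matching two dataset lists by element
# identifier (NOT Galaxy internals; all names here are invented).
raw = {"sample1": "raw1.mzML", "sample2": "raw2.mzML", "sample3": "raw3.mzML"}
ids = {"sample1": "ids1.mzid", "sample3": "ids3.mzid", "sample2": "ids2.mzid"}

def summarize(raw_file, id_file):
    # Stand-in for a two-input summary tool.
    return "summary(%s, %s)" % (raw_file, id_file)

# Pair elements by identifier rather than by position; the resulting
# "list" carries the same identifiers as its inputs.
result = {name: summarize(raw[name], ids[name]) for name in raw}
print(sorted(result))
```

The point of the sketch is that sample2's raw file is always paired with sample2's identification file, regardless of the order in which the elements happen to be stored.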