Re: [galaxy-dev] Dataset Collections status

2015-08-19 Thread John Chilton
On Fri, Aug 7, 2015 at 8:01 PM, Keith Suderman suder...@cs.vassar.edu wrote:
 Greetings,

 I started pulling Galaxy code from the dev branch a few months ago to take 
 advantage of the (then just emerging) dataset collections feature.  However, 
 it is not clear to me from the latest release notes if the data collections 
 are now fully merged into master, or if I should continue to use the code in 
 the dev branch to take advantage of bleeding edge code.  I would like to move 
 back to the master branch as soon as feasible.

It is an ongoing effort - but the master branch as of now contains
essentially everything in the dev branch
https://github.com/galaxyproject/galaxy/tree/master. I need to put
together some release notes for 15.07 before there can be an
announcement of that but there are a few new collection-related things
in the release. In some senses though collections have been fully
usable for over a year - and in some senses there is a lot of work
left to do. Kind of depends on what you are doing.


 When running workflows over dataset collections I will frequently see errors 
 like:

 /bin/sh: 1: /home/galaxy/galaxy_old/database/job_working_directory/001/1216/galaxy_1216.sh: Text file busy

I don't think this is related to collections per se - I think it is
probably more a file system problem - are you using a local job runner
or a cluster manager? Is the file system mounted over a slow NFS
connection?


 Which, from what I can tell, occurs when one process is trying to 
 modify/delete a file open in another process.  While the error seems to be 
 repeatable, it also seems random as the errors do not occur in the same 
 places if I run the workflow multiple times.
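
For context, a minimal Python sketch of the failure mode described above: on Linux, trying to execute a file that some process still has open for writing fails with ETXTBSY ("Text file busy"). The helper names `write_executable_script` and `run_with_retry` are hypothetical and are not Galaxy's job script code; the sketch only illustrates closing the script before running it and retrying briefly if the error still shows up.

```python
import errno
import os
import stat
import subprocess
import time


def write_executable_script(path, body):
    """Write a shell script, then close it before anything tries to run it."""
    with open(path, "w") as handle:
        handle.write(body)
        handle.flush()
        os.fsync(handle.fileno())  # push the contents to disk
    # The write handle is closed here, so the file is no longer open for writing.
    os.chmod(path, os.stat(path).st_mode | stat.S_IXUSR)


def run_with_retry(path, attempts=5, delay=0.1):
    """Execute the script directly, retrying only on ETXTBSY (assumed policy)."""
    for attempt in range(attempts):
        try:
            return subprocess.run([path], capture_output=True, text=True, check=True)
        except OSError as exc:
            # Re-raise anything that is not "Text file busy", or the last failure.
            if exc.errno != errno.ETXTBSY or attempt == attempts - 1:
                raise
            time.sleep(delay)
```

If the error disappears once the writer is guaranteed to have closed the file, that would point at a race between writing and launching the script rather than at collections.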

 Given that I am working from the dev branch I don't want to open/raise issues 
 on features still in development.  But if this is unexpected then I can do 
 some more investigating and file a proper bug report.

I would say report bugs in dev always - maybe check the existing ones
on github and Trello first - but ideally we would like to catch bugs
as early as possible and we don't usually commit half-baked code to
dev - it should be bug free (though maybe missing features).

Hope this helps,
-John


 Cheers,
 Keith

 --
 Research Associate
 Department of Computer Science
 Vassar College
 Poughkeepsie, NY


 ___
 Please keep all replies on the list by using reply all
 in your mail client.  To manage your subscriptions to this
 and other Galaxy lists, please use the interface at:
   https://lists.galaxyproject.org/

 To search Galaxy mailing lists use the unified search at:
   http://galaxyproject.org/search/mailinglists/

Re: [galaxy-dev] Dataset Collections status

2015-08-19 Thread Suderman Keith
Hi John,

I will try what we have against master.  I just went through my old emails and 
it looks like the developer branch was recommended in response to a UI issue I 
experienced with large dataset collections and not the collections themselves.  
For reference, the UI issue occurred when I inadvertently created 4K history 
items and jQuery kept timing out trying to update all the checkboxes being 
created.

The Text file busy error occurred on my developer machine (OS X 10.9.5, 
Python 2.7.9) with no job runner, cluster manager, or NFS.  I will run more 
tests and file proper bug reports for both issues if I can still recreate them.

Cheers
Keith


[galaxy-dev] dataset collections

2015-01-21 Thread Jorrit Boekel
Hi all,

I’m toying around a little in galaxy-dist with the dataset collections feature. 
Since I know this is work in progress, I was wondering about some things I 
haven’t really found online.

It seems to work really well to run a tool on a list of datasets, and a new job 
is run for each list item. But when I want to reduce to a smaller amount of 
list items, I understand I need to write some sort of merge tool myself, 
dependent on the data (all proteomics data here currently). This works well for 
reducing a dataset to a single file, but I am not sure about how to reduce to a 
new smaller collection. In the tool I’m writing, I let the user choose the size 
of the collection.

Is there some way to tell galaxy dynamically how many outputs to expect AND put 
them in a collection? Something like:
<outputs>
  <output type="data_collection" amount_of_files="3"/>
</outputs>
Where 3 is set by the user in a param also.


Also, when running with two or more lists as input, is there some sort of 
correlation between the lists? It seems like it takes the files in dataset 
number order, so just checking.

By the way, thanks very much John and everyone else involved in collections for 
doing and pushing this stuff. If there are smaller issues I can help with, I’d 
be thrilled. Can’t stress enough how much this feature means for galaxy 
adoption in our lab and possibly field.

cheers,
— 
Jorrit Boekel
Proteomics systems developer
BILS / Lehtiö lab
Scilifelab Stockholm, Sweden




Re: [galaxy-dev] dataset collections

2015-01-21 Thread John Chilton
On Wed, Jan 21, 2015 at 11:02 AM, Jorrit Boekel
jorrit.boe...@scilifelab.se wrote:
 Hi all,

 I’m toying around a little in galaxy-dist with the dataset collections 
 feature. Since I know this is work in progress, I was wondering about some 
 things I haven’t really found online.

 It seems to work really well to run a tool on a list of datasets, and a new 
 job is run for each list item. But when I want to reduce to a smaller amount 
 of list items, I understand I need to write some sort of merge tool myself, 
 dependent on the data (all proteomics data here currently). This works well 
 for reducing a dataset to a single file, but I am not sure about how to 
 reduce to a new smaller collection. In the tool I’m writing, I let the user 
 choose the size of the collection.

 Is there some way to tell galaxy dynamically how many outputs to expect AND 
 put them in a collection? Something like:
 <outputs>
   <output type="data_collection" amount_of_files="3"/>
 </outputs>
 Where 3 is set by the user in a param also.

Through the January 2015 release - this is not possible. It is now
possible in central for tools to explicitly produce collections and
this functionality will be in the next release (I think it is still an
open question as to whether the team is aiming for February or March
2015 for that). There are a lot of details and examples linked to in
the following pull request that I merged last week:

https://bitbucket.org/galaxy/galaxy-central/pull-request/634/allow-tools-to-explicitly-produce-dataset

There are three styles of outputs of increasing complexity - tools that
produce static collections (pairs or lists of fixed size), N-N
operations (like normalization), and finally fully dynamic
collections. Since one can pre-determine the size this would ideally
fall under the first type, but I had not foreseen this use case so
there is currently no syntax like you described, and you have to use the
third, most complicated style.

I am going to assume this is some sort of binning operation? Let's
imagine - the user selects 3 bins and your tool creates a directory
and populates it with mzml files:

output/bin1.mzml
output/bin2.mzml
output/bin3.mzml

Then you could create an output collection like this:

<outputs>
  <collection name="binned_output" type="list" label="Binned Outputs">
    <discover_datasets pattern="__name_and_ext__" directory="output" />
  </collection>
</outputs>

After the job is complete a dataset collection will be populated with
three elements of type mzml and element identifiers bin1, bin2, and
bin3 (inferred from the name).

The syntax of the discover_datasets thing can be quite complex and you
can variously hard code properties like datatype, name, etc... or
infer them from the name on the file system.
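
To make the name-based inference concrete, here is a rough Python sketch of what a pattern like __name_and_ext__ would do with the files above. The regular expression and the function name `discover_elements` are assumptions made for illustration, not Galaxy's actual implementation: everything before the last dot is treated as the element identifier and everything after it as the datatype extension.

```python
import os
import re

# Assumed stand-in for the __name_and_ext__ pattern: text before the last
# dot is the element identifier, text after it is the extension.
NAME_AND_EXT = re.compile(r"^(?P<name>.+)\.(?P<ext>[^.]+)$")


def discover_elements(directory):
    """Return (identifier, ext, path) tuples for matching files, sorted by file name."""
    elements = []
    for filename in sorted(os.listdir(directory)):
        match = NAME_AND_EXT.match(filename)
        if match is None:
            continue  # no extension, so not collected in this sketch
        elements.append(
            (match.group("name"), match.group("ext"), os.path.join(directory, filename))
        )
    return elements
```

Run against a directory holding bin1.mzml, bin2.mzml, and bin3.mzml, this sketch yields the identifiers bin1, bin2, and bin3, each with extension mzml.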



 Also, when running with two or more lists as input, is there some sort of 
 correlation between the lists? It seems like it takes the files in dataset 
 number order, so just checking.

Yes definitely. The UI for creating lists is pretty limited and needs
to be updated to look a lot more like the UI for creating lists of
paired datasets and this would become a lot more clear I think. So
lists are ordered data structures and element identifiers are
preserved across executions.

So if you start with a list of raw files with identifiers sample1,
sample2, and sample3 and map an operation like peak picking over
them you would get a new list with the same order and identifiers
(sample1, sample2, and sample3 in that order), then if you map an
operation like peptide identification over the picked files you
would again get a list with identifiers (sample1, sample2, and sample3).
Then if you have some sort of summary operation that takes in a raw
file and an identification and you pass it the original list and the
result of the identifications - Galaxy should match everything up and
create a resulting list with sample1, sample2, and sample3.
(The API lets you do more complicated things like cross-products over
subsets of inputs - but this isn't exposed in the GUI yet).

If you have two ordered lists and identifiers don't match up -
Galaxy's behavior should be considered undefined but it will assign
the resulting list the identifiers from one or the other inputs.
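
The matching described above can be sketched in plain Python, modeling a list collection as an ordered mapping from element identifier to dataset. The function names `map_over` and `match_by_identifier` and the file names are hypothetical, and raising on mismatched identifiers is a choice made for this sketch only (Galaxy's behavior in that case is undefined, as noted above).

```python
from collections import OrderedDict


def map_over(collection, operation):
    """Apply operation to each element, keeping order and identifiers."""
    return OrderedDict(
        (identifier, operation(value)) for identifier, value in collection.items()
    )


def match_by_identifier(left, right):
    """Pair elements of two lists that share identifiers in the same order."""
    if list(left) != list(right):
        # Sketch-only choice: refuse mismatched inputs instead of guessing.
        raise ValueError("element identifiers do not match")
    return OrderedDict(
        (identifier, (left[identifier], right[identifier])) for identifier in left
    )


raw = OrderedDict(
    [("sample1", "s1.raw"), ("sample2", "s2.raw"), ("sample3", "s3.raw")]
)
picked = map_over(raw, lambda path: path.replace(".raw", ".picked"))
identified = map_over(picked, lambda path: path.replace(".picked", ".ids"))
summary_inputs = match_by_identifier(raw, identified)
```

Here summary_inputs keeps the identifiers sample1, sample2, and sample3 in that order, each paired with its raw file and its identification result.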


 By the way, thanks very much John and everyone else involved in collections 
 for doing and pushing this stuff.

Thanks - and it would be just a playground for me to write API tests
against without all the excellent work Carl has put into the UI to
make everything actually usable and useful.

 If there are smaller issues I can help with, I’d be thrilled.

The number one thing I encourage everyone to do to help is to build
awesome tools and workflows and put them in the tool shed.

 Can’t stress enough how much this feature means for galaxy adoption in our 
 lab and possibly field.

Shhh... don't tell them I am secretly wasting money they would like to
put into building a platform for sequencing data analysis to address
mass spec use cases - they will