Re: [galaxy-dev] pass more information on a dataset merge

2012-12-03 Thread John Chilton
Hey Alex,

Until I have bullied this stuff into galaxy-central, you should
probably e-mail me directly rather than the dev list. That said, thanks
for the heads up; that was definitely a bug. I pushed out this
changeset to the bitbucket repository:

https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/commits/d501e8a2e3fafca139f1187ee947ae425a75eb2c/raw/

I should mention that I have sort of abandoned the bitbucket
repository for this work in favor of github, so that I can rebase as
Galaxy changes and keep clean changesets.

https://github.com/jmchilton/galaxy-central/tree/multifiles

Since I am posting this on the mailing list I might as well post a
little summary of what has been done:

- For each datatype, an implicit multiple-file version of that
datatype is created. A new multiple upload tool/FTP directory tool has
been implemented to create these.
- For any simple tool input you can choose a multiple-file version of
that input instead, and then all outputs become multiple-file versions
as well. This uses the task splitting machinery to distribute jobs
across files.
- For multiple-input tools, you can choose either multiple individual
inputs (no change there) or a single composite version. The tool
wrapper sees a consistent interface for file path, display name,
extension, etc.
- It should work with most existing tools and datatypes without change.
- Everything enabled with a single option in universe.ini

Upshots:
  - Makes workflows with arbitrary merging (and, to a lesser extent,
branching) and an arbitrary number of input files possible.
  - Original base name is saved throughout analysis (when possible),
so sample/replicate/fraction/lane/etc tracking is easier.

I started working on the metadata piece last night; once that is done
I plan to make a little demo video to post to this list to try to sell
the three outstanding small pull requests related to this work and the
massive one that will follow them :).

-John


On Sun, Dec 2, 2012 at 8:52 PM,  alex.khassa...@csiro.au wrote:
 Hi John,

 My colleague (Neil) has a bit of a problem with the multi file support:

 When I try to use the "Upload directory of files" option I get the error below:

 Error Traceback:
 View as:   Interactive  |  Text  |  XML (full)
 ⇝ AttributeError: 'Bunch' object has no attribute 'multifiles'
 URL: http://140.253.78.218/library_common/upload_library_dataset
 Module weberror.evalexception.middleware:364 in respond view
  app_iter = self.application(environ, detect_start_response)
 Module paste.debug.prints:98 in __call__ view
  environ, self.app)
 Module paste.wsgilib:539 in intercept_output view
  app_iter = application(environ, replacement_start_response)
 Module paste.recursive:80 in __call__ view
  return self.application(environ, start_response)
 Module paste.httpexceptions:632 in __call__ view
  return self.application(environ, start_response)
 Module galaxy.web.framework.base:160 in __call__ view
  body = method( trans, **kwargs )
 Module galaxy.web.controllers.library_common:855 in upload_library_dataset view
  **kwd )
 Module galaxy.web.controllers.library_common:1055 in upload_dataset view
  json_file_path = upload_common.create_paramfile( trans, uploaded_datasets )
 Module galaxy.tools.actions.upload_common:342 in create_paramfile view
  multifiles = uploaded_dataset.multifiles,
 AttributeError: 'Bunch' object has no attribute 'multifiles'

 Any ideas? Should we check whether the 'multifiles' attribute is set? Or is some
 other call missing that should set it to None when it's absent?

 -Alex

Re: [galaxy-dev] pass more information on a dataset merge

2012-12-03 Thread Langhorst, Brad
John 

Yeah!

I'm glad you took the initiative to do this. It's one of the most requested 
features from local users here.
I'll happily test this in our environment.

Also - thanks for the github link - I think it's vastly superior to hg for
merging pull requests.
I can't figure out a nice way to do a simple pull request in hg without a
full-scale repo duplication.


Best wishes!


Brad



Re: [galaxy-dev] pass more information on a dataset merge

2012-12-02 Thread Alex.Khassapov
Hi John,

My colleague (Neil) has a bit of a problem with the multi file support:

When I try to use the "Upload directory of files" option I get the error below:

Error Traceback:
View as:   Interactive  |  Text  |  XML (full)
⇝ AttributeError: 'Bunch' object has no attribute 'multifiles'
URL: http://140.253.78.218/library_common/upload_library_dataset
Module weberror.evalexception.middleware:364 in respond view
  app_iter = self.application(environ, detect_start_response)
Module paste.debug.prints:98 in __call__ view
  environ, self.app)
Module paste.wsgilib:539 in intercept_output view
  app_iter = application(environ, replacement_start_response)
Module paste.recursive:80 in __call__ view
  return self.application(environ, start_response)
Module paste.httpexceptions:632 in __call__ view
  return self.application(environ, start_response)
Module galaxy.web.framework.base:160 in __call__ view
  body = method( trans, **kwargs )
Module galaxy.web.controllers.library_common:855 in upload_library_dataset view
  **kwd )
Module galaxy.web.controllers.library_common:1055 in upload_dataset view
  json_file_path = upload_common.create_paramfile( trans, uploaded_datasets )
Module galaxy.tools.actions.upload_common:342 in create_paramfile view
  multifiles = uploaded_dataset.multifiles,
AttributeError: 'Bunch' object has no attribute 'multifiles'

Any ideas? Should we check whether the 'multifiles' attribute is set? Or is some other
call missing that should set it to None when it's absent?
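
For what it's worth, the defensive check being suggested would look roughly like the
sketch below. This is only an illustration of the idea, not the actual changeset John
pushed, and the import path for Bunch is an assumption.

# Sketch: read 'multifiles' defensively in upload_common.create_paramfile,
# defaulting to None when the library-upload path never set it on the Bunch.
from galaxy.util.bunch import Bunch

uploaded_dataset = Bunch( file_name='fraction1.RAW' )   # no 'multifiles' attribute set
multifiles = getattr( uploaded_dataset, 'multifiles', None )
assert multifiles is None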

-Alex

-Original Message-
From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton
Sent: Wednesday, 17 October 2012 3:21 AM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Subject: Re: [galaxy-dev] pass more information on a dataset merge

Wow, thanks for the rapid feedback! I have made the changes you have suggested. 
It seems you must be interested in this idea/implementation. Let me know if you 
have specific use cases/requirements in mind and/or if you would be interested 
in write access to the repository.

-John


Re: [galaxy-dev] pass more information on a dataset merge

2012-10-22 Thread Alex.Khassapov
1) One more question:

My colleague likes the idea, but his composite dataset's dataset_id.dat file
contains only a plain list of uploaded files, not HTML like yours.

I was wondering if it is possible to somehow pass a parameter to
CompositeMultifile.regenerate_primary_file(dataset) to switch between HTML and
plain-list formats. I mean adding a 'hidden' parameter in the tool.xml file, but I'm
not sure how to get these tool parameters in the Galaxy source?
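
A hedged sketch of how that could work: declare something like
<param name="primary_file_style" type="hidden" value="plain" /> in the tool XML (the
parameter name is made up here), and read it back from param_dict in the hook file.
The hook signature below is the one Galaxy passes to code-file hooks as far as I recall;
treat it as an assumption rather than a reference.

# Sketch only: 'primary_file_style' is a hypothetical hidden tool parameter.
def exec_after_process( app, inp_data, out_data, param_dict, tool=None, stdout=None, stderr=None ):
    # read the hidden parameter from the tool's parameter dictionary
    style = param_dict.get( 'primary_file_style', 'html' )
    # 'style' could then be handed to regenerate_primary_file(), which would need
    # to be extended to accept it (today it only takes the dataset).
    for name, data in out_data.items():
        data.datatype.regenerate_primary_file( data.dataset )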

2) And one more question: I use your m:xxx format for the tool output; all
files are generated in the dataset_id_files folder, but the dataset_id.dat file
is empty. To force creation of the dataset_id.dat file I use the exec_after_process
hook (the tool's <code file="..." /> tag pointing at a .py file):
for key, val in out_data.items():
    try:
        if not hasattr( val.dataset, 'name' ):
            val.dataset.name = val.dataset.file_name
        val.datatype.regenerate_primary_file( val.dataset )
    except Exception as e:
        print "ERROR: " + str( e )

But it doesn't feel right; I wonder what the proper way is to use the m:xxx
format for the output?

-Alex

-Original Message-
From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton
Sent: Saturday, 20 October 2012 1:40 AM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Subject: Re: [galaxy-dev] pass more information on a dataset merge

Hey Alex,

I think the idea here is that your initially uploaded files would have
different names, but after Jorrit's tool split/merge step they will all just be
named after the dataset id (see screenshot), so you need the task_X suffix
so they don't all end up with the same name.

I have not thought a whole lot about the naming thing; in general it seems like
a tough problem and one that Galaxy itself doesn't do a particularly good job
at.

Jorrit have you given any thought to this?

I wonder if it would be feasible to use the initial uploaded name as a sort of 
prefix going forward. So if I upload say

fraction1.RAW
fraction2.RAW
fraction3.RAW

and run a conversion step, maybe I could get:

fraction1_dataset567.ms2
fraction2_dataset567.ms2
fraction3_dataset567.ms2

instead of

dataset567.dat_task_0
dataset567.dat_task_1
dataset567.dat_task_2

Jorrit, do you mind if I give implementing that a shot? It seems like it would
be a win to me. Am I going to hit some problem I don't see now (presumably
we have to send some data from the split to the merge, and that might be tricky)?
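
A trivial sketch of the naming scheme being proposed (purely illustrative; the
function below does not exist in either repository):

# Illustrative only: build a task output name that keeps the uploaded base name
# as a prefix, so replicates/fractions stay distinguishable after conversion.
import os

def task_output_name( original_input, output_dataset_id, new_ext ):
    base = os.path.splitext( os.path.basename( original_input ) )[ 0 ]
    return '%s_dataset%s.%s' % ( base, output_dataset_id, new_ext )

# task_output_name( 'fraction1.RAW', 567, 'ms2' ) -> 'fraction1_dataset567.ms2'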

-John

On Thu, Oct 18, 2012 at 7:00 PM,  alex.khassa...@csiro.au wrote:
 Thanks John,

 I wonder what's the reason for appending _task_XX to the file names; why
 can't we just keep the original file names?

 Alex


Re: [galaxy-dev] pass more information on a dataset merge

2012-10-19 Thread Alex.Khassapov
Hi John,

What I don't get: I specify the output format m:grd, and my tool generates
multiple output files in the dataset_id_files folder, but the dataset_id.dat file
is empty. I need to call regenerate_primary_file() to add the HTML with
the file list to the .dat file, but I'm not sure where.

-Alex

-Original Message-
From: jmchil...@gmail.com [mailto:jmchil...@gmail.com] On Behalf Of John Chilton
Sent: Friday, 19 October 2012 6:16 AM
To: Khassapov, Alex (CSIRO IMT, Clayton)
Subject: Re: [galaxy-dev] pass more information on a dataset merge

On Tue, Oct 16, 2012 at 11:11 PM,  alex.khassa...@csiro.au wrote:
 Hi John,

 I am definitely interested in this idea, and not only me - we are currently
 working on moving a few scientific tools (not related to genomics) into the cloud
 using Galaxy.

Great. My interests in Galaxy are mostly outside of genomics as well; it is
good to have more people utilizing Galaxy in this way because it will force the
platform to become more generic and address broader use cases.


 We will try it further and see if we need any changes. For now, one
 improvement would be nice: make dataset_id.dat contain a list of paths to the
 uploaded files, so that when the HTML page is displayed the user can
 just click on a link and download the file.


Code that attempted to do this was in there, but obviously didn't work. I have 
now fixed it up.

Thanks for beta testing.

-John

 We are pretty new to Galaxy, so our understanding of Galaxy is pretty limited.

 Thanks again,

 Alex



Re: [galaxy-dev] pass more information on a dataset merge

2012-10-16 Thread Jorrit Boekel
No objections whatsoever; I'm really happy more people are interested!
General fixes are definitely to be preferred over a
whatever-field-specific solution, if you ask me.


I am currently running a parallelism where I create symbolic links on
split, and move result files (as opposed to copying them) on merge. Faster
than copying back and forth, but it's limited to splitting to the
number of files in a set.
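
For anyone unfamiliar with that trick, a minimal illustration of the idea (not the
actual split/merge code from the fork): symlink the large inputs into per-task
directories on split, and move the results rather than copy them on merge.

# Minimal illustration only; directory layout and function names are made up.
import os
import shutil

def split( input_files, work_dir ):
    # one task directory per input; symlink instead of copying the large inputs
    task_dirs = []
    for i, path in enumerate( input_files ):
        task_dir = os.path.join( work_dir, 'task_%d' % i )
        os.makedirs( task_dir )
        os.symlink( os.path.abspath( path ), os.path.join( task_dir, os.path.basename( path ) ) )
        task_dirs.append( task_dir )
    return task_dirs

def merge( result_files, output_dir ):
    # move (rename) results into the composite output instead of copying them back
    for path in result_files:
        shutil.move( path, os.path.join( output_dir, os.path.basename( path ) ) )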


cheers,
jorrit


Re: [galaxy-dev] pass more information on a dataset merge

2012-10-15 Thread John Chilton
Here is an implementation of the implicit multi-file composite
datatypes piece of that idea. I think the implicit parallelism may be
harder.

https://bitbucket.org/galaxyp/galaxy-central-homogeneous-composite-datatypes/compare

Jorrit, do you have any objection to me trying to get this included in
galaxy-central (this is 95% code I stole from you)? I made the changes
against a clean galaxy-central fork and included nothing proteomics
specific in anticipation of trying to do that. I have talked with Jim
Johnson about the idea and he believes it would be useful for his mothur
metagenomics tools, so the idea is valuable outside of proteomics.

Galaxy team, would you be okay with including this, and if so, is there
anything you would like to see, either at a high level or at the level
of the actual implementation?

-John


John Chilton
Senior Software Developer
University of Minnesota Supercomputing Institute
Office: 612-625-0917
Cell: 612-226-9223
Bitbucket: https://bitbucket.org/jmchilton
Github: https://github.com/jmchilton
Web: http://jmchilton.net


Re: [galaxy-dev] pass more information on a dataset merge

2012-10-15 Thread Alex.Khassapov
Hi John,

I tried your galaxy-central-homogeneous-composite-datatypes implementation; it
works great, thank you (and Jorrit).

A couple of fixes:

1. Add multi_upload.xml to tool_conf.xml.
2. lib/galaxy/tools/parameters/grouping.py, line 322 (in get_filenames( context )):
   if ftp_files is not None:
   Remove the "is not None", as ftp_files can be empty ([]) but is not None; otherwise line 331,
   user_ftp_dir = os.path.join( trans.app.config.ftp_upload_dir, trans.user.email ),
   throws an exception if ftp_upload_dir isn't set.
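
In other words, the guard needs a truthiness test rather than an identity test, so an
empty list never reaches the FTP branch. A paraphrased sketch of the corrected check
(reconstructed from the lines cited above, not copied from the repository):

# lib/galaxy/tools/parameters/grouping.py, inside get_filenames( context ) -- sketch
ftp_files = context.get( 'ftp_files', [] )
if ftp_files:  # was: if ftp_files is not None: -- [] must skip the FTP branch
    # only reached when FTP files were actually selected, so ftp_upload_dir is configured
    user_ftp_dir = os.path.join( trans.app.config.ftp_upload_dir, trans.user.email )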

Alex


Re: [galaxy-dev] pass more information on a dataset merge

2012-10-08 Thread John Chilton
Jim Johnson and I have been discussing that approach to handling
fractionated proteomics samples as well (composite datatypes, not the
specifics of the interface for parallelizing).

My perspective has been that Galaxy should be augmented with better
native mechanisms for grouping objects in histories, operating over
those groups, building workflows that involve arbitrary numbers of
inputs, etc. Composite data types are kind of a kludge; I think they
are more useful for grouping HTML files together when you don't care
about operating on the constituent parts and just want to view pages
as a report or something. With this proteomic data we are working
with, the individual pieces are really interesting, right? You want to
operate on the individual pieces with the full array of tools (not
just these special tools that have the logic for dealing with the
composite datatypes), you want to visualize the files, etc. Putting
these component pieces in the composite datatype's extra_files_path
really limits what you can do with the pieces in Galaxy.

I have a vague idea of something that I think could bridge some of the
gaps between the approaches (though I have no clue on the
feasibility). Looking through your implementation on bitbucket it
looks like you are defining your core datatypes (MS2, CruxSequest) as
subclasses of this composite data type (CompositeMultifile). My
recommendation would be to try to define plain datatypes for these
core datatypes (MS2, CruxSequest) and then have the separate composite
datatype sort of delegate to the plain datatypes.

You could then continue to explicitly declare subclasses of the
composite datatype (maybe MS2Set, CruxSequestSet), but also maybe
augment the tool xml so you can do implicit data type instances the
way you can with tabular data for instance (instead of defining
columns you would define the datatype to delegate to).
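
To make the delegation idea concrete, a rough sketch is below. Everything here is
illustrative: MS2 and MS2Set follow the names in this thread, CompositeMultifile is the
class from Jorrit's fork (whose real API may differ), and the part_datatype hook is
invented for the example.

# Illustrative sketch only -- not code from either repository.
from galaxy.datatypes.data import Text

class MS2( Text ):
    # plain, single-file datatype that ordinary tools can consume directly
    file_ext = 'ms2'

class CompositeMultifile( Text ):
    # composite container; 'part_datatype' is a hypothetical hook naming the plain
    # datatype that per-file behaviour (sniffing, display, etc.) delegates to
    composite_type = 'auto_primary_file'
    part_datatype = None

class MS2Set( CompositeMultifile ):
    # explicitly declared set-of-MS2 datatype
    file_ext = 'ms2set'
    part_datatype = MS2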

The next step would be to make the parallelism implicit (i.e. pull it
out of the tool wrapper). Your tool wrappers wouldn't reference the
composite datatypes, they would reference the simple datatypes, but
you could add a little icon next to any input that let you replace a
single input with a composite input for that type. It would be kind of
like the run workflow page, where you can replace an input with
multiple inputs. If a composite input (or inputs) is selected, the
tool would then produce composite outputs.

For the steps that actually combine multiple inputs, I think in your
case this is Percolator maybe (a tool like InterProphet or Scaffold
that merges peptide probabilities across runs and groups proteins);
then you could have the same sort of implicit replacement, but for
multi-inputs instead of single inputs (assuming the
Galaxy powers that be accept my fixes for multi-input tool parameters:
https://bitbucket.org/galaxy/galaxy-central/pull-request/76/multi-input-data-tool-parameter-fixes).

The upshot of all of that would be that, even if these composite
datatypes aren't used widely, other people could still use your
proteomics tools (my users are definitely interested in Crux for
instance) and you could then use other developers' proteomic tools
with your composite datatypes even though they weren't designed with
that use case in mind (I have msconvert, myrimatch, idpicker,
proteinpilot, Ira Cooke has X! Tandem, OMSSA, TPP, and NBIC has an
entire suite of label-free quant tools). A third benefit would be that
people working in other -omicses could make use of the homogeneous
composite datatype implementation without needing to rewrite their
wrappers and datatypes.

There is probably something that I am missing that makes this very
difficult; let me know if you think this is a good idea and what its
feasibility might be. I forked your repo and set off to try to
implement some of this stuff last week, and I ended up with my Galaxy
pull requests to improve batching workflows and multi-input tool
parameters instead, but I hope to eventually get around to it.

-John


John Chilton
Senior Software Developer
University of Minnesota Supercomputing Institute
Office: 612-625-0917
Cell: 612-226-9223
Bitbucket: https://bitbucket.org/jmchilton
Github: https://github.com/jmchilton
Web: http://jmchilton.net


[galaxy-dev] pass more information on a dataset merge

2012-10-01 Thread Jorrit Boekel

Dear list,

I thought I was working with fairly large datasets, but they have
recently started to include ~2 GB files in sets of 50. I have run these
sorts of things before as merged data by using tar to roll them up into one
set, but when dealing with 100 GB tarfiles, Galaxy on EC2 seems to get
very slow, although that's probably because of my implementation of
dataset type detection (untar and read through the files).


Since tarring/untarring isn't very clean, I want to switch from tarring
to creating composite files on merge by putting a tool's results into
the dataset.extra_files_path. This doesn't seem to be supported yet,
because in do_merge we currently pass only the output dataset's file name to the
respective datatype's merge method.


I would like to pass more data to the merge method (let's say the whole 
dataset object) to be able to get the composite files directory and 
'merge' the files in there. Good idea, bad idea? If anyone has views on 
this, I'd love to hear them.
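
Concretely, the change being floated would look something like the sketch below. The
merge( split_files, output_file ) signature is how the stock datatypes define it as far
as I know; the extra output_dataset argument and everything in the body are the
proposal, not existing Galaxy code.

# Sketch of the proposed change only -- not existing Galaxy code.
import os
import shutil

from galaxy.datatypes.data import Text

class CompositeMultifile( Text ):
    @staticmethod
    def merge( split_files, output_file, output_dataset=None ):
        # With the dataset object available, results can be moved into its
        # extra_files_path instead of being concatenated into output_file.
        if output_dataset is None:
            raise Exception( 'merge() needs the output dataset to locate extra_files_path' )
        extra_dir = output_dataset.extra_files_path
        if not os.path.exists( extra_dir ):
            os.makedirs( extra_dir )
        for path in split_files:
            shutil.move( path, os.path.join( extra_dir, os.path.basename( path ) ) )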


cheers,
jorrit

___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

 http://lists.bx.psu.edu/