Re: [galaxy-dev] using Galaxy for map/reduce

2011-08-26 Thread Edward Kirton
yes, many tools don't read from stdin, you're right.  in practice, i
actually have each task write its part to the node's local scratch
disk and also do implicit conversions in this step (e.g. scatter
fastq as fasta).  but not all clusters have a local scratch disk.
also, as you mentioned, the seek solution wouldn't work for compressed infiles.

as i try to avoid working on the galaxy internals, i implemented this
as a command-line utility, e.g.:

  psub --fastqToFasta $infile --cat $outfile qctool.py $infile $outfile

instead of the nonparallel:

  qctool.py $infile $outfile

but it would be nice to see this functionality in galaxy.  i thought
about reimplementing this as a drmaa_epc.py job runner but noticed
there was already tasks.py.
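
for concreteness, here is a minimal python sketch of the scatter/convert/gather
pattern described above (split fastq into fasta parts on scratch, run the tool
per part, cat the outputs).  the helper names, the INFILE/OUTFILE placeholders,
and the serial execution are assumptions for illustration only, not the actual
psub code:

#!/usr/bin/env python
"""Illustrative sketch of a scatter/convert/gather wrapper in the spirit of the
psub example above; names and behavior are assumptions, not the real tool."""
import os
import subprocess
import sys
import tempfile


def fastq_records(handle):
    """Yield (name, sequence) pairs from a FASTQ stream (4 lines per record)."""
    while True:
        header = handle.readline()
        if not header:
            return
        seq = handle.readline().rstrip()
        handle.readline()  # '+' separator line
        handle.readline()  # quality line
        yield header[1:].rstrip(), seq


def scatter_fastq_as_fasta(infile, nchunks, scratch):
    """Split a FASTQ input into nchunks FASTA parts on the node's scratch disk,
    doing the fastq-to-fasta conversion implicitly during the split."""
    paths = [os.path.join(scratch, "part%03d.fasta" % i) for i in range(nchunks)]
    handles = [open(p, "w") for p in paths]
    with open(infile) as src:
        for i, (name, seq) in enumerate(fastq_records(src)):
            handles[i % nchunks].write(">%s\n%s\n" % (name, seq))
    for h in handles:
        h.close()
    return paths


def run_and_gather(cmd_template, parts, outfile):
    """Run the wrapped tool once per part, then concatenate the part outputs.
    A real wrapper would submit the per-part commands to the cluster instead
    of running them serially."""
    with open(outfile, "w") as merged:
        for part in parts:
            part_out = part + ".out"
            cmd = [arg.replace("INFILE", part).replace("OUTFILE", part_out)
                   for arg in cmd_template]
            subprocess.check_call(cmd)
            with open(part_out) as po:
                merged.write(po.read())


if __name__ == "__main__":
    # Hypothetical usage: scatter_wrap.py reads.fastq qc.out qctool.py INFILE OUTFILE
    infile, outfile = sys.argv[1], sys.argv[2]
    parts = scatter_fastq_as_fasta(infile, nchunks=8,
                                   scratch=tempfile.mkdtemp(prefix="scatter_"))
    run_and_gather(sys.argv[3:], parts, outfile)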


On Fri, Aug 26, 2011 at 12:41 PM, Duddy, John  wrote:
> Many of the tools out there work on files, and assume they are supposed to 
> work on the whole file (or take arguments for subsets that vary from tool to 
> tool).
>
> I'm working on a way for Galaxy to handle all these tools transparently, even 
> if, as in my case, the files are compressed but the tools cannot read 
> compressed files.
>
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jdu...@illumina.com
>
>
> -Original Message-
> From: Edward Kirton [mailto:eskir...@lbl.gov]
> Sent: Friday, August 26, 2011 12:34 PM
> To: Duddy, John
> Cc: galaxy-...@bx.psu.edu
> Subject: Re: [galaxy-dev] using Galaxy for map/reduce
>
> Not intending to hijack the thread, but in response to John's comment
> -- I, too, made a general solution for embarrassingly parallel problems
> but instead of splitting the large files on disk, I just use seek to
> move the file pointer so each task can grab its part.
>


Re: [galaxy-dev] using Galaxy for map/reduce

2011-08-26 Thread Duddy, John
Many of the tools out there work on files, and assume they are supposed to work 
on the whole file (or take arguments for subsets that vary from tool to tool).

I'm working on a way for Galaxy to handle all these tools transparently, even 
if, as in my case, the files are compressed but the tools cannot read 
compressed files.

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-Original Message-
From: Edward Kirton [mailto:eskir...@lbl.gov] 
Sent: Friday, August 26, 2011 12:34 PM
To: Duddy, John
Cc: galaxy-...@bx.psu.edu
Subject: Re: [galaxy-dev] using Galaxy for map/reduce

Not intending to hijack the thread, but in response to John's comment
-- I, too, made a general solution for embarrassingly parallel problems
but instead of splitting the large files on disk, I just use seek to
move the file pointer so each task can grab its part.

On Tue, Aug 2, 2011 at 10:54 AM, Duddy, John  wrote:
> I did something similar, but implemented as an evolution of the original 
> "basic" parallelism (see BWA), that:
> - Moved the splitting of input files into the datatype classes
> - Allowed any number of inputs to be split, as long as they were the same 
> datatype (so they were mutually consistent - think paired end fastq files)
> - Allowed other inputs to be shared among jobs
> - Merged any number of outputs, with the merge code implemented in the datatype 
> classes
>
> This worked functionally, but the IO required to split large files has proved 
> too much for something like a whole genome (~500GB).
>
> I was thinking of something philosophically similar to your dataset container 
> idea, but more in the idea that a dataset is no longer a "file", so the jobs 
> running on subsets of the dataset would just ask for the parts they need. 
> Galaxy would take care of preserving the abstraction that the subset of the 
> dataset is a single input file, perhaps by extracting the subset to a 
> temporary file on local storage. Similarly, the merged outputs would just be 
> held in the target dataset, not copied, thus making the IO cost for the 
> "merge" 0 for the simple case where it is mere concatenation.
>
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jdu...@illumina.com
>
>
> -Original Message-
> From: galaxy-dev-boun...@lists.bx.psu.edu 
> [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Andrew Straw
> Sent: Tuesday, August 02, 2011 7:13 AM
> To: galaxy-...@bx.psu.edu
> Subject: [galaxy-dev] using Galaxy for map/reduce
>
> Hi all,
>
> I've been investigating use of Galaxy for our lab and it has many
> attractive aspects -- a big thank you to all involved.
>
> We still have a couple of related sticking points, however, that I would
> like to get the Galaxy developers' feedback on. Basically, I want to use
> Galaxy to run Map/Reduce type analysis on many initial data files. What
> I mean is that I want to take many initial datasets (e.g. 250 or more),
> perhaps already stored in a library, and then apply a workflow to each
> and every one of them (the Map step). Then, on the many result datasets
> (one from each of the initial datasets), I want to run a Reduce step
> which creates a single dataset. I have achieved this in an imperfect and
> not-quite-working way with a few tricks, but I hope that with a little
> work, Galaxy could be much better for this type of use case.
>
> I have a couple of specific problems and a proposal for a general solution:
>
> 1) My first specific problem is that loading many datasets (e.g. 250)
> into history causes the javascript running locally within a browser to
> be extremely slow.
>
> 2) My second specific problem is that applying a workflow with N steps
> to many datasets creates even more datasets (Nx250 additional datasets).
> In addition to the slow Javascript problem, there seems to be other
> issues I haven't diagnosed further, but the console in which I'm running
> run.sh indicates many errors of the type "Exception AssertionError:
> AssertionError('State <... object at 0x7f5c18c47990> is not present in this
> identity map',) in <...> ignored". Furthermore the webserver gets slow and my
> nginx frontend proxy gives 504 gateway time-outs.
>
> 3) There's no good way to do reduce within Galaxy. Currently I work
> around this by having a tool type which takes as an input a dataset and
> then uploads this to a self-written webserver, which then collects such
> uploads, performs the reduce, and offers a download link for the user to
> collect the reduced dataset.

Re: [galaxy-dev] using Galaxy for map/reduce

2011-08-26 Thread Edward Kirton
Not intending to hijack the thread, but in response to John's comment
-- I, too, made a general solution for embarrassingly parallel problems
but instead of splitting the large files on disk, I just use seek to
move the file pointer so each task can grab its part.
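
For readers who want the seek idea spelled out, here is a minimal Python sketch,
assuming an uncompressed, line-oriented (FASTA-style) input; the byte-range math
and the boundary handling are illustrative, not the actual implementation:

import os


def byte_ranges(path, ntasks):
    """Divide a file into ntasks (start, end) byte ranges of roughly equal size."""
    size = os.path.getsize(path)
    step = -(-size // ntasks)  # ceiling division
    return [(i * step, min((i + 1) * step, size)) for i in range(ntasks)]


def fasta_records_in_range(path, start, end):
    """Yield the FASTA records whose header line starts inside [start, end).

    Each task seeks to its own start offset, skips forward to the next record
    boundary, and stops at the first record belonging to the next task; this
    only works on uncompressed files, as noted above.
    """
    with open(path) as fh:
        if start != 0:
            fh.seek(start - 1)
            if fh.read(1) != "\n":
                fh.readline()  # landed mid-line; that line is the previous task's
        header, seq = None, []
        while True:
            pos = fh.tell()
            line = fh.readline()
            if not line:
                break
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                if pos >= end:
                    return  # this record starts in the next task's range
                header, seq = line[1:].rstrip(), []
            elif header is not None:
                seq.append(line.strip())
        if header is not None:
            yield header, "".join(seq)


# Hypothetical usage for task number task_id out of 8:
# start, end = byte_ranges("reads.fasta", 8)[task_id]
# for name, seq in fasta_records_in_range("reads.fasta", start, end):
#     process(name, seq)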

On Tue, Aug 2, 2011 at 10:54 AM, Duddy, John  wrote:
> I did something similar, but implemented as an evolution of the original 
> "basic" parallelism (see BWA), that:
> - Moved the splitting of input files into the datatype classes
> - Allowed any number of inputs to be split, as long as they were the same 
> datatype (so they were mutually consistent - think paired end fastq files)
> - Allowed other inputs to be shared among jobs
> - Merged any number of outputs, with the merge code implemented in the datatype 
> classes
>
> This worked functionally, but the IO required to split large files has proved 
> too much for something like a whole genome (~500GB).
>
> I was thinking of something philosophically similar to your dataset container 
> idea, but more in the idea that a dataset is no longer a "file", so the jobs 
> running on subsets of the dataset would just ask for the parts they need. 
> Galaxy would take care of preserving the abstraction that the subset of the 
> dataset is a single input file, perhaps by extracting the subset to a 
> temporary file on local storage. Similarly, the merged outputs would just be 
> held in the target dataset, not copied, thus making the IO cost for the 
> "merge" 0 for the simple case where it is mere concatenation.
>
> John Duddy
> Sr. Staff Software Engineer
> Illumina, Inc.
> 9885 Towne Centre Drive
> San Diego, CA 92121
> Tel: 858-736-3584
> E-mail: jdu...@illumina.com
>
>
> -Original Message-
> From: galaxy-dev-boun...@lists.bx.psu.edu 
> [mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Andrew Straw
> Sent: Tuesday, August 02, 2011 7:13 AM
> To: galaxy-...@bx.psu.edu
> Subject: [galaxy-dev] using Galaxy for map/reduce
>
> Hi all,
>
> I've been investigating use of Galaxy for our lab and it has many
> attractive aspects -- a big thank you to all involved.
>
> We still have a couple of related sticking points, however, that I would
> like to get the Galaxy developers' feedback on. Basically, I want to use
> Galaxy to run Map/Reduce type analysis on many initial data files. What
> I mean is that I want to take many initial datasets (e.g. 250 or more),
> perhaps already stored in a library, and then apply a workflow to each
> and every one of them (the Map step). Then, on the many result datasets
> (one from each of the initial datasets), I want to run a Reduce step
> which creates a single dataset. I have achieved this in an imperfect and
> not-quite-working way with a few tricks, but I hope that with a little
> work, Galaxy could be much better for this type of use case.
>
> I have a couple of specific problems and a proposal for a general solution:
>
> 1) My first specific problem is that loading many datasets (e.g. 250)
> into history causes the javascript running locally within a browser to
> be extremely slow.
>
> 2) My second specific problem is that applying a workflow with N steps
> to many datasets creates even more datasets (Nx250 additional datasets).
> In addition to the slow Javascript problem, there seems to be other
> issues I haven't diagnosed further, but the console in which I'm running
> run.sh indicates many errors of the type "Exception AssertionError:
> AssertionError('State <... object at 0x7f5c18c47990> is not present in this
> identity map',) in <...> ignored". Furthermore the webserver gets slow and my
> nginx frontend proxy gives 504 gateway time-outs.
>
> 3) There's no good way to do reduce within Galaxy. Currently I work
> around this by having a tool type which takes as an input a dataset and
> then uploads this to a self-written webserver, which then collects such
> uploads, performs the reduce, and offers a download link for the user to
> collect the reduced dataset. The user must manually then upload this
> dataset back into Galaxy for further processing.
>
> My proposal for a general solution, and what I'd be interested in
> feedback on, is an idea of a "dataset container" (this is just a working
> name). It would look and act much like a dataset in the history, but
> would in fact be a logical construct that merely bundles together a
> homogeneous bunch of datasets. When a tool (or a workflow) is applied to
> a dataset container, Galaxy would automatically create a new container
> in which each dataset in this new container is the result of running the
> tool. (Workflows with N steps would thus generate N new containers.)

Re: [galaxy-dev] using Galaxy for map/reduce

2011-08-02 Thread Duddy, John
I did something similar, but implemented as an evolution of the original 
"basic" parallelism (see BWA), that:
- Moved the splitting of input files into the datatype classes
- Allowed any number of inputs to be split, as long as they were the same 
datatype (so they were mutually consistent - think paired end fastq files)
- Allowed other inputs to be shared among jobs
- Merged any number of outputs, with the merge code implemented in the datatype 
classes

This worked functionally, but the IO required to split large files has proved 
too much for something like a whole genome (~500GB).

I was thinking of something philosophically similar to your dataset container 
idea, but more in the idea that a dataset is no longer a "file", so the jobs 
running on subsets of the dataset would just ask for the parts they need. 
Galaxy would take care of preserving the abstraction that the subset of the 
dataset is a single input file, perhaps by extracting the subset to a temporary 
file on local storage. Similarly, the merged outputs would just be held in the 
target dataset, not copied, thus making the IO cost for the "merge" 0 for the 
simple case where it is mere concatenation.
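
The splitting/merging code itself isn't shown in this thread, but the shape
being described - split and merge logic owned by the datatype class - might
look roughly like the sketch below (method names and signatures are
assumptions for illustration, not Galaxy's real API):

import os
import shutil


class FastqDatatype(object):
    """Sketch of a datatype that owns its own split/merge logic; names and
    signatures are assumptions, not Galaxy's API."""

    @classmethod
    def split(cls, input_paths, parts_per_input, workdir):
        """Split each input into parts_per_input chunks of whole FASTQ records.
        Paired files are split with the same record-index scheme, so chunk i of
        one file stays consistent with chunk i of its mate."""
        all_chunks = []
        for path in input_paths:
            out_paths = [os.path.join(workdir, "%s.%03d" % (os.path.basename(path), i))
                         for i in range(parts_per_input)]
            outs = [open(p, "w") for p in out_paths]
            with open(path) as src:
                record, index = [], 0
                for line in src:
                    record.append(line)
                    if len(record) == 4:  # one complete FASTQ record
                        outs[index % parts_per_input].writelines(record)
                        record, index = [], index + 1
            for out in outs:
                out.close()
            all_chunks.append(out_paths)
        return all_chunks

    @classmethod
    def merge(cls, part_outputs, final_path):
        """Merge per-task outputs; for FASTQ this is plain concatenation, which
        is the cheap case mentioned above."""
        with open(final_path, "wb") as dst:
            for part in part_outputs:
                with open(part, "rb") as src:
                    shutil.copyfileobj(src, dst)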

John Duddy
Sr. Staff Software Engineer
Illumina, Inc.
9885 Towne Centre Drive
San Diego, CA 92121
Tel: 858-736-3584
E-mail: jdu...@illumina.com


-Original Message-
From: galaxy-dev-boun...@lists.bx.psu.edu 
[mailto:galaxy-dev-boun...@lists.bx.psu.edu] On Behalf Of Andrew Straw
Sent: Tuesday, August 02, 2011 7:13 AM
To: galaxy-...@bx.psu.edu
Subject: [galaxy-dev] using Galaxy for map/reduce

Hi all,

I've been investigating use of Galaxy for our lab and it has many
attractive aspects -- a big thank you to all involved.

We still have a couple of related sticking points, however, that I would
like to get the Galaxy developers' feedback on. Basically, I want to use
Galaxy to run Map/Reduce type analysis on many initial data files. What
I mean is that I want to take many initial datasets (e.g. 250 or more),
perhaps already stored in a library, and then apply a workflow to each
and every one of them (the Map step). Then, on the many result datasets
(one from each of the initial datasets), I want to run a Reduce step
which creates a single dataset. I have achieved this in an imperfect and
not-quite-working way with a few tricks, but I hope that with a little
work, Galaxy could be much better for this type of use case.

I have a couple of specific problems and a proposal for a general solution:

1) My first specific problem is that loading many datasets (e.g. 250)
into history causes the javascript running locally within a browser to
be extremely slow.

2) My second specific problem is that applying a workflow with N steps
to many datasets creates even more datasets (Nx250 additional datasets).
In addition to the slow Javascript problem, there seems to be other
issues I haven't diagnosed further, but the console in which I'm running
run.sh indicates many errors of the type "Exception AssertionError:
AssertionError('State <... object at 0x7f5c18c47990> is not present in this
identity map',) in <...> ignored". Furthermore the webserver gets slow and my
nginx frontend proxy gives 504 gateway time-outs.

3) There's no good way to do reduce within Galaxy. Currently I work
around this by having a tool type which takes as an input a dataset and
then uploads this to a self-written webserver, which then collects such
uploads, performs the reduce, and offers a download link for the user to
collect the reduced dataset. The user must manually then upload this
dataset back into Galaxy for further processing.

My proposal for a general solution, and what I'd be interested in
feedback on, is an idea of a "dataset container" (this is just a working
name). It would look and act much like a dataset in the history, but
would in fact be a logical construct that merely bundles together a
homogeneous bunch of datasets. When a tool (or a workflow) is applied to
a dataset container, Galaxy would automatically create a new container
in which each dataset in this new container is the result of running the
tool. (Workflows with N steps would thus generate N new containers.) The
thing I like about this idea is that it preserves the ability to use
tools and workflows on both individual datasets and, with some
additional logic, on these new containers. In particular, I don't think
the tools and workflows themselves would have to be modified. This would
seemingly mitigate the slow Javascript issue by only showing a few items
in the history window (even though Galaxy may have launched many jobs in
the background). Furthermore, a new Reduce tool type could then act to
take a dataset container as input and output a single dataset.

A library doesn't seem a good candidate for the dataset container idea I
have above. I realize that a library also bundles together datasets, but
it has other attributes that don't play well with the above idea (the
idea of hierarchically arranged folders and heterogeneous datasets) nor
can it be represented in the history.

Re: [galaxy-dev] using Galaxy for map/reduce

2011-08-02 Thread Andrew Straw
On 08/02/2011 06:43 PM, James Taylor wrote:
> On Aug 2, 2011, at 10:12 AM, Andrew Straw wrote:
>
>> 1) My first specific problem is that loading many datasets (e.g. 250)
>> into history causes the javascript running locally within a browser to
>> be extremely slow.
> What browser are you using?
Primarily Firefox 3.6.18 as packaged by Ubuntu 10.04 amd64. I know more
recent browsers have faster JS interpreters, but I'm hoping that JS
interpreter speed optimizations will be largely irrelevant with the
dataset container proposal I made. (I just tested, and
it's certainly true that Google Chromium 12.0.742.112 on the same system
is much faster.)


>> 2) My second specific problem is that applying a workflow with N steps
>> to many datasets creates even more datasets (Nx250 additional datasets).
>> In addition to the slow Javascript problem, there seems to be other
>> issues I haven't diagnosed further, but the console in which I'm running
>> run.sh indicates many errors of the type "Exception AssertionError:
>> AssertionError('State <... object at 0x7f5c18c47990> is not present in this
>> identity map',) in <...> ignored". Furthermore the webserver gets slow and my
>> nginx frontend proxy gives 504 gateway time-outs.
> Yes, creating all the jobs and datasets for a workflow is relatively slow 
> right now. We have some optimizations for this that are not in the mainline 
> (not well tested); however, there is a limit to how fast it can be with so many 
> new datasets and objects being created. 
>
> The better solution is probably to move workflow creation into a background 
> process. Starting the workflow would just save the initial state, and a 
> background process would actually create all the datasets and jobs and get it 
> running. The downside is that the history would not be completely populated 
> by the time the page had returned. 

There could be a little checkbox in the run workflow page which allows
the user to decide whether to fork the workflow creation process. Then
the user could do so for a big job but keep synchronous behavior when
just a few jobs are scheduled.


-- 
Andrew D. Straw, Ph.D.
Research Institute of Molecular Pathology (IMP)
Vienna, Austria
http://strawlab.org/



Re: [galaxy-dev] using Galaxy for map/reduce

2011-08-02 Thread James Taylor

On Aug 2, 2011, at 10:12 AM, Andrew Straw wrote:

> 1) My first specific problem is that loading many datasets (e.g. 250)
> into history causes the javascript running locally within a browser to
> be extremely slow.

What browser are you using?

> 2) My second specific problem is that applying a workflow with N steps
> to many datasets creates even more datasets (Nx250 additional datasets).
> In addition to the slow Javascript problem, there seems to be other
> issues I haven't diagnosed further, but the console in which I'm running
> run.sh indicates many errors of the type "Exception AssertionError:
> AssertionError('State <... object at 0x7f5c18c47990> is not present in this
> identity map',) in <...> ignored". Furthermore the webserver gets slow and my
> nginx frontend proxy gives 504 gateway time-outs.

Yes, creating all the jobs and datasets for a workflow is relatively slow right 
now. We have some optimizations for this that are not in the mainline (not well 
tested); however, there is a limit to how fast it can be with so many new 
datasets and objects being created. 

The better solution is probably to move workflow creation into a background 
process. Starting the workflow would just save the initial state, and a 
background process would actually create all the datasets and jobs and get it 
running. The downside is that the history would not be completely populated by 
the time the page had returned. 
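
As a rough illustration of that idea (hypothetical names, not Galaxy code),
the web request would only enqueue a minimal invocation record, and a
background worker would later create the datasets and jobs:

import queue
import threading


class BackgroundWorkflowRunner(object):
    """Sketch of deferring workflow materialization to a background worker."""

    def __init__(self):
        self._pending = queue.Queue()
        threading.Thread(target=self._materialize, daemon=True).start()

    def invoke(self, workflow_steps, inputs):
        """Called from the web request: save minimal state and return at once,
        so the page does not wait for every dataset and job to be created."""
        invocation = {"steps": workflow_steps, "inputs": inputs, "state": "new"}
        self._pending.put(invocation)
        return invocation

    def _materialize(self):
        """Background worker: create the datasets and jobs for each step."""
        while True:
            invocation = self._pending.get()
            invocation["state"] = "scheduling"
            outputs = invocation["inputs"]
            for step in invocation["steps"]:
                outputs = step(outputs)  # stand-in for dataset/job creation
            invocation["state"] = "scheduled"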

-- jt

James Taylor, Assistant Professor, Biology / Computer Science, Emory University







Re: [galaxy-dev] using Galaxy for map/reduce

2011-08-02 Thread Ravi Madduri
Hi
I really like this proposal. We faced some of the same issues you describe below 
when we tried to use Galaxy with high-throughput computing techniques (using 
Condor) to sequence close to 500 genomes (an embarrassingly parallel problem). 
We leveraged (hacked) the dataset construct, but it did not map very well onto 
the problem we were trying to solve. We ended up taking an approach that used 
the Galaxy tool mechanism to create a "Composite Dataset" from a filesystem 
location.  This approach required a configuration file to be updated with a 
directory path containing the datasets.  The directory listing was then filtered 
and displayed to the user of the tool to allow them to select a genome.  The 
tool would then create a composite dataset consisting of a JSON document 
containing at least a list of all the files in the CG data directory. 
We are not sure how generally useful this tool would be.
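
The tool itself isn't included here, but the core step described above - turning
a directory listing into a JSON document that acts as the composite dataset -
might look roughly like this sketch (keys and layout are made up for
illustration):

import json
import os


def build_composite_dataset(data_dir, manifest_path, suffix=""):
    """Write a JSON manifest listing the data files for one genome; the
    manifest then stands in for the whole file collection as a single
    dataset.  Keys and layout here are illustrative only."""
    files = sorted(
        os.path.join(data_dir, name)
        for name in os.listdir(data_dir)
        if name.endswith(suffix)
    )
    manifest = {"data_dir": data_dir, "files": files}
    with open(manifest_path, "w") as out:
        json.dump(manifest, out, indent=2)
    return manifest
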
On Aug 2, 2011, at 10:12 AM, Andrew Straw wrote:

> Hi all,
> 
> I've been investigating use of Galaxy for our lab and it has many
> attractive aspects -- a big thank you to all involved.
> 
> We still have a couple of related sticking points, however, that I would
> like to get the Galaxy developers' feedback on. Basically, I want to use
> Galaxy to run Map/Reduce type analysis on many initial data files. What
> I mean is that I want to take many initial datasets (e.g. 250 or more),
> perhaps already stored in a library, and then apply a workflow to each
> and every one of them (the Map step). Then, on the many result datasets
> (one from each of the initial datasets), I want to run a Reduce step
> which creates a single dataset. I have achieved this in an imperfect and
> not-quite-working way with a few tricks, but I hope that with a little
> work, Galaxy could be much better for this type of use case.
> 
> I have a couple of specific problems and a proposal for a general solution:
> 
> 1) My first specific problem is that loading many datasets (e.g. 250)
> into history causes the javascript running locally within a browser to
> be extremely slow.
> 
> 2) My second specific problem is that applying a workflow with N steps
> to many datasets creates even more datasets (Nx250 additional datasets).
> In addition to the slow Javascript problem, there seems to be other
> issues I haven't diagnosed further, but the console in which I'm running
> run.sh indicates many errors of the type "Exception AssertionError:
> AssertionError('State <... object at 0x7f5c18c47990> is not present in this
> identity map',) in <...> ignored". Furthermore the webserver gets slow and my
> nginx frontend proxy gives 504 gateway time-outs.
> 
> 3) There's no good way to do reduce within Galaxy. Currently I work
> around this by having a tool type which takes as an input a dataset and
> then uploads this to a self-written webserver, which then collects such
> uploads, performs the reduce, and offers a download link for the user to
> collect the reduced dataset. The user must manually then upload this
> dataset back into Galaxy for further processing.
> 
> My proposal for a general solution, and what I'd be interested in
> feedback on, is an idea of a "dataset container" (this is just a working
> name). It would look and act much like a dataset in the history, but
> would in fact be a logical construct that merely bundles together a
> homogeneous bunch of datasets. When a tool (or a workflow) is applied to
> a dataset container, Galaxy would automatically create a new container
> in which each dataset in this new container is the result of running the
> tool. (Workflows with N steps would thus generate N new containers.) The
> thing I like about this idea is that it preserves the ability to use
> tools and workflows on both individual datasets and, with some
> additional logic, on these new containers. In particular, I don't think
> the tools and workflows themselves would have to be modified. This would
> seemingly mitigate the slow Javascript issue by only showing a few items
> in the history window (even though Galaxy may have launched many jobs in
> the background). Furthermore, a new Reduce tool type could then act to
> take a dataset container as input and output a single dataset.
> 
> A library doesn't seem a good candidate for the dataset container idea I
> have above. I realize that a library also bundles together datasets, but
> it has other attributes that don't play well with the above idea (the
> idea of hierarchically arranged folders and heterogeneous datasets) nor
> can it be  represented in the history.
> 
> I'm interested in thoughts on this proposal, as I think it would really
> help us, and I think our use case may be representative of what others
> might also like to do. I realize that in my text above I write "with
> some additional logic" to describe the work required to implement this
> idea, but the fact is that I have very little idea about how much work
> this would be. So, practically speaking, my question boils down to how
> hard would implementing this be, given the existing code base and goals?

Re: [galaxy-dev] using Galaxy for map/reduce

2011-08-02 Thread Peter Cock
On Tue, Aug 2, 2011 at 3:12 PM, Andrew Straw  wrote:
> ...
>
> My proposal for a general solution, and what I'd be interested in
> feedback on, is an idea of a "dataset container" (this is just a working
> name). It would look and act much like a dataset in the history, but
> would in fact be a logical construct that merely bundles together a
> homogeneous bunch of datasets. When a tool (or a workflow) is applied to
> a dataset container, Galaxy would automatically create a new container
> in which each dataset in this new container is the result of running the
> tool. (Workflows with N steps would thus generate N new containers.) The
> thing I like about this idea is that it preserves the ability to use
> tools and workflows on both individual datasets and, with some
> additional logic, on these new containers. In particular, I don't think
> the tools and workflows themselves would have to be modified. This would
> seemingly mitigate the slow Javascript issue by only showing a few items
> in the history window (even though Galaxy may have launched many jobs in
> the background). Furthermore, a new Reduce tool type could then act to
> take a dataset container as input and output a single dataset.
>
> ...

That is a very interesting idea.

Note that in some of the use cases I had in mind the order of the
sub-files was important, but in other cases not. So I think that
internally Galaxy would have to store a "dataset collection"
aka "homogeneous filetype collection" as an ordered list of
filenames.

As you observed, at the level of an individual tool, nothing
changes - it is given single input files as before, but
now multiple copies of the tool will be running, each with a
different input file (or files for more complex tools).

I had been mulling over what is essentially a special case of
this - a new datatype for "collection of BLAST XML files", and
debating with myself if a zip file or simple concatenation would
work here. In the case of BLAST XML files, there is precedent
from early NCBI BLAST tools outputting concatenated XML
files (which are not valid XML).

My motivating example was the embarrassingly parallel task
of multi-query BLAST searches. Here we can split up the input
query file (*) and run the searches separately (the map step).
The potentially hard part is merging the output (the reduce).
Tabular output and plain text can basically be concatenated
(note we should preserve the original query order). For XML
(or -shudder- HTML output), a bit of data munging is needed.
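
For the simple tabular case the merge really is just ordered concatenation;
here is a minimal sketch, assuming the query file was split into consecutive
blocks so that chunk order matches the original query order (the function
name is made up for illustration):

import shutil


def merge_blast_tabular(chunk_outputs, merged_path):
    """Concatenate per-chunk BLAST tabular outputs into one file.

    Assumes the query file was split into consecutive blocks, so writing the
    chunk outputs in chunk order preserves the original query order; XML or
    HTML output would need real munging instead, as noted above."""
    with open(merged_path, "wb") as dst:
        for chunk in chunk_outputs:  # must already be in chunk (query) order
            with open(chunk, "rb") as src:
                shutil.copyfileobj(src, dst)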

Your idea is much more elegant, and to me fits nicely with
a general sub-task parallelization framework (as well as your
example of running a single workflow on a collection of data
files).

Peter

(*) You can also split the BLAST database/subject file, and
there are options to adjust the e-value significance accordingly
(so it is calculated using the full database size, not the partial
database size). The downside is that merging the results is
much more complicated.


[galaxy-dev] using Galaxy for map/reduce

2011-08-02 Thread Andrew Straw
Hi all,

I've been investigating use of Galaxy for our lab and it has many
attractive aspects -- a big thank you to all involved.

We still have a couple of related sticking points, however, that I would
like to get the Galaxy developers' feedback on. Basically, I want to use
Galaxy to run Map/Reduce type analysis on many initial data files. What
I mean is that I want to take many initial datasets (e.g. 250 or more),
perhaps already stored in a library, and then apply a workflow to each
and every one of them (the Map step). Then, on the many result datasets
(one from each of the initial datasets), I want to run a Reduce step
which creates a single dataset. I have achieved this in an imperfect and
not-quite-working way with a few tricks, but I hope that with a little
work, Galaxy could be much better for this type of use case.

I have a couple of specific problems and a proposal for a general solution:

1) My first specific problem is that loading many datasets (e.g. 250)
into history causes the javascript running locally within a browser to
be extremely slow.

2) My second specific problem is that applying a workflow with N steps
to many datasets creates even more datasets (Nx250 additional datasets).
In addition to the slow Javascript problem, there seems to be other
issues I haven't diagnosed further, but the console in which I'm running
run.sh indicates many errors of the type "Exception AssertionError:
AssertionError('State <... object at 0x7f5c18c47990> is not present in this
identity map',) in <...> ignored". Furthermore the webserver gets slow and my
nginx frontend proxy gives 504 gateway time-outs.

3) There's no good way to do reduce within Galaxy. Currently I work
around this by having a tool type which takes as an input a dataset and
then uploads this to a self-written webserver, which then collects such
uploads, performs the reduce, and offers a download link for the user to
collect the reduced dataset. The user must manually then upload this
dataset back into Galaxy for further processing.

My proposal for a general solution, and what I'd be interested in
feedback on, is an idea of a "dataset container" (this is just a working
name). It would look and act much like a dataset in the history, but
would in fact be a logical construct that merely bundles together a
homogeneous bunch of datasets. When a tool (or a workflow) is applied to
a dataset container, Galaxy would automatically create a new container
in which each dataset in this new container is the result of running the
tool. (Workflows with N steps would thus generate N new containers.) The
thing I like about this idea is that it preserves the ability to use
tools and workflows on both individual datasets and, with some
additional logic, on these new containers. In particular, I don't think
the tools and workflows themselves would have to be modified. This would
seemingly mitigate the slow Javascript issue by only showing a few items
in the history window (even though Galaxy may have launched many jobs in
the background). Furthermore, a new Reduce tool type could then act to
take a dataset container as input and output a single dataset.
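
To make the semantics concrete, here is a tiny sketch of the proposal in plain
Python (not Galaxy code, and the names are hypothetical): a container is an
ordered bundle of homogeneous datasets, applying a tool maps it element-wise
into a new container, and a Reduce tool collapses a container into one dataset.

class DatasetContainer(object):
    """Sketch of the proposed container: an ordered, homogeneous bundle of
    datasets that looks like a single item in the history."""

    def __init__(self, datasets, datatype):
        self.datasets = list(datasets)  # ordered, all of the same datatype
        self.datatype = datatype

    def map(self, tool):
        """Applying an ordinary tool yields a new container with one output per
        element; the tool itself is unchanged and still sees single datasets."""
        return DatasetContainer((tool(dataset) for dataset in self.datasets),
                                self.datatype)

    def reduce(self, reduce_tool):
        """A Reduce-type tool consumes the whole container, returning one dataset."""
        return reduce_tool(self.datasets)


# Hypothetical usage: map a per-dataset tool over 250 inputs, then reduce.
# container = DatasetContainer(initial_datasets, datatype="fastq")
# mapped = container.map(quality_trim)          # many jobs behind one history item
# summary = mapped.reduce(concatenate_reports)  # single output dataset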

A library doesn't seem a good candidate for the dataset container idea I
have above. I realize that a library also bundles together datasets, but
it has other attributes that don't play well with the above idea (the
idea of hierarchically arranged folders and heterogeneous datasets) nor
can it be  represented in the history.

I'm interested in thoughts on this proposal, as I think it would really
help us, and I think our use case may be representative of what others
might also like to do. I realize that in my text above I write "with
some additional logic" to describe the work required to implement this
idea, but the fact is that I have very little idea about how much work
this would be. So, practically speaking, my question boils down to how
hard would implementing this be, given the existing code base and goals?
And, would such an implementation - if done to the taste of the Galaxy
devs, of course - have a chance of making into the Galaxy distribution?

Thanks,
Andrew

-- 
Andrew D. Straw, Ph.D.
Research Institute of Molecular Pathology (IMP)
Vienna, Austria
http://strawlab.org/

___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/