Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-20 Thread Nate Coraor
Hi Ben,

The job running code is in lib/galaxy/jobs/.  Galaxy jobs get a
"wrapper", which includes the start/finish methods; that's in
__init__.py.  handler.py is what dispatches jobs out to the various
runner plugins, finds new jobs to run, and generally controls the
operation.  runners/*.py are the individual DRM plugins.
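
For orientation, here's a very rough sketch of how those pieces fit
together -- simplified, with illustrative names and signatures rather than
the real Galaxy classes, so treat it as a map of the directory rather than
working Galaxy code:

    # A much-simplified picture of lib/galaxy/jobs/ (names are illustrative).

    class JobWrapper(object):
        """__init__.py: per-job state plus the prepare/finish logic."""
        def __init__(self, job, destination):
            self.job = job
            self.destination = destination  # e.g. "condor"

        def prepare(self):
            # stage inputs and build the command line before submission
            pass

        def finish(self, stdout, stderr):
            # collect outputs and set job/dataset states once the DRM reports done
            pass

    class CondorJobRunner(object):
        """runners/condor.py (one plugin per DRM)."""
        def queue_job(self, job_wrapper):
            job_wrapper.prepare()
            # write a submit file, call condor_submit, then poll for completion
            print("submitting %s to condor" % job_wrapper.job)

    class JobHandler(object):
        """handler.py: finds new jobs and dispatches them to runner plugins."""
        def __init__(self):
            self.runners = {"condor": CondorJobRunner()}

        def dispatch(self, job_wrapper):
            self.runners[job_wrapper.destination].queue_job(job_wrapper)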

This is an interesting solution and I'd like to see the implementation.

--nate

On Thu, Dec 19, 2013 at 6:33 PM, Ben Gift  wrote:
> You've been extremely helpful, I appreciate it.
>
> So we went ahead and decided that we need this feature. We're planning to
> have a lot of people running huge pipelines that work best on one node,
> and there's no reason to do all this writing to a shared file system when
> a workflow can run on a single node and keep its intermediate step data
> in /tmp/. So I've been working on that.
>
> So far I've made the checkbox for using one node (in run.mako). In
> workflow.py I catch this and, if the checkbox is checked, set a new
> variable called use_one_node on each step of the workflow.
>
> Now I'm trying to find where jobs are run, so that I can put the logic in
> for getting a node to run on, and setting that as a variable on each step.
> Could you point me in the direction of the files/classes associated with
> running the history's jobs, and getting nodes (or sending jobs to condor?)?
>
> Thanks, and I'll be sure to push this upstream after it's done if you'd like
> it. Maybe as something you can turn on from universe_wsgi.ini.
>


Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-19 Thread Ben Gift
You've been extremely helpful, I appreciate it.

So we went ahead and decided that we need this feature. We're planning to
have a lot of people running huge pipelines that work best on one node,
and there's no reason to do all this writing to a shared file system when
a workflow can run on a single node and keep its intermediate step data in
/tmp/. So I've been working on that.

So far I've made the checkbox for using one node (in run.mako). In
workflow.py I catch this and, if the checkbox is checked, set a new
variable called use_one_node on each step of the workflow.
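
Roughly the shape of it so far -- a simplified sketch, where the parameter
name, how the form value arrives, and the step attribute are my own guesses
rather than final code:

    # If the "use one node" box was checked on the run-workflow form, tag
    # every step so the job-running code can later pin the whole workflow to
    # a single node and keep its intermediate data in /tmp. Names are
    # illustrative only, not the final implementation.
    def tag_steps_for_single_node(workflow, form_params):
        checked = str(form_params.get("use_one_node", "")).lower() in ("true", "on", "1")
        for step in workflow.steps:
            step.use_one_node = checked
        return checked

The harder part is the next bit -- picking a node and getting condor to
honor it -- which is what I'm asking about below.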

Now I'm trying to find where jobs are run, so that I can put the logic in
for getting a node to run on, and setting that as a variable on each step.
Could you point me in the direction of the files/classes associated with
running the history's jobs, and getting nodes (or sending jobs to condor?)?

Thanks, and I'll be sure to push this upstream after it's done if you'd
like it. Maybe as something you can turn on from universe_wsgi.ini.

Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-18 Thread John Chilton
File system performance varies wildly between storage architectures.
There are storage server setups that can easily scale to orders of
magnitude beyond the compute that backs usegalaxy.org - suffice it to say
we are currently bound by the number of cores we have available and
not by IO/network performance. (Nate may elaborate on the specifics of
the public server's setup, but I am not sure it is useful to you unless
you have hundreds or thousands (or millions) of dollars to spend on
new storage and network hardware :) ).

Also my idea was not a second Galaxy instance - sorry I did not make
that clearer. It was to restrict your current Galaxy instance to a
smaller portion of your cluster. If your cluster is completely
dedicated to Galaxy, however, this idea doesn't make sense; but if this
is a shared condor cluster used for other things besides Galaxy, it
could make sense.

Sorry I have not been more helpful.

-John

On Tue, Dec 17, 2013 at 6:12 PM, Ben Gift  wrote:
> How do you have it set up on the main public Galaxy install? I imagine that
> people run enough big jobs that there is enormous use of your shared
> file system. How did you scale that to so many nodes without bogging down
> the file system with large dataset transfers?
>
> It seems that for now the solution of having a second Galaxy instance will
> work well, thank you very much John :) . But I'm still interested in a more
> permanent scaled solution. After reading up more on our shared file system
> it still seems like heavy traffic is bad, so could my initial idea still be
> good?


Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-17 Thread Ben Gift
This does mostly make sense, and is very illuminating. I appreciate all the
help and I'm sorry that I'm so new to this.

I'm not sure I fully understand though. Do you mean that I could have a
main Galaxy install set up for the 200 nodes, for general-purpose use with a
shared file system, and a specialized setup that only uses 10 with its own
shared file system so that it doesn't bog down the main one? And to use the
specialized install when running huge workflows?


On Tue, Dec 17, 2013 at 10:11 AM, John Chilton  wrote:

> Hey Ben,
>
>   Hmmm... I don't think Galaxy is doing that - not directly anyway.
> Unless I am mistaken, Galaxy will put the file in one location on the
> web server node or VM. Typically, this location is on a filesystem
> that is shared between the web server and a cluster's compute nodes.
> So I wouldn't describe that as Galaxy copying the data to all of the
> compute nodes. If your cluster doesn't have a shared file system and,
> to get around this, someone has configured the Galaxy data to be synced
> across all nodes, that would be copying the data to all nodes - but I
> don't think Galaxy itself is doing that. Alternatively, you might have
> condor configured to copy the Galaxy data to the remote nodes in such
> a way that every time a Galaxy job is run on a node, all the data is
> copied to that node?
>
>   Does that make sense?
>
>   So I still don't entirely understand your setup, but my advice is
> pretty general - for now you may want to solve this problem at the
> condor level. I am assuming this is a general-purpose condor cluster
> and not set up explicitly for Galaxy? Let's say you have 200 nodes in
> your condor cluster and they cannot all mount the Galaxy filesystem,
> because it would overload the file server being used by Galaxy. I
> think you could set up a FileSystemDomain at the condor level that,
> say, just 10 of the nodes belong to (these 10 nodes can continue to
> run anything in general, but Galaxy will only submit to them). This
> FileSystemDomain could have a name like galaxy.example.com if
> example.com is your default FileSystemDomain. Then you can set up the
> Galaxy condor runner with a requirement such that
> "FileSystemDomain=galaxy.example.com", and Galaxy jobs will then only
> run on these 10 nodes. Having 10 nodes mount a file server is much
> more manageable than 200.
>
> -John
>
>
> On Tue, Dec 17, 2013 at 11:52 AM, Ben Gift  wrote:
> > Hi John, thanks for the reply.
> >
> > Yes, I mean Galaxy's default behavior of keeping all the data on all
> nodes
> > of our condor cluster. So for instance if I run a job, then the output of
> > that job is copied to every node in the cluster. Is this not the normal
> > behavior?
> >
> >
> > On Tue, Dec 17, 2013 at 9:42 AM, John Chilton 
> wrote:
> >>
> >> Hey Ben,
> >>
> >> Thanks for the e-mail. I did not promise anything was coming soon, I
> >> only said people were working on parts of it. It is not a feature yet
> >> unfortunately - multiple people including myself are thinking about
> >> various parts of this problem though.
> >>
> >> I would like to respond, but I am trying to understand this line: "We
> >> can't do this because Galaxy copies all intermediate steps to all
> >> no(d)es, which would bog down the servers too much."
> >>
> >> Can you describe how you are doing this staging for me? Is data
> >> currently being copied around to all the nodes, if so how are you
> >> doing that? Or are you trying to say that Galaxy requires the data to
> >> be available on all of the nodes?
> >>
> >> -John
> >>
> >> On Tue, Dec 17, 2013 at 11:15 AM, Ben Gift  wrote:
> >> > We've run into a scenario lately where we need to run a very large
> >> > workflow
> >> > (huge data in intermediate steps) many times. We can't do this because
> >> > Galaxy copies all intermediate steps to all notes, which would bog
> down
> >> > the
> >> > servers too much.
> >> >
> >> > I asked about something similar before and John mentioned the feature
> to
> >> > automatically delete intermediate step data in a workflow once it
> >> > completed,
> >> > was coming soon. Is that a feature now? That would help.
> >> >
> >> > Ultimately though we can't be copying all this data around to all
> nodes.
> >> > The
> >> > network just isn't good enough, so I have an idea.
> >> >
> >> > What if we have an option on the 'run workflow' screen to only run on
> >> > one
> >> > node (eliminating the neat Galaxy concurrency ability for that
> workflow
> >> > unfortunately)? Then it just propagates the final step data.
> >> >
> >> > Or maybe only copy to a couple other nodes, to keep concurrency.
> >> >
> >> > If the job errored then in this case I think it should just throw out
> >> > all
> >> > the data, or propagate where it stopped.
> >> >
> >> > I've been trying to work on implementing this myself but it's taking
> >> > me a long time. I only just started understanding the pyramid stack,
> >> > and am putting in the checkbox in the run.mako template. I still need
> >> > to learn the database schema, message passing, how jobs are stored,
> >> > and how to tell condor to only use 1 node (and more I'm sure) in
> >> > Galaxy.

Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-17 Thread Ben Gift
How do you have it set up on the main public Galaxy install? I imagine that
people run enough big jobs that there is enormous use of your shared
file system. How did you scale that to so many nodes without bogging down
the file system with large dataset transfers?

It seems that for now the solution of having a second Galaxy instance will
work well, thank you very much John :) . But I'm still interested in a more
permanent scaled solution. After reading up more on our shared file system
it still seems like heavy traffic is bad, so could my initial idea still be
good?

Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-17 Thread John Chilton
Hey Ben,

  Hmmm... I don't think Galaxy is doing that - not directly anyway.
Unless I am mistaken, Galaxy will put the file in one location on the
web server node or VM. Typically, this location is on a filesystem
that is shared between the web server and a cluster's compute nodes.
So I wouldn't describe that as Galaxy copying the data to all of the
compute nodes. If your cluster doesn't have a shared file system and,
to get around this, someone has configured the Galaxy data to be synced
across all nodes, that would be copying the data to all nodes - but I
don't think Galaxy itself is doing that. Alternatively, you might have
condor configured to copy the Galaxy data to the remote nodes in such
a way that every time a Galaxy job is run on a node, all the data is
copied to that node?

  Does that make sense?

  So I still don't entirely understand your setup, but my advice is
pretty general - for now you may want to solve this problem at the
condor level. I am assuming this is a general-purpose condor cluster
and not set up explicitly for Galaxy? Let's say you have 200 nodes in
your condor cluster and they cannot all mount the Galaxy filesystem,
because it would overload the file server being used by Galaxy. I
think you could set up a FileSystemDomain at the condor level that,
say, just 10 of the nodes belong to (these 10 nodes can continue to
run anything in general, but Galaxy will only submit to them). This
FileSystemDomain could have a name like galaxy.example.com if
example.com is your default FileSystemDomain. Then you can set up the
Galaxy condor runner with a requirement such that
"FileSystemDomain=galaxy.example.com", and Galaxy jobs will then only
run on these 10 nodes. Having 10 nodes mount a file server is much
more manageable than 200.
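
Concretely, something like this -- an untested sketch where the domain name
is a placeholder and the exact way the requirement gets passed through
Galaxy's condor runner may differ in your setup:

    # condor_config.local on the ~10 nodes that mount the Galaxy file server:
    FILESYSTEM_DOMAIN = galaxy.example.com

    # and in the submit description Galaxy's condor runner generates, a
    # requirement so Galaxy jobs only match those nodes:
    requirements = (TARGET.FileSystemDomain == "galaxy.example.com")

The other 190 nodes would keep whatever FILESYSTEM_DOMAIN they already
have, so non-Galaxy jobs still match them as before.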

-John


On Tue, Dec 17, 2013 at 11:52 AM, Ben Gift  wrote:
> Hi John, thanks for the reply.
>
> Yes, I mean Galaxy's default behavior of keeping all the data on all nodes
> of our condor cluster. So for instance if I run a job, then the output of
> that job is copied to every node in the cluster. Is this not the normal
> behavior?
>
>
> On Tue, Dec 17, 2013 at 9:42 AM, John Chilton  wrote:
>>
>> Hey Ben,
>>
>> Thanks for the e-mail. I did not promise anything was coming soon, I
>> only said people were working on parts of it. It is not a feature yet
>> unfortunately - multiple people including myself are thinking about
>> various parts of this problem though.
>>
>> I would like to respond, but I am trying to understand this line: "We
>> can't do this because Galaxy copies all intermediate steps to all
>> no(d)es, which would bog down the servers too much."
>>
>> Can you describe how you are doing this staging for me? Is data
>> currently being copied around to all the nodes, if so how are you
>> doing that? Or are you trying to say that Galaxy requires the data to
>> be available on all of the nodes?
>>
>> -John
>>
>> On Tue, Dec 17, 2013 at 11:15 AM, Ben Gift  wrote:
>> > We've run into a scenario lately where we need to run a very large
>> > workflow
>> > (huge data in intermediate steps) many times. We can't do this because
>> > Galaxy copies all intermediate steps to all notes, which would bog down
>> > the
>> > servers too much.
>> >
>> > I asked about something similar before and John mentioned the feature to
>> > automatically delete intermediate step data in a workflow once it
>> > completed,
>> > was coming soon. Is that a feature now? That would help.
>> >
>> > Ultimately though we can't be copying all this data around to all nodes.
>> > The
>> > network just isn't good enough, so I have an idea.
>> >
>> > What if we have an option on the 'run workflow' screen to only run on
>> > one
>> > node (eliminating the neat Galaxy concurrency ability for that workflow
>> > unfortunately)? Then it just propagates the final step data.
>> >
>> > Or maybe only copy to a couple other nodes, to keep concurrency.
>> >
>> > If the job errored then in this case I think it should just throw out
>> > all
>> > the data, or propagate where it stopped.
>> >
>> > I've been trying to work on implementing this myself but it's taking me
>> > a
>> > long time. I only just started understanding the pyramid stack, and am
>> > putting in the checkbox in the run.mako template. I still need to learn
>> > the
>> > database schema, message passing, and how jobs are stored, and how to
>> > tell
>> > condor to only use 1 node, (and more I'm sure) in Galaxy. (I'm drowning)
>> >
>> > This seems like a really important feature though as Galaxy gains more
>> > traction as a research tool for bigger projects that demand working with
>> > huge data, and running huge workflows many many times.
>> >

Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-17 Thread Ben Gift
Hi John, thanks for the reply.

Yes, I mean Galaxy's default behavior of keeping all the data on all nodes
of our condor cluster. So for instance if I run a job, then the output of
that job is copied to every node in the cluster. Is this not the normal
behavior?


On Tue, Dec 17, 2013 at 9:42 AM, John Chilton  wrote:

> Hey Ben,
>
> Thanks for the e-mail. I did not promise anything was coming soon, I
> only said people were working on parts of it. It is not a feature yet
> unfortunately - multiple people including myself are thinking about
> various parts of this problem though.
>
> I would like to respond, but I am trying to understand this line: "We
> can't do this because Galaxy copies all intermediate steps to all
> no(d)es, which would bog down the servers too much."
>
> Can you describe how you are doing this staging for me? Is data
> currently being copied around to all the nodes, if so how are you
> doing that? Or are you trying to say that Galaxy requires the data to
> be available on all of the nodes?
>
> -John
>
> On Tue, Dec 17, 2013 at 11:15 AM, Ben Gift  wrote:
> > We've run into a scenario lately where we need to run a very large
> workflow
> > (huge data in intermediate steps) many times. We can't do this because
> > Galaxy copies all intermediate steps to all notes, which would bog down
> the
> > servers too much.
> >
> > I asked about something similar before and John mentioned the feature to
> > automatically delete intermediate step data in a workflow once it
> completed,
> > was coming soon. Is that a feature now? That would help.
> >
> > Ultimately though we can't be copying all this data around to all nodes.
> The
> > network just isn't good enough, so I have an idea.
> >
> > What if we have an option on the 'run workflow' screen to only run on one
> > node (eliminating the neat Galaxy concurrency ability for that workflow
> > unfortunately)? Then it just propagates the final step data.
> >
> > Or maybe only copy to a couple other nodes, to keep concurrency.
> >
> > If the job errored then in this case I think it should just throw out all
> > the data, or propagate where it stopped.
> >
> > I've been trying to work on implementing this myself but it's taking me a
> > long time. I only just started understanding the pyramid stack, and am
> > putting in the checkbox in the run.mako template. I still need to learn
> the
> > database schema, message passing, and how jobs are stored, and how to
> tell
> > condor to only use 1 node, (and more I'm sure) in Galaxy. (I'm drowning)
> >
> > This seems like a really important feature though as Galaxy gains more
> > traction as a research tool for bigger projects that demand working with
> > huge data, and running huge workflows many many times.
> >

Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-17 Thread John Chilton
Hey Ben,

Thanks for the e-mail. I did not promise anything was coming soon, I
only said people were working on parts of it. It is not a feature yet
unfortunately - multiple people including myself are thinking about
various parts of this problem though.

I would like to respond, but I am trying to understand this line: "We
can't do this because Galaxy copies all intermediate steps to all
no(d)es, which would bog down the servers too much."

Can you describe how you are doing this staging for me? Is data
currently being copied around to all the nodes, if so how are you
doing that? Or are you trying to say that Galaxy requires the data to
be available on all of the nodes?

-John

On Tue, Dec 17, 2013 at 11:15 AM, Ben Gift  wrote:
> We've run into a scenario lately where we need to run a very large workflow
> (huge data in intermediate steps) many times. We can't do this because
> Galaxy copies all intermediate steps to all notes, which would bog down the
> servers too much.
>
> I asked about something similar before and John mentioned that the feature
> to automatically delete intermediate step data in a workflow, once it
> completed, was coming soon. Is that a feature now? That would help.
>
> Ultimately though we can't be copying all this data around to all nodes. The
> network just isn't good enough, so I have an idea.
>
> What if we have an option on the 'run workflow' screen to only run on one
> node (eliminating the neat Galaxy concurrency ability for that workflow
> unfortunately)? Then it just propagates the final step data.
>
> Or maybe only copy to a couple other nodes, to keep concurrency.
>
> If the job errored then in this case I think it should just throw out all
> the data, or propagate where it stopped.
>
> I've been trying to work on implementing this myself but it's taking me a
> long time. I only just started understanding the pyramid stack, and am
> putting in the checkbox in the run.mako template. I still need to learn the
> database schema, message passing, and how jobs are stored, and how to tell
> condor to only use 1 node, (and more I'm sure) in Galaxy. (I'm drowning)
>
> This seems like a really important feature though as Galaxy gains more
> traction as a research tool for bigger projects that demand working with
> huge data, and running huge workflows many many times.
>
___
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/