Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-19 Thread Ben Gift
You've been extremely helpful, I appreciate it.

So we went ahead and decided that we need this feature. We're planning to
have a lot of people running huge pipelines that work best on a single
node, and there's no reason to do all that writing to a shared file
system when the intermediate step data can live in /tmp/ on that node.
So I've been working on that.

So far I've added the checkbox for using one node (in run.mako). In
workflow.py I catch this and, if the checkbox is checked, set a new
variable called use_one_node on each step of the workflow.
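Roughly, the change looks like this (a simplified sketch only - the real
run.mako form and workflow.py plumbing are more involved, and use_one_node is
just the name I picked):

  ## run.mako: extra checkbox on the run-workflow form
  <input type="checkbox" name="use_one_node" value="true" />
  Run all steps on a single node

  # workflow.py: stamp each step when the box was ticked
  use_one_node = kwargs.get("use_one_node") == "true"
  for step in workflow.steps:
      step.use_one_node = use_one_node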

Now I'm trying to find where jobs are actually run, so that I can add the
logic for picking a node and recording it as a variable on each step.
Could you point me toward the files/classes involved in running a
history's jobs and in selecting nodes (or sending jobs to condor)?

Thanks, and I'll be sure to push this upstream once it's done if you'd
like it. Maybe as something you can turn on from universe_wsgi.ini.
___
Please keep all replies on the list by using reply all
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:
  http://lists.bx.psu.edu/

To search Galaxy mailing lists use the unified search at:
  http://galaxyproject.org/search/mailinglists/

[galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-17 Thread Ben Gift
We've run into a scenario lately where we need to run a very large workflow
(with huge data in the intermediate steps) many times. We can't do this
because Galaxy copies all intermediate steps to all nodes, which would bog
down the servers too much.

I asked about something similar before, and John mentioned that a feature to
automatically delete a workflow's intermediate step data once it completes
was coming soon. Is that a feature now? That would help.

Ultimately though we can't be copying all this data around to all nodes.
The network just isn't good enough, so I have an idea.

What if we had an option on the 'run workflow' screen to run on only one
node (unfortunately giving up Galaxy's neat concurrency for that workflow)?
Then only the final step's data would be propagated.

Or maybe copy to only a couple of other nodes, to keep some concurrency.

If a job errors in this mode, I think it should just throw out all the
data, or propagate whatever had been produced up to the point where it stopped.

I've been trying to implement this myself, but it's taking me a long time.
I've only just started to understand the Galaxy web stack, and I'm adding
the checkbox to the run.mako template. I still need to learn the database
schema, the message passing, how jobs are stored, and how to tell condor to
use only one node (and more, I'm sure). I'm drowning.

This seems like a really important feature, though, as Galaxy gains more
traction as a research tool for bigger projects that demand working with
huge data and running huge workflows many, many times.

Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-17 Thread Ben Gift
Hi John, thanks for the reply.

Yes, I mean Galaxy's default behavior of keeping all the data on all nodes
of our condor cluster. So for instance if I run a job, then the output of
that job is copied to every node in the cluster. Is this not the normal
behavior?


On Tue, Dec 17, 2013 at 9:42 AM, John Chilton chil...@msi.umn.edu wrote:

 Hey Ben,

 Thanks for the e-mail. I did not promise anything was coming soon, I
 only said people were working on parts of it. It is not a feature yet
 unfortunately - multiple people including myself are thinking about
 various parts of this problem though.

 I would like to respond, but I am trying to understand this line: "We
 can't do this because Galaxy copies all intermediate steps to all
 no(d)es, which would bog down the servers too much."

 Can you describe how you are doing this staging for me? Is data
 currently being copied around to all the nodes, and if so, how are you
 doing that? Or are you trying to say that Galaxy requires the data to
 be available on all of the nodes?

 -John

 On Tue, Dec 17, 2013 at 11:15 AM, Ben Gift corn8b...@gmail.com wrote:
  [original message quoted in full above; snipped]


Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-17 Thread Ben Gift
How do you have it set up on the main public Galaxy install? I imagine
people run enough big jobs that there is enormous use of your shared file
system. How did you scale that to so many nodes without bogging down the
file system with large dataset transfers?

It seems that, for now, the solution of having a second Galaxy instance will
work well - thank you very much, John :). But I'm still interested in a more
permanent, scalable solution. After reading up more on our shared file
system, it still seems like heavy traffic is bad, so could my initial idea
still be worthwhile?

Re: [galaxy-dev] Request: Option to reduce server data transfer for big workflow in cluster

2013-12-17 Thread Ben Gift
This does mostly make sense, and is very illuminating. I appreciate all the
help and I'm sorry that I'm so new to this.

I'm not sure I fully understand, though. Do you mean that I could have a
main Galaxy install set up for the 200 nodes, for general-purpose use with a
shared file system, and a specialized install that uses only 10 nodes with
its own shared file system, so that it doesn't bog down the main one? And
then use the specialized install when running huge workflows?


On Tue, Dec 17, 2013 at 10:11 AM, John Chilton chil...@msi.umn.edu wrote:

 Hey Ben,

    Hmmm... I don't think Galaxy is doing that - not directly anyway.
 Unless I am mistaken, Galaxy will put the file in one location on the
 web server node or VM. Typically, this location is on a filesystem
 that is shared between the web server and the cluster's compute nodes,
 so I wouldn't describe that as Galaxy copying the data to all of the
 compute nodes. If your cluster doesn't have a shared file system, and
 someone has configured the Galaxy data to be synced across all nodes
 to get around that, then the data would indeed be copied to all nodes -
 but I don't think Galaxy is doing that. Alternatively, you might have
 condor configured to transfer the Galaxy data to the remote nodes in
 such a way that every time a Galaxy job runs on a node, all the data
 is copied to that node?

   Does that make sense?

    So I still don't entirely understand your setup, but my advice is
 pretty general - for now you may want to solve this problem at the
 condor level. I am assuming this is a general-purpose condor cluster
 and not one set up explicitly for Galaxy? Let's say you have 200 nodes
 in your condor cluster and they cannot all mount the Galaxy filesystem,
 because that would overload the file server being used by Galaxy. I
 think you could set up a FileSystemDomain at the condor level that,
 say, just 10 of the nodes belong to (these 10 nodes can continue to
 run anything in general, but Galaxy will only submit to them). This
 FileSystemDomain could have a name like galaxy.example.com if
 example.com is your default FileSystemDomain. Then you can set up the
 Galaxy condor runner with a requirement such that
 FileSystemDomain == "galaxy.example.com", and Galaxy jobs will then
 only run on these 10 nodes. Having 10 nodes mount a file server is
 much more manageable than 200.

 -John


 On Tue, Dec 17, 2013 at 11:52 AM, Ben Gift corn8b...@gmail.com wrote:
  [earlier messages quoted in full above; snipped]
[galaxy-dev] Write all intermediate workflow data to /tmp/ only

2013-11-12 Thread Ben Gift
I'm working with a lot of data on a cluster (condor). If I save all the
workflow intermediate data, as Galaxy does by default (and rightfully so),
it fills the drives.

How can I tell Galaxy to use /tmp/ to store all intermediate data in a
workflow, and keep only the final result?

I imagine I'll have to work on how Galaxy handles jobs, but I'm hoping
there is something built in for this.

Thanks

[galaxy-dev] how to generate .len file for custom genome? (trackster error)

2013-10-01 Thread Ben Gift
Trackster complains to me when I try to load up my custom reference genome
to compare some sample data to.

"could not load chroms for this dbkey"

I think it's because I'm using the newest g1k human ref genome v37, and I
don't know if the hg19.len file works with it...

I generated my own twobit file for it, but now I think I need the .len
file...

To be honest I'm kind of lost. Where does trackster even look for my
genome, and how do I generate a .len file?
(I read the visualizations page and the custom genome docs page)
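For anyone hitting the same thing - one way to build a .len file, assuming
samtools is available. Galaxy's .len format is just "chromosome<TAB>length",
one line per chromosome; the target path below is a common location for dbkey
.len files, but check your own instance (file names are placeholders):

  samtools faidx g1k_v37.fa                # writes g1k_v37.fa.fai
  cut -f1,2 g1k_v37.fa.fai > g1k_v37.len   # first two .fai columns are name and length
  # adjust <galaxy_root> to wherever your instance keeps its .len files
  cp g1k_v37.len <galaxy_root>/tool-data/shared/ucsc/chrom/g1k_v37.len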

[galaxy-dev] I wrote a small script to reload tool pages

2013-08-02 Thread Ben Gift
Hi, I wanted to reload XML tool wrappers without using the browser.
Going off of this page...

http://lists.bx.psu.edu/pipermail/galaxy-dev/2012-March/009126.html

I wrote a little Python script that takes the tool name (the thing at the
top of the XML wrapper file) as a parameter and reloads it.

Setup: Run this command

  curl --cookie-jar galaxy_cookie.txt --data-ascii \
    'email=YOUR_EMAIL_NAME%40GMAIL.COM&webapp=galaxy&password=YOUR_PASSWORD&login_button=Login' \
    YOUR_GALAXY_WEBPAGE/user/login


Afterwards, make sure the galaxy_cookie.txt file is in the same directory
as the Python script.
Next, if your instance isn't on localhost:8080, update the Python file and
change the localhost references to your instance's address.

Run the Python script as

  ./reload_tool.py --name="your tool name"

For instance, I run ./reload_tool.py --name="BWA for Illumina" and it
reloads that tool.

Hopefully attachments work on the mailing list. But here's a mirror if not:
http://pastebin.com/jHbbizQV
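
(Not the attached script - just a rough sketch of the same idea in Python,
assuming the old admin reload form lives at /admin/reload_tool and accepts a
tool_id field plus a reload_tool_button value; both of those are assumptions,
so check them against your Galaxy version. It reuses the cookie jar written
by the curl login above.)

  #!/usr/bin/env python
  # Sketch: reload one tool through the admin web form using the saved login cookie.
  import sys
  import urllib.parse
  import urllib.request
  from http.cookiejar import MozillaCookieJar

  GALAXY_URL = "http://localhost:8080"  # change to your instance's address

  def reload_tool(tool_id):
      # curl writes a Netscape-format cookie file, which MozillaCookieJar can read
      jar = MozillaCookieJar("galaxy_cookie.txt")
      jar.load(ignore_discard=True, ignore_expires=True)
      opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
      data = urllib.parse.urlencode({
          "tool_id": tool_id,              # assumed field name
          "reload_tool_button": "Reload",  # assumed submit button name
      }).encode()
      with opener.open(GALAXY_URL + "/admin/reload_tool", data) as resp:
          print(resp.status, resp.reason)

  if __name__ == "__main__":
      reload_tool(sys.argv[1])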


[attachment: reload_tool.py]

[galaxy-dev] Choosing which nodes to run on for a whole history (Galaxy cluster)

2013-07-31 Thread Ben Gift
I'm setting up Galaxy and Torque with my cluster and I was wondering if I
could set it up so that nodes could be assigned for a whole history. This
way I can re-run histories on certain nodes for benchmarking runtimes.

It seems that the closest I can get with the built-in functionality is
specifying nodes on a per-tool basis, which is good, but it would be painful
to have to specify this for every tool in a pipeline each time.

So how difficult would it be for me to add this feature, and which files
should I start with in the code base? Or have I misunderstood and this is
actually built in?
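
For reference, the per-tool route mentioned above looks roughly like this in
job_conf.xml (destination and tool ids are made up, and whether Resource_List
is the right param name depends on the pbs runner in your Galaxy version):

  <plugin id="pbs" type="runner" load="galaxy.jobs.runners.pbs:PBSJobRunner"/>
  <destination id="benchmark_node" runner="pbs">
    <!-- pin jobs to one named Torque node; node001 is a placeholder hostname -->
    <param id="Resource_List">nodes=node001</param>
  </destination>
  <tools>
    <tool id="bwa_wrapper" destination="benchmark_node"/>
  </tools>

Every tool of interest needs its own <tool .../> line, which is exactly the
per-tool bookkeeping being asked about here.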

Thanks

[galaxy-dev] Does the xml tool wrapper language have ifs or functions?

2013-07-29 Thread Ben Gift
I'm finding it hard to deal with mutually exclusive options that also share
many options between them for certain command-line tools. Functions or if
statements would solve this copy/pasting problem.
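
For what it's worth, the two facilities that usually cover this are a
<conditional> block in the inputs plus Cheetah #if in the command; the tool
and parameter names below are made up:

  <command>
    mytool --input $input1 --out $output
    #if $mode.mode_select == "paired"
      --paired $mode.input2
    #end if
  </command>
  <inputs>
    <param name="input1" type="data" format="fastq" label="Reads"/>
    <conditional name="mode">
      <param name="mode_select" type="select" label="Library type">
        <option value="single">Single-end</option>
        <option value="paired">Paired-end</option>
      </param>
      <when value="single"/>
      <when value="paired">
        <param name="input2" type="data" format="fastq" label="Mate reads"/>
      </when>
    </conditional>
  </inputs>
  <outputs>
    <data name="output" format="fastq"/>
  </outputs>

The shared options live once on the command line, and only the mutually
exclusive part sits inside the conditional.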

Thanks