Re: [galaxy-dev] Managing Data Locality

2013-11-08 Thread Paniagua, Eric
Hi John,

I have now read the top-level documentation for LWR and gone through the 
sample configurations.  I would appreciate it if you would answer a few 
technical questions for me.

1) How exactly is the "staging_directory" in "server.ini.sample" used?  Is it 
intended to be the final location for files on the remote server?  How is the 
relative path structure under $GALAXY_ROOT/database/files handled?

2) What exactly does "persistence_directory" in "server.ini.sample" mean?  
Where should it be located, and how will it be used?

3) What exactly does "file_cache_dir" in "server.ini.sample" mean?

4) Does LWR preserve some relative path (e.g. to GALAXY_ROOT) under the above 
directories?

5) Are files renamed when cached?  If so, are they eventually restored to their 
original names?

6) Is it possible to customize the DRMAA and/or qsub requests made by LWR, for 
example to include additional settings such as Project or a memory limit?  Is 
it possible to customize this on a case by case basis, rather than globally?

7) Are there any options for the "queued_drmaa" manager in 
"job_managers.ini.sample" which are not listed in that file?

8) What exactly are the differences between the "queued_drmaa" manager and the 
"queued_cli" manager?  Are there any options for the latter which are not in 
the "job_managers.ini.sample" file?

9) When I attempt to run LWR (not having completed all the mentioned 
preparation steps, namely without setting DRMAA_LIBRARY_PATH), I get a 
segmentation fault.  Is this because it can't find DRMAA, or is it potentially 
unrelated?  In case it's the latter, here's the error being output to the 
console:

./run.sh: line 65: 26277 Segmentation fault  paster serve server.ini "$@"
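
(For completeness, this is how I would expect to point LWR at the DRMAA 
library before launching; the .so path below is only a guess at a typical 
location, not a verified value:)

export DRMAA_LIBRARY_PATH=/usr/lib/libdrmaa.so  # assumed path; adjust to the actual DRMAA shared library
./run.sh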

Lastly, a simple comment which I hope is helpful.  It would be nice if the LWR 
install docs at least mentioned that PyOpenSSL 0.13 (or later) depends on 
OpenSSL 0.9.8f (or later), maybe even with a note that "pip" will honor the 
environment variables CFLAGS and LDFLAGS in the event one is creating a 
local installation of the OpenSSL library for LWR to use.
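
Something along these lines should work, assuming a local OpenSSL build under 
/opt/openssl (an illustrative prefix, not a path from our systems):

CFLAGS="-I/opt/openssl/include" LDFLAGS="-L/opt/openssl/lib" pip install pyopenssl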

Thank you for your time and assistance.

Best,
Eric


Re: [galaxy-dev] Managing Data Locality

2013-11-08 Thread Paniagua, Eric
Hi John,

I was just wondering, did you have an object-store-based suggestion as well?  
Logically, that seems to be the layer where this operation should be done, but 
I don't see much infrastructure to support it, such as logic for moving a data 
object between object stores.  (Incidentally, the release of Galaxy I'm running 
is from last April or May.  Would an upgrade to the latest and greatest version 
pull in more support infrastructure for this?)

Regarding your LWR suggestion, admittedly I have not yet read the docs you 
referred me to, but I thought a second email was warranted anyway.  We would in 
fact be using DRMAA to talk to the HPCC (this is being configured as I write), 
and Galaxy's long-term storage lives on our independent Galaxy server.  As I 
may have commented before, we can't simply mount our Galaxy file systems on 
the HPCC for security reasons.  To make the scenario even more concrete, we are 
currently using the DistributedObjectStore to balance Galaxy's storage 
requirements across three mounted volumes (sketched below).  I don't expect 
this to complicate the task at hand, but please do let me know if you think it 
will.  We also currently have UGE set up on our Galaxy server, so we've already 
been using DRMAA to submit jobs.  The details for submission to another host 
are more complicated, though.
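
(For concreteness, our DistributedObjectStore is configured with an XML file 
along these lines; the backend ids, weights, and paths are illustrative, not 
our actual values:)

<?xml version="1.0"?>
<backends>
    <!-- "weight" controls how new datasets are balanced across volumes -->
    <backend id="files1" type="disk" weight="1">
        <files_dir path="/mnt/vol1/galaxy/files"/>
    </backend>
    <backend id="files2" type="disk" weight="1">
        <files_dir path="/mnt/vol2/galaxy/files"/>
    </backend>
    <backend id="files3" type="disk" weight="1">
        <files_dir path="/mnt/vol3/galaxy/files"/>
    </backend>
</backends>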

Does your LWR suggestion involve the use of "scripts/drmaa_external_killer.py", 
"scripts/drmaa_external_runner.py", and "scripts/external_chown_script.py"?  
Particularly if so, would you be so kind as to point me toward documentation 
for those scripts?  It's not clear to me from their source how they are 
intended to be used or at what stage of the job creation process they would be 
called by Galaxy.  The same applies to the "file_actions.json" file you 
referred to previously.  Is that a Galaxy file or an LWR file?  Where may I 
find some documentation on the available configuration attributes, options, 
values, and semantics?  Does your LWR suggestion require that the same absolute 
path structure exist on both file systems (not much information is conveyed by 
the action name "copy"), or does it only require a certain relative path 
structure to match?  And how does setting that option lead to Galaxy setting 
the correct paths (local to the HPCC) when building the command line?

Our goal is to submit all heavy jobs (e.g. mappers) to the HPCC as the user who 
launches the Galaxy job.  Both the HPCC and our Galaxy instance use LDAP 
logins, so fortunately that's one wrinkle we don't have to worry about.  This 
will help all involved maintain fair per-user quota policies.  I plan to handle 
the support files (genome indices) by transferring them to the HPCC and 
rewriting the appropriate *.loc files on our Galaxy host with HPCC paths, as in 
the example below.
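
(For instance, a bowtie2_indices.loc entry keeps its tab-separated fields and 
simply swaps the path prefix; both paths below are purely illustrative:)

# before: index path on our Galaxy host
hg19	hg19	Human (hg19)	/galaxy/data/indexes/hg19/bowtie2/hg19
# after: the same entry pointing at the transferred HPCC copy
hg19	hg19	Human (hg19)	/hpcc/galaxy/indexes/hg19/bowtie2/hg19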

I appreciate your generous response to my first email, and hope to continue the 
conversation with this email.  Now, I will go RTFM for LWR. :)

Many thanks,
Eric



Re: [galaxy-dev] Managing Data Locality

2013-11-05 Thread John Chilton
Hey Eric,

I think what you are proposing would be a major development effort and
mirrors major development efforts that are ongoing.  There are sort-of ways
to do this already, with various trade-offs, and none particularly well
documented.  So before undertaking this effort I would dig into some
alternatives.

If you are using PBS, the PBS runner contains some logic for
delegating to PBS for doing this kind of thing - I have never tried
it.

https://bitbucket.org/galaxy/galaxy-central/src/default/lib/galaxy/jobs/runners/pbs.py#cl-245

It may be possible to use a specially configured handler and the
Galaxy object store to stage files to a particular mount before
running jobs - I am not sure it makes sense in this case.  It might be worth
looking into this (having the object store stage your files, instead
of solving it at the job runner level).

My recommendation, however, would be to investigate the LWR job runner.
There are a bunch of fairly recent developments to enable something
like what you are describing.  For specificity, let's say you are using
DRMAA to talk to some HPC cluster, Galaxy's file data is stored in
/galaxy/data on the Galaxy web server but not on the HPC cluster, and there
is some scratch space (/scratch) that is mounted on both the Galaxy web
server and your HPC cluster.

I would stand up an LWR (http://lwr.readthedocs.org/en/latest/) server
right beside Galaxy on your web server.  The LWR has a concept of
managers that roughly mirrors the concept of runners in Galaxy - see
the sample config for guidance on how to get it to talk with your
cluster.  It can use DRMAA, torque command-line tools, or condor at
this time (I could add new methods, e.g. a PBS library, if that would
help).
https://bitbucket.org/jmchilton/lwr/src/default/job_managers.ini.sample?at=default
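
(As a sketch of that file's shape - treat the exact section and key names as 
assumptions to verify against the sample itself:)

[manager:_default_]
type = queued_drmaa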

On the Galaxy side, I would then create a job_conf.xml file telling Galaxy to
send certain HPC-bound tools to the LWR.  Be sure to enable the LWR runner at
the top (see the advanced example config) and then add at least one LWR
destination.

<destination id="lwr" runner="lwr">
    <param id="url">http://localhost:8913/</param>
    <!-- Do not stage files by default... -->
    <param id="default_file_action">none</param>
    <!-- ...and describe per-path staging actions in this file: -->
    <param id="file_action_config">file_actions.json</param>
</destination>
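
(The corresponding runner plugin at the top of job_conf.xml looks roughly like 
the following; the load string is my reading of the LWR runner module path in 
galaxy-central of this era, so verify it against the advanced sample:)

<plugins>
    <plugin id="lwr" type="runner" load="galaxy.jobs.runners.lwr:LwrJobRunner"/>
</plugins>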

Then create a file_actions.json file in the Galaxy root directory (the
structure of this file is subject to change; the current JSON layout
doesn't feel very Galaxy-ish).

{"paths": [
{"path": "/galaxy/data", "action": "copy"}
] }
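
In other words, inputs found under /galaxy/data will be copied out to the
LWR's staging directory before the job runs, and outputs will be transferred
back to Galaxy when it finishes.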

More details on the structure of this file_actions.json file can be
found in the following changeset:
https://bitbucket.org/galaxy/galaxy-central/commits/b0b83be30136e2939a4a4f5d80dda8f8c853c0a2

I am really eager to see the LWR gain adoption and tackle tricky cases
like this, so if there is anything I can do to help, please let me know;
contributions in terms of development or documentation would be greatly
appreciated as well.

Hope this helps,
-John

On Tue, Nov 5, 2013 at 8:23 AM, Paniagua, Eric wrote:
> Dear Galaxy Developers,
>
> I administer a Galaxy instance at Cold Spring Harbor Laboratory, which 
> serves around 200 laboratory members.  While our initial hardware purchase 
> has scaled well for the last 3 years, we are finding that we can't quite keep 
> up with the rising demand for compute-intensive jobs, such as mapping.  We 
> are hesitant to consider buying more hardware to support the load, since we 
> can't expect that solution to scale.
>
> Rather, we are attempting to set up Galaxy to queue jobs (especially mappers) 
> out to the lab's HPCC to accommodate the increasing load.  While there is a 
> good number of technical challenges involved in this strategy, I am only 
> writing to ask about one: data locality.
>
> Normally, all Galaxy datasets are stored directly on the private server 
> hosting our Galaxy instance.  The HPCC cannot mount our Galaxy server's 
> storage (i.e. for the purpose of running jobs that read/write datasets) for 
> security reasons.  However, we can mount a small portion of the HPCC file 
> system on our Galaxy server.  Storage on the HPCC is at a premium, so we 
> can't afford to let newly created (or copied) datasets just sit there.  
> It follows that we need a mechanism for maintaining temporary storage in the 
> (restricted) HPCC space which allows for transfer of input datasets to the 
> HPCC (so they will be visible to jobs running there) and transfer of output 
> datasets back to persistent storage on our server.
>
> I am in the process of analyzing when/where/how exact path names are 
> substituted into tool command lines, looking for potential hooks to 
> facilitate the staging/unstaging of data before/after job execution on the 
> HPCC.  I have found a few places where I might try to insert logic for 
> handling this case.
>
> Before modifying too much of Galaxy's core code, I would like to know if 
> there is a recommended method for handling this situation and whether other 
> members of the Galaxy community have implemented fixes or workarounds for 
> this or similar data locality issues.  If you can offer either type of 
> information, I shall be m