On Mar 13, 2012, at 6:59 AM, David Matthews wrote:

> Hi,
> 
> We emailed previously about possible memory leaks in our installation of 
> Galaxy here on the HPC at Bristol. We can run Galaxy just fine on our login 
> node, but when we integrate it into the cluster using the pbs job runner the 
> whole thing falls over - almost certainly due to a memory leak. In essence, 
> every attempt to submit a TopHat job (with 2x5GB paired-end reads against the 
> full human genome) results in the whole thing falling over - but not when 
> Galaxy is restricted to the login node. 
> We saw that Nate responded to Todd Oakley about a week ago saying that there 
> is a memory leak in libtorque or pbs_python when using the pbs job runner. 
> Have there been any developments on this?
> 
> Best Wishes,
> David.

Hi David,

I am almost certain that the problem you are having with TopHat is not due to 
the same leak, since that leak is slow rather than an immediate spike.  Before we go any 
further, in reading back over our past conversation about this problem, I 
noticed that I never asked whether you've set `set_metadata_externally = True` 
in your Galaxy config.  If not, this is almost certainly the cause of the 
problem.
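
For reference, `set_metadata_externally` is just a line in your Galaxy config 
file (universe_wsgi.ini by default).  A minimal sketch, assuming the stock 
`[app:main]` section name:

    [app:main]
    # Run the metadata-setting step via the external set_metadata tool instead
    # of inside the long-running Galaxy server process.
    set_metadata_externally = True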

If you're already setting metadata externally, then answers to a few of the 
questions I asked last time (or any findings from your HPC staff), plus a few 
new things to try, would help in figuring out why your TopHat jobs still 
crash:

1. Create a separate job runner and web frontend so we can be sure that the job 
running portion is the memory culprit:

    http://wiki.g2.bx.psu.edu/Admin/Config/Performance/Web%20Application%20Scaling

  You would not need any of the load balancing config; just start a single web 
process and a single runner process (a rough config sketch follows this list). 
From reading your prior email I believe you have a proxy server, so as long as 
you start the web process on the same port as your previous Galaxy server, no 
change would be needed to your proxy configuration.

2. Set `use_heartbeat = True` in the config file of whichever process is 
consuming all of the memory.

3. Does the MemoryError appear in the log after Galaxy has noticed that the job 
has finished on the cluster (`(<id>/<pbs id>) PBS job has left queue`), but 
before the job post-processing is finished (`job <id> ended`)?

4. Does the MemoryError appear regardless of whether anyone accesses the web 
interface?
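
For 1 and 2, here's a very rough sketch of the kind of config the wiki page 
above describes.  The file names, section layout, and ports below are my own 
guesses for illustration only - the wiki page is authoritative, and option 
names may differ between releases:

    # --- config for the web process (e.g. universe_wsgi.webapp.ini) ---
    [server:main]
    use = egg:Paste#http
    port = 8080                       # the port your proxy already points at

    [app:main]
    enable_job_running = False        # the web process should not run jobs
    track_jobs_in_database = True     # hand jobs to the runner via the database

    # --- config for the runner process (e.g. universe_wsgi.runner.ini) ---
    [server:main]
    use = egg:Paste#http
    port = 8090                       # never exposed through the proxy

    [app:main]
    enable_job_running = True
    track_jobs_in_database = True
    use_heartbeat = True              # item 2: enable in whichever process leaks

You would then start each process pointed at its own config file (with paster 
serve, or run.sh adapted as the wiki page describes).  With the heartbeat on, 
Galaxy periodically writes out what each thread is doing to a heartbeat log, 
which should show where the offending process is when its memory starts to grow.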

There is another memory consumption problem we'll look at soon, which occurs 
when the job runner reads the metadata files written by the external 
set_metadata tool.  If the output dataset(s) have an extremely large number of 
columns, this can cause a very large, nearly immediate memory spike when job 
post-processing begins, even if the output file itself is relatively small.

--nate

> 
> __________________________________
> Dr David A. Matthews
> 
> Senior Lecturer in Virology
> Room E49
> Department of Cellular and Molecular Medicine,
> School of Medical Sciences
> University Walk,
> University of Bristol
> Bristol.
> BS8 1TD
> U.K.
> 
> Tel. +44 117 3312058
> Fax. +44 117 3312091
> 
> d.a.matth...@bristol.ac.uk
> 


___________________________________________________________
Please keep all replies on the list by using "reply all"
in your mail client.  To manage your subscriptions to this
and other Galaxy lists, please use the interface at:

  http://lists.bx.psu.edu/
