On Tue, 25 Oct 2011 at 6:12pm, SLIM H.A. wrote:

After using GridEngine 6.1u6 for more than a year, a problem has suddenly
cropped up with the scheduler. It rapidly consumes all the available
memory in the system and can ultimately crash the server. If I stop
qmaster, wait until top shows normal memory usage, and restart it,
sge_schedd immediately claims all the memory again. I have tried
setting params profile=1 with qconf -msconf and monitoring the
scheduler messages file; the output after restarting qmaster is below.
I cannot see anything relevant, but maybe someone else has better
insight.

Does anyone know another way to investigate this "memory leak"?

I recently dealt with a similar problem on 6.1u3. I tracked it down to a single job: a 50,000-task array job whose very poorly written job script clocked in at over 32MB. Putting a hold on that job brought SGE's memory usage back down to a sane level. I then gently encouraged the user to rewrite the job script.

One way to track down the offending job or jobs is to put a hold on all queued jobs, then take the hold off in batches while watching sge_schedd's memory usage. The batch whose release makes the memory climb again contains the errant job(s).
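
The hold-and-release approach could be scripted roughly as follows. This is only a sketch under some assumptions: it is run by an SGE manager or operator, `qstat -s p -u '*'` lists the pending jobs of all users on your version, and a batch size of 50 is reasonable for your queue. The function names are made up for illustration; check the qstat, qhold and qrls man pages for your release before relying on any of the flags.

```python
import subprocess


def pending_job_ids(qstat_text):
    """Return the unique job IDs found in plain qstat output, in order."""
    ids = []
    for line in qstat_text.splitlines():
        fields = line.split()
        # Data rows start with a numeric job ID; the header row and the
        # dashed separator row do not, so they are skipped.
        if fields and fields[0].isdigit() and fields[0] not in ids:
            ids.append(fields[0])
    return ids


def batches(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run(cmd):
    """Run a command and return its stdout; raise if it exits non-zero."""
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout


def hold_then_release_in_batches(batch_size=50):
    """Hold every pending job, then release them one batch at a time."""
    # Assumption: these qstat flags select pending jobs for all users.
    ids = pending_job_ids(run(["qstat", "-s", "p", "-u", "*"]))

    # Put a hold on everything that is queued.
    for batch in batches(ids, batch_size):
        run(["qhold"] + batch)

    # Release batch by batch, pausing so the admin can check sge_schedd
    # in top before moving on to the next batch.
    for batch in batches(ids, batch_size):
        input("Releasing jobs %s to %s; press Enter to continue: "
              % (batch[0], batch[-1]))
        run(["qrls"] + batch)
```

Calling `hold_then_release_in_batches()` on a submit/admin host would walk through the queue interactively. If memory usage jumps right after a batch is released, hold that batch again and repeat with a smaller batch size to narrow it down. Note that releasing this way may also clear holds that users had placed on their own jobs.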

Good luck.

--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF