[gridengine users] Removing qrsh from h_vmem resource limits?

Mark Dixon Mon, 26 Sep 2011 08:59:01 -0700

I've been looking at the issue where parallel jobs can be killed because they have exceeded h_vmem due to a large number of qrsh processes started by the master task.


  https://arc.liv.ac.uk/trac/SGE/ticket/694

It turns out that you can opt out of qrsh contributing to job resource limits, at least on Linux, by making qrsh remove any secondary groups in the configured gid_range (man ge_conf) for your cluster.

I have successfully tested this by writing a small wrapper program to qrsh. The downside is that it relies on a SUID privileged call to the libc function "setgroups".
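
For illustration, here is a minimal sketch of the idea behind such a wrapper. It is not the wrapper described above: it is written in Python for readability, the GID_RANGE value and all function names are hypothetical, and the range must be replaced with the gid_range configured on your own cluster. Note also that Linux ignores the SUID bit on interpreted scripts, so a real wrapper would have to be a small compiled program or otherwise be started with the privilege that setgroups requires.

```python
import os

# Assumption: an example range only. Use the gid_range from your own
# cluster configuration instead.
GID_RANGE = (20000, 20100)


def strip_gid_range(groups, gid_range):
    """Return groups with every gid inside the inclusive gid_range removed."""
    low, high = gid_range
    return [g for g in groups if not (low <= g <= high)]


def drop_tracking_groups(gid_range=GID_RANGE):
    """Remove secondary groups in gid_range from the current process.

    os.setgroups() needs privilege (root or CAP_SETGID), which is why a
    wrapper of this kind has to be privileged.
    """
    kept = strip_gid_range(os.getgroups(), gid_range)
    os.setgroups(kept)
    return kept


def exec_qrsh(argv, qrsh_path="qrsh"):
    """Drop the tracking groups, give up privilege, then exec qrsh.

    qrsh_path is a hypothetical default; point it at the real qrsh binary.
    """
    drop_tracking_groups()
    # Return to the real gid and uid before running anything else, so that
    # qrsh itself runs without the extra privilege.
    os.setgid(os.getgid())
    os.setuid(os.getuid())
    os.execvp(qrsh_path, [qrsh_path] + list(argv))
```

A wrapper built this way would call exec_qrsh(sys.argv[1:]) as its only action. Keeping the filtering in a separate pure function makes it possible to check the range handling without any privilege.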

Has anyone already done this (by modifying qrsh directly or using a wrapper) on a cluster? How have you got on?

Other than the security aspects surrounding a SUID binary, I can see that the big issue is the potential of running out of memory on the compute node if there are a very large number of compute nodes in the job.

I could be persuaded to write a patch to make this a configurable option, but would prefer first to have some discussion on what people on this list think the correct behaviour would be...


Thanks,

Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : [email protected]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
