On Tue, 15 Mar 2011, Erik Soyez wrote: ...
E-Mail:
------------------------------------------------------------------------
GE 6.2u5: Job 17765 failed
------------------------------------------------------------------------
:
:
failed assumedly before job:can not find an unused add_grp_id
Shepherd pe_hostfile:
xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
...
We have seen error messages like this before. There seems to be a bug where some jobs do not get cleaned up properly: although the job ends, the execd still holds a GID for it, so over time an execd can run out of GIDs to hand to new jobs. I've been meaning to look at it closely enough to submit a proper bug report.
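For reference, the pool of GIDs an execd can hand out is, as far as I understand it, set by the gid_range parameter in the cluster configuration, so a small range makes leaked GIDs bite sooner. The following is only a rough sketch for counting how many GIDs a given gid_range provides; the helper name gid_range_size is made up for illustration, and it assumes the configuration text has a line of the form "gid_range 20000-20100" (ranges may be comma-separated).

```shell
# Hypothetical helper: count the GIDs covered by a gid_range setting.
# Reads configuration text on stdin, e.g. the output of `qconf -sconf`.
gid_range_size() {
    awk '$1 == "gid_range" {
        n = split($2, parts, ",")          # comma-separated list of ranges
        total = 0
        for (i = 1; i <= n; i++) {
            if (split(parts[i], b, "-") == 2)
                total += b[2] - b[1] + 1   # inclusive range "low-high"
            else
                total += 1                 # a single GID
        }
        print total
    }'
}

# Usage on a live cluster (assumes qconf is on the PATH):
#   qconf -sconf | gid_range_size
```

If the number printed is close to the number of jobs a node normally runs at once, a few leaked GIDs would be enough to trigger the error above.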
I don't know if you're seeing the same problem, but you may want to try restarting the execd on your compute nodes with:
    service sgeexecd softstop
    service sgeexecd start

You should be able to do this while jobs are running.

Mark
-- 
-----------------------------------------------------------------
Mark Dixon                       Email    : [email protected]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
