On Tue, 15 Mar 2011, Erik Soyez wrote:
...
E-Mail:
------------------------------------------------------------------------
GE 6.2u5: Job 17765 failed
------------------------------------------------------------------------
                :
                :
failed assumedly before job:can not find an unused add_grp_id
Shepherd pe_hostfile:
xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
...

We have seen error messages like this before. There seems to be a bug where some jobs are not cleaned up properly: although the job ends, the execd still holds a GID for it, and over time this can leave an execd with no free GIDs. I have been meaning to look into it closely enough to submit a proper bug report.

I don't know if you're seeing the same problem, but you may want to try restarting the execd on your compute nodes with:

  service sgeexecd softstop
  service sgeexecd start

You should be able to do this while jobs are running, since softstop only shuts down the execd and leaves the running jobs alone.
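
If you want a rough idea of how many GIDs each execd has to hand out, the small shell sketch below counts the entries in a gid_range value. It assumes the pool is the gid_range parameter in the cluster configuration and that the value is a comma-separated list of m-n ranges; the function name gid_range_size is made up for illustration, and the numbers are example values, not taken from your cluster.

```shell
# Sketch: count how many GIDs a gid_range value provides.
# Assumes a comma-separated list of "m-n" ranges (or single numbers).
gid_range_size() {
    total=0
    old_ifs=$IFS
    IFS=,
    for part in $1; do
        low=${part%-*}     # text before the dash
        high=${part#*-}    # text after the dash
        total=$((total + high - low + 1))
    done
    IFS=$old_ifs
    echo "$total"
}

# Example with an illustrative value:
gid_range_size 20000-20100    # prints 101

# On a live cluster the value could be read with something like:
#   gid_range_size "$(qconf -sconf | awk '$1 == "gid_range" {print $2}')"
```

If the result is not much larger than the number of jobs a node can run at once, a few leaked GIDs would be enough to trigger the error above.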

Mark
--
-----------------------------------------------------------------
Mark Dixon                       Email    : [email protected]
HPC/Grid Systems Support         Tel (int): 35429
Information Systems Services     Tel (ext): +44(0)113 343 5429
University of Leeds, LS2 9JT, UK
-----------------------------------------------------------------
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
