Mark, thanks a lot for your reply!
Do you have any idea under which circumstances that happens, or which
configuration details could be responsible? Did you use tight MPI
integration (the problem never occurred before with loose MPI
integration)? Although (soft-)restarting the execds helped, it could
also be a qmaster problem, because it hit the entire cluster within a
few hours. Or maybe each execd had simply run its 200th job after some
time (which would mean that it will happen again after the next 200 jobs
on each node). I might experiment with smaller gid ranges and see
whether it happens any sooner.
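To make the "200 jobs" arithmetic concrete, here is a rough sketch of how
the number of additional group IDs in a range could be counted. It assumes
the range is written as a single START-END span; the helper name
gid_range_size and the value 20000-20199 are hypothetical illustrations,
not our actual configuration.

```shell
#!/bin/sh
# Sketch: count how many additional group IDs a gid range provides.
# Assumes a single "START-END" span; gid_range_size is a hypothetical helper.
gid_range_size() {
    start=${1%-*}   # part before the dash
    end=${1#*-}     # part after the dash
    echo $(( end - start + 1 ))
}

# Hypothetical example value; on a live cluster the real setting could be
# looked up with: qconf -sconf | grep gid_range
gid_range_size 20000-20199
```

If every finished job leaked one ID, a node would run out after that many
jobs, so a smaller range should make the failure show up sooner.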
Erik Soyez.
On Tue, 15 Mar 2011, Mark Dixon wrote:
On Tue, 15 Mar 2011, Erik Soyez wrote:
...
E-Mail:
------------------------------------------------------------------------
GE 6.2u5: Job 17765 failed
------------------------------------------------------------------------
:
:
failed assumedly before job:can not find an unused add_grp_id
Shepherd pe_hostfile:
xxxxxx208.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx207.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx205.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
xxxxxx204.xxxxx.xxxxx.xxx 4 [email protected] UNDEFINED
...
We have seen error messages like this before. There seems to be a bug where
some jobs do not get cleaned up properly. Although the job ends, the execd
still holds a GID for it, which can lead to an execd running out of GIDs.
I've been meaning to look closely enough to submit a proper bug report.
I don't know if you're seeing the same problem, but you may want to try
restarting the execd on your compute nodes with:
    service sgeexecd softstop
    service sgeexecd start
You should be able to do this while jobs are running.
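
For more than a handful of nodes, the two commands above could be generated
per host. The following is only a sketch under assumptions: restart_cmds is
a hypothetical helper, the host names are made up, and it presumes ssh
access with sufficient rights on each node. It prints the commands rather
than running them, so they can be reviewed first.

```shell
#!/bin/sh
# Sketch: print the restart commands for each execution host.
# restart_cmds is a hypothetical helper; it only prints, it runs nothing.
restart_cmds() {
    for host in "$@"; do
        echo "ssh $host 'service sgeexecd softstop && service sgeexecd start'"
    done
}

# Hypothetical host names; on a real cluster the list could come from: qconf -sel
restart_cmds node01 node02
```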
--
____________________________________________creating IT solutions
Erik Soyez science + computing ag
IT-Services teamline SRS +49 7071 9457 692
phone +49 7071 9457 583 teamline Bosch +49 7071 9457 687
www.science-computing.de teamline Porsche +49 7071 9457 686