Hi all,

This morning our cluster front-end somehow filled its swap and the
oom-killer killed sge_qmaster. Since then I have tried to restart it, but
it keeps crashing like this every 5 minutes or so:

Jan 18 14:13:51 floquet kernel: sge_qmaster[20914]: segfault at 8 ip
00000000005e4b39 sp 00007f77bd6f8b70 error 6 in sge_qmaster[400000+27d000]

I have searched the archives and found a similar issue from 2010 (
http://arc.liv.ac.uk/pipermail/gridengine-users/2010-March/029785.html and
other reports). It seems to be a qmaster bug that should already be fixed (
http://markmail.org/thread/njkqj4byiqvye67i#query:+page:1+mid:njkqj4byiqvye67i+state:results).
Nobody in those threads mentions an oom-killer crash preceding the
repeating segfaults, but some report that it happens when
submitting/running/finishing tightly-integrated parallel jobs. We usually
don't run that kind of job on our cluster, but, precisely today, one of
our users is running them. Even so, some of those jobs seem to have
finished correctly before our first crash, so I don't know whether the two
issues are related or it is just a coincidence.

How can we fix this? We are running SGE 6.2u5 with the default binaries
from Rocks Clusters 6.0. Should I install a newer version? Is there a
patched sge_qmaster binary that fixes this and is compatible with 6.2u5?
Is there any other way to fix it? I have seen some messages (
http://arc.liv.ac.uk/pipermail/gridengine-users/2010-June/030775.html )
suggesting cleaning the spool directories. Does that really work? How
should I do it? Should I stop the whole cluster beforehand? What files
should I delete?
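
To be concrete, this is roughly what I was planning to try for the spool
cleanup, pieced together from that thread. I am assuming classic spooling,
SGE_ROOT=/opt/gridengine (the Rocks default) and the "default" cell, so
the qmaster spool would live under /opt/gridengine/default/spool/qmaster;
the job id 4242 and the quarantine directory below are just placeholders
of mine. Please correct me if the paths or the procedure are wrong:

    # Stop only the qmaster; as far as I understand, jobs already
    # running on the exec hosts keep running while it is down.
    /opt/gridengine/default/common/sgemaster stop

    # Back up the whole qmaster spool before touching anything.
    cd /opt/gridengine/default/spool/qmaster
    tar czf /root/qmaster-spool-$(date +%F).tar.gz .

    # Locate the spool files of the suspect job (4242 is a made-up id);
    # the jobs/ tree hashes the id across subdirectories, so I would
    # search for it rather than guess the exact path.
    mkdir -p /root/sge-quarantine
    find jobs job_scripts -name '*4242*'

    # Move (not delete) whatever that find returns, e.g.:
    # mv jobs/00/0042/4242 /root/sge-quarantine/

    /opt/gridengine/default/common/sgemaster start

Is this more or less the right idea, or is there a safer way to do it?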

Thanks in advance,

Txema