Thanks William.

It seems somewhat related, but not exactly. The parallel jobs that crash the system are not using array tasks:

qsub -N job1 ...
qsub -N job2 -hold_jid job1 ...
qsub -N job3 -hold_jid job2 -pe threaded 1-50 ...
qsub -N job4 -hold_jid job2 -pe threaded 1-50 ...

They are single jobs using -hold_jid, not -hold_jid_ad.
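For contrast, -hold_jid_ad (which we are not using) creates a per-task dependency between array jobs: task N of the dependent job waits only for task N of its predecessor, instead of waiting for the whole predecessor job as -hold_jid does. A hypothetical sketch, with illustrative job names, scripts, and task counts:

```shell
# Submit an array job with 50 tasks (-t 1-50).
qsub -N arr1 -t 1-50 script1.sh

# -hold_jid_ad sets an array-hold: task N of arr2 may start
# as soon as task N of arr1 finishes, rather than waiting for
# all of arr1 as a plain -hold_jid would.
qsub -N arr2 -t 1-50 -hold_jid_ad arr1 script2.sh
```

This is only a CLI fragment for comparison; it requires a live SGE cell to run.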


Txema

On 18/01/13 16:19, William Hay wrote:




On 18 January 2013 14:41, HEREDIA GENESTAR, JOSE MARIA <[email protected]> wrote:

    Hi all,

    This morning our cluster front-end somehow filled its swap and the
    oom-killer decided to kill sge_qmaster. After that, I have tried
    to restart it, but it kept crashing like this every 5 minutes or so:

    Jan 18 14:13:51 floquet kernel: sge_qmaster[20914]: segfault at 8
    ip 00000000005e4b39 sp 00007f77bd6f8b70 error 6 in
    sge_qmaster[400000+27d000]

    I have searched the archives and found a similar issue from 2010 (
    http://arc.liv.ac.uk/pipermail/gridengine-users/2010-March/029785.html
    and other reports). It seems to be a qmaster bug that should be
    fixed (
    http://markmail.org/thread/njkqj4byiqvye67i#query:+page:1+mid:njkqj4byiqvye67i+state:results
    ).
    Nobody in those threads mentions a previous oom-killer crash
    before the repeating segfaults, but some report that this happens
    when submitting/running/finishing tightly-integrated parallel
    jobs. We usually don't run that kind of job on our cluster, but,
    precisely today, one of our users is running them. Even so, it
    seems that some of those jobs finished correctly before our
    first crash, so I don't know whether these issues are related or
    it is just a coincidence.

    How can we fix this? We are using SGE 6.2u5, the default binaries
    from rocks-cluster 6-0. Should I install a brand new version? Is
    there any sge_qmaster binary that fixes this and is compatible
    with 6.2u5? Is there any other way to fix it? I have seen some
    messages (
    http://arc.liv.ac.uk/pipermail/gridengine-users/2010-June/030775.html
    ) suggesting cleaning the spool directories. Does that really
    work? How should I do it? Should I stop the whole cluster
    beforehand? What files should I delete?

    Thanks in advance,

    Txema

could it be related to this:
https://arc.liv.ac.uk/trac/SGE/ticket/802




_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
