Thanks William.
It's somewhat related, but not exactly the same. The parallel jobs that crash the
system are not using array tasks:
qsub -N job1 ...
qsub -N job2 -hold_jid job1 ...
qsub -N job3 -hold_jid job2 -pe threaded 1-50 ...
qsub -N job4 -hold_jid job2 -pe threaded 1-50 ...
They are single jobs and use -hold_jid, not -hold_jid_ad.
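For reference, a sketch of the distinction between the two hold types (script names are illustrative; the semantics below are standard SGE behaviour):

```shell
# Whole-job dependency: job2 starts only after ALL of job1 has finished.
qsub -N job1 run1.sh
qsub -N job2 -hold_jid job1 run2.sh

# Array-task dependency: task i of job4 may start as soon as task i of
# job3 finishes; both must be array jobs with the same task range (-t).
qsub -N job3 -t 1-50 run3.sh
qsub -N job4 -t 1-50 -hold_jid_ad job3 run4.sh
```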
Txema
On 18/01/13 16:19, William Hay wrote:
On 18 January 2013 14:41, HEREDIA GENESTAR, JOSE MARIA
<[email protected]> wrote:
Hi all,
This morning our cluster front-end somehow filled its swap and the
oom-killer decided to kill sge_qmaster. After that, I tried to
restart it, but it kept crashing like this every 5 minutes or so:
Jan 18 14:13:51 floquet kernel: sge_qmaster[20914]: segfault at 8
ip 00000000005e4b39 sp 00007f77bd6f8b70 error 6 in
sge_qmaster[400000+27d000]
I have searched the archives and found a similar issue from 2010 (
http://arc.liv.ac.uk/pipermail/gridengine-users/2010-March/029785.html
and other reports). It seems to be a qmaster bug that should be
fixed (
http://markmail.org/thread/njkqj4byiqvye67i#query:+page:1+mid:njkqj4byiqvye67i+state:results
).
Nobody in those threads mentions a previous oom-killer crash
before the repeating segfaults, but some report that this happens
when submitting/running/finishing tightly-integrated parallel
jobs. We usually don't run that kind of job in our cluster, but,
precisely today, one of our users is running them. Even so, it
seems that some of those jobs finished correctly before our
first crash, so I don't know if these are related issues or if it
is just a mere coincidence.
How can we fix this? We are using SGE 6.2u5, the default binaries
from rocks-cluster 6-0. Should I install a brand new version? Is
there any sge_qmaster binary that fixes this and is compatible
with 6.2u5? Is there any other way to fix it? I have seen some
messages (
http://arc.liv.ac.uk/pipermail/gridengine-users/2010-June/030775.html
) suggesting cleaning the spool directories. Does that really work?
How should I do it? Should I stop the whole cluster beforehand? What
files should I delete?
Thanks in advance,
Txema
could it be related to this:
https://arc.liv.ac.uk/trac/SGE/ticket/802
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users