Thanks for the answer, Dave.

I tried your old RPMs, but they didn't help. The problem persists and the qmaster still crashes.

I'll see if I can schedule a maintenance stop to upgrade the system to a newer version.

Txema

On 18/01/13 17:26, Dave Love wrote:
"HEREDIA GENESTAR, JOSE MARIA" <[email protected]> writes:

I have searched the archives and found a similar issue from 2010 (
http://arc.liv.ac.uk/pipermail/gridengine-users/2010-March/029785.html and
other reports). It seems to be a qmaster bug that should be fixed (
http://markmail.org/thread/njkqj4byiqvye67i#query:+page:1+mid:njkqj4byiqvye67i+state:results).
If it's that bug <https://arc.liv.ac.uk/trac/SGE/ticket/789> several
responses from me should have shown up...

Nobody in those threads mentions a previous oom-killer crash before the
repeating segfaults, but some report that this happens when
submitting/running/finishing tightly integrated parallel jobs. We usually
don't run that kind of job in our cluster, but, precisely today, one of
our users is running them. Even so, it seems that some of those jobs
finished correctly before our first crash, so I don't know if these are
related issues or if it is just a mere coincidence.

How can we fix this? We are using SGE 6.2u5, the default binaries from
rocks-cluster 6-0. Should I install a brand new version?
If you install http://arc.liv.ac.uk/downloads/SGE/releases/8.1.2/ you'll
get a fix for that and hundreds of other improvements.

Is there any
sge_qmaster binary that fixes this and is compatible with 6.2u5?
I don't see much point in just replacing the qmaster and not upgrading,
but http://arc.liv.ac.uk/downloads/SGE/packages/RH5/ was what I was
running at one time, particularly to fix that.


_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
