"HEREDIA GENESTAR, JOSE MARIA" <[email protected]> writes:

> I have searched the archives and found a similar issue from 2010 (
> http://arc.liv.ac.uk/pipermail/gridengine-users/2010-March/029785.html and
> other reports). It seems to be a qmaster bug that should be fixed (
> http://markmail.org/thread/njkqj4byiqvye67i#query:+page:1+mid:njkqj4byiqvye67i+state:results).

If it's that bug <https://arc.liv.ac.uk/trac/SGE/ticket/789> several
responses from me should have shown up...

> Nobody in those threads mentions a previous oom-killer crash before the
> repeating segfaults, but some report that this happens when
> submitting/running/finishing tightly-integrated parallel jobs. We usually
> don't run that kind of job in our cluster, but, precisely today, one of
> our users is running them. Even so, it seems that some of those jobs
> finished correctly before our first crash, so I don't know if these are
> related issues or if it is just a mere coincidence.
>
> How can we fix this? We are using SGE 6.2u5, the default binaries from
> rocks-cluster 6-0. Should I install a brand new version?

If you install http://arc.liv.ac.uk/downloads/SGE/releases/8.1.2/ you'll
get a fix for that and hundreds of other improvements.

> Is there any
> sge_qmaster binary that fixes this and is compatible with 6.2u5?

I don't see much point in just replacing the qmaster and not upgrading,
but http://arc.liv.ac.uk/downloads/SGE/packages/RH5/ was what I was
running at one time, particularly to fix that.

-- 
Community Grid Engine:  http://arc.liv.ac.uk/SGE/
