On 18 January 2013 14:41, HEREDIA GENESTAR, JOSE MARIA <
[email protected]> wrote:

>  Hi all,
>
> This morning our cluster front-end somehow filled its swap and the
> oom-killer decided to kill sge_qmaster. After that, I tried to restart
> it, but it kept crashing like this every 5 minutes or so:
>
> Jan 18 14:13:51 floquet kernel: sge_qmaster[20914]: segfault at 8 ip
> 00000000005e4b39 sp 00007f77bd6f8b70 error 6 in sge_qmaster[400000+27d000]
>
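Before replacing anything, it may be worth resolving that instruction
pointer to a function name so you know which code path is dying. A rough
sketch only -- the binary path is a guess, so adjust it to wherever your
sge_qmaster actually lives, and it assumes the Rocks binary is not stripped:

  # The qmaster binary is loaded at 0x400000 (non-PIE), so the ip from the
  # kernel line can be fed to addr2line as-is:
  addr2line -f -C -e $SGE_ROOT/bin/lx26-amd64/sge_qmaster 0x5e4b39

  # If a core file was written, a full backtrace is even more useful:
  gdb $SGE_ROOT/bin/lx26-amd64/sge_qmaster /path/to/core -ex bt -ex quit

That should at least tell you whether you are hitting the same spot as the
threads you found.
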
> I have searched the archives and found a similar issue from 2010 (
> http://arc.liv.ac.uk/pipermail/gridengine-users/2010-March/029785.html and
> other reports). It seems to be a qmaster bug that should be fixed (
> http://markmail.org/thread/njkqj4byiqvye67i#query:+page:1+mid:njkqj4byiqvye67i+state:results).
> Nobody in those threads mentions a previous oom-killer crash before the
> repeating segfaults, but some report that this happens when
> submitting/running/finishing tightly-integrated parallel jobs. We usually
> don't run that kind of job in our cluster, but, precisely today, one of
> our users is running them. Even so, it seems that some of those jobs
> finished correctly before our first crash, so I don't know whether these
> are related issues or it is just a mere coincidence.
>
> How can we fix this? We are using SGE 6.2u5, the default binaries from
> rocks-cluster 6-0. Should I install a brand new version? Is there any
> sge_qmaster binary that fixes this and is compatible with 6.2u5? Is there
> any other way to fix it? I have seen some messages (
> http://arc.liv.ac.uk/pipermail/gridengine-users/2010-June/030775.html )
> suggesting cleaning the spool directories. Does that really work? How
> should I do it? Should I stop the whole cluster beforehand? What files
> should I delete?
>
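I can't promise the spool cleanup is the right fix, but what those threads
describe boils down to the qmaster segfaulting while re-reading a corrupt
job record from its spool at startup, so moving the job spool aside lets it
come up clean. A rough sketch, assuming classic (file-based) spooling under
the default cell -- the paths and init script name may differ on your
install, and note this drops the qmaster's record of pending and running
jobs, so users would have to resubmit:

  # Only the qmaster needs to be down; execds and running jobs can stay up.
  SPOOL=$SGE_ROOT/default/spool/qmaster

  /etc/init.d/sgemaster stop            # init script name varies per install

  # Back everything up, then move the job records aside instead of deleting:
  cp -a $SPOOL $SPOOL.bak.$(date +%F)
  mv $SPOOL/jobs        $SPOOL/jobs.broken
  mv $SPOOL/job_scripts $SPOOL/job_scripts.broken
  mkdir $SPOOL/jobs $SPOOL/job_scripts  # recreate them owned by the SGE admin user

  /etc/init.d/sgemaster start
  tail -f $SPOOL/messages               # watch for errors on startup

If you use BDB spooling instead of classic spooling the layout is completely
different, so check the spooling_method in your bootstrap file first.
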
> Thanks in advance,
>
> Txema
>
Could it be related to this: https://arc.liv.ac.uk/trac/SGE/ticket/802 ?
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
