On 18 January 2013 14:41, HEREDIA GENESTAR, JOSE MARIA <[email protected]> wrote:
> Hi all,
>
> This morning our cluster front-end somehow filled its swap and the
> oom-killer decided to kill sge_qmaster. After that, I tried to restart
> it, but it kept crashing like this every 5 minutes or so:
>
> Jan 18 14:13:51 floquet kernel: sge_qmaster[20914]: segfault at 8 ip
> 00000000005e4b39 sp 00007f77bd6f8b70 error 6 in sge_qmaster[400000+27d000]
>
> I have searched the archives and found a similar issue from 2010
> (http://arc.liv.ac.uk/pipermail/gridengine-users/2010-March/029785.html
> and other reports). It seems to be a qmaster bug that should be fixed
> (http://markmail.org/thread/njkqj4byiqvye67i#query:+page:1+mid:njkqj4byiqvye67i+state:results).
> Nobody in those threads mentions an oom-killer crash before the
> repeating segfaults, but some report that this happens when
> submitting/running/finishing tightly-integrated parallel jobs. We
> usually don't run that kind of job on our cluster, but, precisely
> today, one of our users is running them. Even so, it seems that some
> of those jobs finished correctly before our first crash, so I don't
> know whether these issues are related or it is just a coincidence.
>
> How can we fix this? We are using SGE 6.2u5, the default binaries from
> Rocks Cluster 6.0. Should I install a brand new version? Is there any
> sge_qmaster binary that fixes this and is compatible with 6.2u5? Is
> there any other way to fix it? I have seen some messages
> (http://arc.liv.ac.uk/pipermail/gridengine-users/2010-June/030775.html)
> suggesting cleaning the spool directories. Does that really work? How
> should I do it? Should I stop the whole cluster beforehand? What files
> should I delete?
>
> Thanks in advance,
>
> Txema

could it be related to this: https://arc.liv.ac.uk/trac/SGE/ticket/802
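On the spool-cleaning question: the pattern people usually describe (hedged; I haven't verified it against your install) is to stop the qmaster, archive the entire qmaster spool before deleting anything, then remove only the suspect job state and restart. On a real 6.2u5 install the spool is wherever `qmaster_spool_dir` in `$SGE_ROOT/$SGE_CELL/common/bootstrap` points; the sketch below builds a throwaway stand-in spool so the backup step itself is runnable anywhere:

```shell
#!/bin/sh
# Sketch of "archive the spool before cleaning it". The real spool
# path comes from qmaster_spool_dir in $SGE_ROOT/$SGE_CELL/common/bootstrap;
# here we fabricate a throwaway copy under mktemp so this runs anywhere.
WORK=$(mktemp -d)
SPOOL="$WORK/qmaster"
mkdir -p "$SPOOL/jobs" "$SPOOL/job_scripts"
echo "dummy job record" > "$SPOOL/jobs/1"

# 1. Stop the qmaster first so the spool is quiescent
#    (e.g. "qconf -km", or the sgemaster init script on Rocks).
# 2. Archive the whole spool before deleting anything:
STAMP=$(date +%Y%m%d-%H%M%S)
BACKUP="$WORK/qmaster-spool-$STAMP.tar.gz"
tar -czf "$BACKUP" -C "$WORK" qmaster

# 3. Only after the archive is verified should suspect state be removed
#    (stale entries under jobs/ and job_scripts/ are the usual suspects)
#    and the qmaster restarted. Verify the archive is readable:
tar -tzf "$BACKUP"
```

The point of the `-C "$WORK" qmaster` form is that the archive holds relative paths, so it can be restored into place with a single `tar -xzf` from the spool's parent directory if the cleanup makes things worse.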
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
