Hi all,

This morning our cluster front-end somehow filled its swap, and the OOM killer killed sge_qmaster. Since then I have been trying to restart it, but it keeps crashing every 5 minutes or so, like this:
Jan 18 14:13:51 floquet kernel: sge_qmaster[20914]: segfault at 8 ip 00000000005e4b39 sp 00007f77bd6f8b70 error 6 in sge_qmaster[400000+27d000]

I have searched the archives and found a similar issue from 2010 ( http://arc.liv.ac.uk/pipermail/gridengine-users/2010-March/029785.html and other reports). It seems to be a qmaster bug that should already be fixed ( http://markmail.org/thread/njkqj4byiqvye67i#query:+page:1+mid:njkqj4byiqvye67i+state:results ). Nobody in those threads mentions an oom-killer crash preceding the repeating segfaults, but some report that it happens when submitting/running/finishing tightly-integrated parallel jobs. We usually don't run that kind of job on our cluster, but precisely today one of our users is running them. Even so, some of those jobs seem to have finished correctly before our first crash, so I don't know whether the two issues are related or it is mere coincidence.

How can we fix this? We are running SGE 6.2u5, the default binaries from Rocks Cluster 6.0. Should I install a newer version? Is there an sge_qmaster binary that fixes this and is compatible with 6.2u5? Is there any other way to fix it?

I have also seen some messages ( http://arc.liv.ac.uk/pipermail/gridengine-users/2010-June/030775.html ) suggesting cleaning the spool directories. Does that really work? How should I do it? Should I stop the whole cluster beforehand? Which files should I delete?

Thanks in advance,
Txema
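P.S. In case it clarifies what I have in mind, here is roughly what I was planning for the spool cleanup. This is only a sketch: the paths are my assumptions from the default Rocks 6 layout (SGE_ROOT=/opt/gridengine, cell "default"), and the potentially destructive steps are left commented out until someone confirms this is sane:

```shell
#!/bin/sh
# Sketch of the qmaster spool backup/cleanup I am considering.
# ASSUMPTIONS: default Rocks 6 paths; adjust SGE_ROOT/SGE_CELL if yours differ.
SGE_ROOT=${SGE_ROOT:-/opt/gridengine}
SGE_CELL=${SGE_CELL:-default}
QMASTER_SPOOL="$SGE_ROOT/$SGE_CELL/spool/qmaster"
BACKUP="/tmp/qmaster-spool-$(date +%Y%m%d).tar.gz"

echo "qmaster spool:       $QMASTER_SPOOL"
echo "backup would go to:  $BACKUP"

# 1) Shut qmaster down cleanly (if it is up at all):
#      qconf -km
# 2) Back up the entire spool before deleting anything:
#      tar czf "$BACKUP" -C "$(dirname "$QMASTER_SPOOL")" qmaster
# 3) Inspect the per-job spool files, looking for the suspect parallel jobs:
#      ls "$QMASTER_SPOOL/jobs"
# 4) After removing whatever turns out to be stale, restart and watch the log:
#      tail -f "$QMASTER_SPOOL/messages"
```

Does that match what people did in the 2010 thread, or did you delete something else?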
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
