Thanks William.

It seems somewhat related, but not exactly. The parallel jobs that crash the system are not using array tasks:

qsub -N job1 ...
qsub -N job2 -hold_jid job1 ...
qsub -N job3 -hold_jid job2 -pe threaded 1-50 ...
qsub -N job4 -hold_jid job2 -pe threaded 1-50 ...

They are single jobs using -hold_jid, not -hold_jid_ad.
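For contrast, -hold_jid_ad (which we are not using) creates a per-task dependency between array jobs: task N of the dependent job waits only for task N of its predecessor, instead of waiting for the whole predecessor job as -hold_jid does. A hypothetical sketch, with illustrative job names, scripts, and task counts:

```shell
# Submit an array job with 50 tasks (-t 1-50).
qsub -N arr1 -t 1-50 script1.sh

# -hold_jid_ad sets an array-hold: task N of arr2 may start
# as soon as task N of arr1 finishes, rather than waiting for
# all of arr1 as a plain -hold_jid would.
qsub -N arr2 -t 1-50 -hold_jid_ad arr1 script2.sh
```

This is only a CLI fragment for comparison; it requires a live SGE cell to run.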


Txema

On 18/01/13 16:19, William Hay wrote:




On 18 January 2013 14:41, HEREDIA GENESTAR, JOSE MARIA <[email protected]> wrote:

    Hi all,

    This morning our cluster front-end somehow filled its swap and the
    oom-killer decided to kill sge_qmaster. After that, I have tried
    to restart it, but it kept crashing like this every 5 minutes or so:

    Jan 18 14:13:51 floquet kernel: sge_qmaster[20914]: segfault at 8
    ip 00000000005e4b39 sp 00007f77bd6f8b70 error 6 in
    sge_qmaster[400000+27d000]

    I have searched the archives and found a similar issue from 2010 (
    http://arc.liv.ac.uk/pipermail/gridengine-users/2010-March/029785.html
    and other reports). It seems to be a qmaster bug that should be
    fixed (
    http://markmail.org/thread/njkqj4byiqvye67i#query:+page:1+mid:njkqj4byiqvye67i+state:results
    ).
    Nobody in those threads mentions a previous oom-killer crash
    before the repeating segfaults, but some report that this happens
    when submitting/running/finishing tightly-integrated parallel
    jobs. We usually don't run that kind of job on our cluster, but,
    precisely today, one of our users is running them. Even so, it
    seems that some of those jobs finished correctly before our
    first crash, so I don't know whether these issues are related or
    it is just a coincidence.

    How can we fix this? We are using SGE 6.2u5, the default binaries
    from rocks-cluster 6-0. Should I install a brand new version? Is
    there any sge_qmaster binary that fixes this and is compatible
    with 6.2u5? Is there any other way to fix it? I have seen some
    messages (
    http://arc.liv.ac.uk/pipermail/gridengine-users/2010-June/030775.html
    ) suggesting cleaning the spool directories. Does that really
    work? How should I do it? Should I stop the whole cluster
    beforehand? What files should I delete?

    Thanks in advance,

    Txema

could it be related to this:
https://arc.liv.ac.uk/trac/SGE/ticket/802




_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
