Re: [gridengine users] Automatically kill jobs on hung machines

Neil Baker Tue, 11 Jun 2013 01:04:54 -0700

Hi Tina,

Thank you for bringing this plugin to our attention. It's simply a case of drowning under IT tasks, with grid engine monitoring being a lesser priority.

Even with this plugin in place, although the IT team will be notified about the SGE daemon dying, the staff still won't be aware. At least we will be notified when it dies, rather than waiting for users to shout at us ;)

Neil

On 10/06/2013 16:34, Tina Friedrich wrote:

Related, but not in direct answer - what sort of monitoring do you use, and why not (also) monitor for SGE daemon status? (Nagios plugin already exists).

Tina
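
A minimal sketch of the kind of check a monitoring script could make, assuming `qstat -f` marks queue instances on unreachable hosts with a `u` (unknown) in the states column. The function name and the sample output are illustrative only and are not taken from the Nagios plugin mentioned above; the column layout is an assumption and may differ between versions.

```python
# Hypothetical sample of `qstat -f` output; real output may differ by version.
SAMPLE = """\
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@node01                   BIP   1/4       0.52     lx24-amd64
    101 0.55500 sim_a      alice        r     06/10/2013 09:12:44     1
----------------------------------------------------------------------------
all.q@node02                   BIP   2/4       -NA-     lx24-amd64    au
    102 0.55500 sim_b      bob          r     06/08/2013 17:03:10     2
"""


def unknown_queue_instances(qstat_f_output):
    """Return queue instances whose states column contains 'u' (unknown)."""
    unknown = []
    for line in qstat_f_output.splitlines():
        # Skip blank lines, indented job lines, and separator lines.
        if not line or line[0].isspace() or line[0] in "-#":
            continue
        fields = line.split()
        if fields[0] == "queuename":
            continue  # header row
        # Assumed layout: name, type, slots, load, arch, then optional states.
        if len(fields) >= 6 and "u" in fields[5]:
            unknown.append(fields[0])
    return unknown
```

Calling `unknown_queue_instances(SAMPLE)` picks out `all.q@node02`; in practice the text would come from running `qstat -f` on a submit host, and the result could feed whatever alerting is already in place.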

On 10/06/13 15:57, William Hay wrote:

On 10 June 2013 15:31, Neil Baker wrote:

    Hi,

    Sorry, my last post was an HTML formatted email, so this response is in
    plain text.

    Summary:

    We are currently running Sun Grid Engine 6.1u3 (old version that seems
    very stable so we've had no need to upgrade it).

    Can we configure SGE so that when the qmaster can't speak to an
    execution host, it assumes that it has hung / died and deletes the jobs
    on the queues on that machine? Ideally we'd like to specify a timeout
    for this.

    Details:

    Occasionally the OS or physical hardware of one of our execution hosts
    hangs / fails (nothing to do with SGE).

    When this happens the jobs running on that machine also hang / fail,
    but as the qmaster can't speak to the execution host it leaves the jobs
    in a run state in the qstat output. These jobs appear to be running
    until the machine is fixed and powered back up into a working state,
    when the qmaster can speak to it again and it cleans up the failed jobs
    and removes them from the qstat output.

    Due to limited manpower in the 2-man IT team, we don't monitor whether
    the qmaster can see the grid machines, only whether they respond to pings.
    When they hang, sometimes they still respond to pings although no jobs
    are running on them, making hung machines difficult to spot. Due to the
    nature of SGE, we don't want to care about the odd machine dying, as the
    system as a whole can keep running without an individual execution host.

    However, some of our users have jobs that take almost a week to run.
    When a machine hangs, they still think their jobs are running correctly
    when they run qstat. When a week has passed and their job(s) still
    haven't finished they contact us and we see that the machine has hung
    and the user has wasted a few days (they could have resubmitted their
    job if they knew it had died).

    Can we configure SGE so that when the qmaster can't speak to an
    execution host, it assumes that it has hung / died and deletes the jobs
    on the queues on that machine? Ideally we'd like to specify a timeout
    for this.

    This way users will see that their jobs have finished and from their job
    output notice that it died.

    If this isn't possible with SGE6.1u3, is it possible with a later
    version?

    Thanks again for your help.

    Neil

Not sure what version you need, but setting reschedule_unknown in the
grid engine configuration to some value other than 0:0:0, together with
ENABLE_RESCHEDULE_KILL=true in qmaster_params, should do this.

This says that it should try to reschedule reschedulable jobs if the
host is uncontactable for a certain amount of time, and kill those it can't.
You might want to set -r n in the central sge_request file first to make
sure people don't get jobs accidentally rescheduled, which can have
adverse consequences if the job submitter didn't anticipate this. Despite
what the docs say, I have seen jobs that don't specify -r either way get
rescheduled when reschedule_unknown expires. -r n seems to stop
rescheduling fairly thoroughly though.
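
As a rough sketch of the settings described above, the relevant entries might look like the fragment below. The 10-minute timeout is only an illustrative value, the sge_request path is the usual default location and may differ on a given install, and whether ENABLE_RESCHEDULE_KILL is honoured by 6.1u3 is not confirmed here. The lines starting with `#` are annotations for the reader rather than part of the configuration.

```text
# Global cluster configuration (edited with: qconf -mconf)
reschedule_unknown   00:10:00
qmaster_params       ENABLE_RESCHEDULE_KILL=true

# Cluster-wide default request file, assumed at
# $SGE_ROOT/$SGE_CELL/common/sge_request
-r n
```

If qmaster_params already lists other parameters, the new one would be added to that list rather than replacing it.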

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
