Re: [gridengine users] Automatically kill jobs on hung machines

William Hay Mon, 10 Jun 2013 07:59:14 -0700

On 10 June 2013 15:31, Neil Baker <[email protected]> wrote:


> Hi,
>
> Sorry my last post was an HTML formatted email, so this response is in
> plain text.
>
> Summary:
>
> We are currently running Sun Grid Engine 6.1u3 (old version that seems
> very stable so we've had no need to upgrade it).
>
> Can we configure SGE so that when the qmaster can't speak to an
> execution host, it assumes that it has hung / died and deletes the jobs
> on the queues on that machine? Ideally we'd like to specify a timeout
> for this.
>
> Details:
>
> Occasionally one of our execution hosts' OS or physical hardware hangs /
> fails (nothing to do with SGE).
>
> When this happens the jobs running on that machine also hang / fail,
> but as the qmaster can't speak to the execution host it leaves the jobs
> in a run state in the qstat output.  These jobs appear to be running
> until the machine is fixed and powered back up into a working state,
> when the qmaster can speak to it again and it cleans up the failed jobs
> and removes them from the qstat output.
>
> Due to limited manpower in the 2 man IT team, we don't monitor to see if
> the qmaster can see the grid machines.  Only if they respond to pings.
> When they hang, sometimes they still respond to pings although no jobs
> are running on them, making spotting hung machines difficult.  Due to the
> nature of SGE, we don't want to care about the odd machine dying, as the
> system as a whole can keep running without an individual execution host.
>
> However, some of our users have jobs that take almost a week to run.
> When a machine hangs, they still think their jobs are running correctly
> when they run qstat.  When a week has passed and their job(s) still
> haven't finished they contact us and we see that the machine has hung
> and the user has wasted a few days (they could have resubmitted their
> job if they knew it had died).
>
> Can we configure SGE so that when the qmaster can't speak to an
> execution host, it assumes that it has hung / died and deletes the jobs
> on the queues on that machine? Ideally we'd like to specify a timeout
> for this.
>
> This way users will see that their jobs have finished and from their job
> output notice that it died.
>
> If this isn't possible with SGE6.1u3, is it possible with a later version?
>
> Thanks again for your help.
>
> Neil
>
I'm not sure which version you need, but in the grid engine configuration,
setting reschedule_unknown to some value other than 0:0:0 and adding
ENABLE_RESCHEDULE_KILL=true to qmaster_params should do this.

This tells the qmaster to try to reschedule reschedulable jobs if the host
has been uncontactable for the given amount of time, and to kill those it
can't reschedule. You might want to set -r n in the central sge_request file
first, to make sure people don't get jobs accidentally rescheduled, which
can have adverse consequences if the job submitter didn't anticipate it.
Despite what the docs say, I have seen jobs that don't specify -r either way
get rescheduled when reschedule_unknown expires. -r n seems to stop
rescheduling fairly thoroughly, though.
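
For illustration, a rough sketch of what those settings might look like.
The 10 minute timeout is an arbitrary example value, not a recommendation,
and the sge_request path assumes the usual $SGE_ROOT/$SGE_CELL layout.
Check sge_conf(5) on your version to confirm that both parameters are
supported there.

```
# Global cluster configuration (edit with: qconf -mconf global)
# Example timeout of 10 minutes (hh:mm:ss); pick a value that suits your site.
reschedule_unknown   00:10:00
# If qmaster_params already lists other parameters, append to that list.
qmaster_params       ENABLE_RESCHEDULE_KILL=true

# $SGE_ROOT/$SGE_CELL/common/sge_request (cluster-wide default submit options)
# Make jobs non-rerunnable unless the submitter asks otherwise with -r y.
-r n
```

With -r n as the default, jobs on an unreachable host should be killed
rather than rescheduled once the timeout expires, unless the submitter
explicitly asked for -r y.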




_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
