On 10 June 2013 15:31, Neil Baker <[email protected]> wrote:
> Hi,
>
> Sorry, my last post was an HTML-formatted email, so this response is in
> plain text.
>
> Summary:
>
> We are currently running Sun Grid Engine 6.1u3 (an old version that seems
> very stable, so we've had no need to upgrade it).
>
> Can we configure SGE so that when the qmaster can't speak to an
> execution host, it assumes that it has hung / died and deletes the jobs
> on the queues on that machine? Ideally we'd like to specify a timeout
> for this.
>
> Details:
>
> Occasionally one of our execution hosts' OS or physical hardware hangs /
> fails (nothing to do with SGE).
>
> When this happens the jobs running on those machines also hang / fail,
> but as the qmaster can't speak to the execution host it leaves the jobs
> in a run state in the qstat output. These jobs appear to be running
> until the machine is fixed and powered back up into a working state,
> when the qmaster can speak to it again and it cleans up the failed jobs
> and removes them from the qstat output.
>
> Due to limited manpower in the 2-man IT team, we don't monitor whether
> the qmaster can see the grid machines, only whether they respond to
> pings. When they hang, sometimes they still respond to pings although no
> jobs are running on them, making hung machines difficult to spot. Due to
> the nature of SGE, we don't want to care about the odd machine dying, as
> the system as a whole can keep running without an individual execution
> host.
>
> However, some of our users have jobs that take almost a week to run.
> When a machine hangs, they still think their jobs are running correctly
> when they run qstat. When a week has passed and their job(s) still
> haven't finished they contact us and we see that the machine has hung
> and the user has wasted a few days (they could have resubmitted their
> job if they knew it had died).
>
> Can we configure SGE so that when the qmaster can't speak to an
> execution host, it assumes that it has hung / died and deletes the jobs
> on the queues on that machine? Ideally we'd like to specify a timeout
> for this.
>
> This way users will see that their jobs have finished and from their job
> output notice that it died.
>
> If this isn't possible with SGE 6.1u3, is it possible with a later
> version?
>
> Thanks again for your help.
>
> Neil

Not sure what version you need, but in the grid engine configuration,
setting reschedule_unknown to some value other than 0:0:0 and adding
ENABLE_RESCHEDULE_KILL=true to qmaster_params should do this. Together
these say that if a host is uncontactable for the given amount of time,
the qmaster should try to reschedule the jobs on it that are
reschedulable and kill those it can't.

You might want to set -r n in the central sge_request file first, to make
sure people don't get jobs rescheduled accidentally, which can have
adverse consequences if the job submitter didn't anticipate it. Despite
what the docs say, I have seen jobs that don't specify -r either way get
rescheduled when reschedule_unknown expires. -r n seems to stop
rescheduling fairly thoroughly though.
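
As a rough sketch of the settings described above (the 30-minute timeout
is only an example value, and I am assuming the usual location of the
cluster-wide default request file; check sge_conf(5) for your version
before relying on any of this), the global configuration edited with
"qconf -mconf" would contain something like:

```
reschedule_unknown   00:30:00
qmaster_params       ENABLE_RESCHEDULE_KILL=true
```

and $SGE_ROOT/$SGE_CELL/common/sge_request would gain the line:

```
-r n
```

If qmaster_params already lists other parameters, append the new one to
the existing entry rather than replacing it. Whether
ENABLE_RESCHEDULE_KILL is honoured by 6.1u3 is something I have not
verified.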
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
