Hi,

Summary:

We are currently running Sun Grid Engine 6.1u3 (an old version, but it has been very stable, so we've had no need to upgrade it).

Can we configure SGE so that when the qmaster can't speak to an execution host, it assumes that it has hung / died and deletes the jobs on the queues on that machine?  Ideally we'd like to specify a timeout for this.

Details:

Occasionally the OS or physical hardware of one of our execution hosts hangs or fails (nothing to do with SGE).

When this happens, the jobs running on that machine also hang or fail, but because the qmaster can't speak to the execution host, it leaves the jobs in the running state in the qstat output.  These jobs appear to be running until the machine is fixed and powered back up into a working state, at which point the qmaster can speak to it again, cleans up the failed jobs and removes them from the qstat output.

Due to limited manpower in our two-person IT team, we don't monitor whether the qmaster can see the grid machines, only whether they respond to pings.  When they hang, they sometimes still respond to pings even though no jobs are running on them, which makes hung machines difficult to spot.  Given the nature of SGE, we don't want to have to care about the odd machine dying, as the system as a whole can keep running without an individual execution host.

However, some of our users have jobs that take almost a week to run.  When a machine hangs, qstat still shows their jobs as running, so they assume all is well.  When a week has passed and their jobs still haven't finished, they contact us, we find that the machine has hung, and the user has wasted a few days (they could have resubmitted their jobs had they known they had died).
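
As a stopgap, we could presumably script a check along the following lines and run it from cron.  This is only a rough sketch, not something we have deployed.  It assumes that the qmaster marks queue instances on an unreachable host with the "u" (unknown) state in the last column of "qstat -f", and that the columns are laid out as queuename, qtype, used/tot., load_avg, arch, states.  The function names and the sample output (hosts, job IDs, users, dates) are made up for illustration.

```python
import subprocess


def jobs_on_unreachable_hosts(qstat_f_output):
    """Map each queue instance in 'u' (unknown) state to the job IDs listed under it.

    Assumed layout of `qstat -f` queue lines (may differ between SGE versions):
        queuename  qtype  used/tot.  load_avg  arch  [states]
    """
    result = {}
    current = None  # queue instance currently being read, only if it is in 'u' state
    for line in qstat_f_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        if not line[0].isspace():
            # Header, separator, '#' banner or queue-instance line:
            # any of these ends the previous queue's job list.
            current = None
            if "@" in fields[0]:
                states = fields[5] if len(fields) > 5 else ""
                if "u" in states:
                    current = fields[0]
                    result[current] = []
        elif current is not None and fields[0].isdigit():
            # Indented job line belonging to an unreachable queue instance.
            result[current].append(fields[0])
    return result


def live_report():
    """Run qstat on a real cluster (not called here); -u "*" lists all users' jobs."""
    out = subprocess.run(["qstat", "-f", "-u", "*"],
                         capture_output=True, text=True, check=True).stdout
    return jobs_on_unreachable_hosts(out)


# Invented sample output, for illustration only.
SAMPLE = """\
queuename                      qtype used/tot. load_avg arch          states
----------------------------------------------------------------------------
all.q@node01                   BIP   1/4       0.52     lx24-amd64
    101 0.55500 sim_a      alice        r     03/01/2012 10:00:00     1
----------------------------------------------------------------------------
all.q@node02                   BIP   2/4       -NA-     lx24-amd64    au
    102 0.55500 sim_b      bob          r     03/01/2012 10:05:00     1
    103 0.55500 sim_c      carol        r     03/01/2012 10:06:00     1

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
    104 0.00000 sim_d      dave         qw    03/01/2012 11:00:00     1
"""

print(jobs_on_unreachable_hosts(SAMPLE))  # {'all.q@node02': ['102', '103']}
```

Something like this would at least let us email the owners of the affected jobs, but it is a workaround for the missing behaviour rather than a replacement for it, hence the question below.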

Can we configure SGE so that when the qmaster can't speak to an execution host, it assumes that it has hung / died and deletes the jobs on the queues on that machine?  Ideally we'd like to specify a timeout for this.

This way users will see that their jobs have finished and from their job output notice that it died.

If this isn't possible with SGE 6.1u3, is it possible with a later version?

Thanks again for your help.

Neil







