Hi Tina,
Thank you for bringing this plugin to our attention. It's simply a case of
us drowning under IT tasks, with grid engine monitoring being a lesser
priority.
Even with this plugin in place, the IT team will be notified when the SGE
daemon dies, but the staff still won't be aware. At least we will be
notified when it dies, rather than waiting for users to shout at us ;)
Neil
On 10/06/2013 16:34, Tina Friedrich wrote:
Related, but not in direct answer: what sort of monitoring do you use, and
why not (also) monitor for SGE daemon status? (A Nagios plugin already
exists.)
Tina
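As a rough illustration of the kind of check being suggested (this is not the existing Nagios plugin, which is not shown in this thread), the sketch below scans qhost-style output for hosts whose load is reported as "-". It assumes that qhost prints "-" in the LOAD column for hosts the qmaster cannot reach; the function name, host names and sample text are made up for the example.

```python
def unreachable_hosts(qhost_output):
    """Return host names whose LOAD column is '-' in qhost-style output.

    Assumption: the columns are HOSTNAME ARCH NCPU LOAD ..., and a host the
    qmaster cannot contact shows '-' instead of a load value.
    """
    hosts = []
    for line in qhost_output.splitlines():
        fields = line.split()
        # Skip blank lines and the dashed separator (fewer than 4 fields),
        # the header row, and the 'global' pseudo-host.
        if len(fields) < 4 or fields[0] in ("HOSTNAME", "global"):
            continue
        if fields[3] == "-":
            hosts.append(fields[0])
    return hosts


# Illustrative sample only; real input would be captured from the qhost command.
SAMPLE = """\
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
node01                  lx24-amd64      8  0.01   15.7G    1.2G    2.0G     0.0
node02                  lx24-amd64      8     -   15.7G       -    2.0G       -
"""

if __name__ == "__main__":
    print(unreachable_hosts(SAMPLE))
```

A monitoring wrapper could raise an alert whenever the returned list is non-empty, which would also catch machines that still answer pings but whose execution daemon is no longer talking to the qmaster.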
On 10/06/13 15:57, William Hay wrote:
On 10 June 2013 15:31, Neil Baker wrote:
Hi,
Sorry, my last post was an HTML-formatted email, so this response is in
plain text.
Summary:
We are currently running Sun Grid Engine 6.1u3 (an old version that seems
very stable, so we've had no need to upgrade it).
Can we configure SGE so that when the qmaster can't speak to an execution
host, it assumes that it has hung / died and deletes the jobs on the queues
on that machine? Ideally we'd like to specify a timeout for this.
Details:
Occasionally the OS or physical hardware of one of our execution hosts
hangs / fails (nothing to do with SGE).
When this happens the jobs running on that machine also hang / fail, but as
the qmaster can't speak to the execution host it leaves the jobs in a
running state in the qstat output. These jobs appear to be running until the
machine is fixed and powered back up into a working state, at which point
the qmaster can speak to it again, cleans up the failed jobs and removes
them from the qstat output.
Due to limited manpower in the two-man IT team, we don't monitor whether the
qmaster can see the grid machines, only whether they respond to pings.
When they hang, they sometimes still respond to pings although no jobs are
running on them, which makes spotting hung machines difficult. Due to the
nature of SGE, we don't want to have to care about the odd machine dying,
as the system as a whole can keep running without an individual execution
host.
However, some of our users have jobs that take almost a week to run.
When a machine hangs, they still think their jobs are running correctly
when they run qstat. When a week has passed and their job(s) still haven't
finished, they contact us and we see that the machine has hung and the user
has wasted a few days (they could have resubmitted their job if they had
known it had died).
Can we configure SGE so that when the qmaster can't speak to an execution
host, it assumes that it has hung / died and deletes the jobs on the queues
on that machine? Ideally we'd like to specify a timeout for this.
This way users will see that their jobs have finished and notice from their
job output that they died.
If this isn't possible with SGE 6.1u3, is it possible with a later version?
Thanks again for your help.
Neil
Not sure what version you need, but in the grid engine config, setting
reschedule_unknown to some value other than 0:0:0 and
ENABLE_RESCHEDULE_KILL=true in qmaster_params should do this.
This says that it should try to reschedule reschedulable jobs if the host
is uncontactable for a certain amount of time, and kill those it can't.
You might want to set -r n in the central sge_request file first to make
sure people don't get jobs accidentally rescheduled, which can have adverse
consequences if the job submitter didn't anticipate this. Despite what the
docs say, I have seen jobs that don't specify -r either way get rescheduled
when reschedule_unknown expires. -r n seems to stop rescheduling fairly
thoroughly though.
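For illustration only, the settings described above might look roughly like the fragment below. The ten-minute timeout is an arbitrary example value, not a recommendation, and the location of the central sge_request file is assumed to be the usual $SGE_ROOT/$SGE_CELL/common directory. Lines starting with # are annotations for the reader. Whether ENABLE_RESCHEDULE_KILL is honoured by 6.1u3 is not confirmed in this thread.

```
# Global cluster configuration (edited with: qconf -mconf)
reschedule_unknown   00:10:00
qmaster_params       ENABLE_RESCHEDULE_KILL=true

# Central default request file ($SGE_ROOT/$SGE_CELL/common/sge_request)
-r n
```

With settings along these lines, jobs on a host that stays uncontactable past the timeout would be rescheduled if they are rerunnable and killed otherwise, while -r n makes non-rerunnable the default for submissions that don't say otherwise.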
_______________________________________________
users mailing list
https://gridengine.org/mailman/listinfo/users