[I suspect it's only really worth asking this sort of thing on
[email protected] now.]

William Hay <[email protected]> writes:

> The qmaster process on our cluster fell over and died this morning.
> After a restart it seems to be working fine.  At present we don't have
> a shadow master host in place.  The sge_shadowd man page seems to be
> written from the perspective of being run on a different host but I
> was wondering if one could run it on one's normal master host in order
> to restart the qmaster process where that had died?

Unless you have very high throughput, it's probably simpler just to
restart it with something like Nagios (if you're already monitoring from
the right place), or to run it under monit or similar.  I used monit
(for no particular reason) to get round continual crashes, though I also
have a Nagios monitor.

  # cat /etc/monit.d/qmaster
  check process sge_qmaster with pidfile /opt/sge/lv3/spool/qmaster/qmaster.pid
    start program = "/etc/init.d/sgemaster.lv3 start"
    stop program = "/etc/init.d/sgemaster.lv3 stop"
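
If you have neither monit nor Nagios to hand, the same idea can be
sketched as a small watchdog run from cron.  This is only an
illustration, not something I run: the function names are made up, and
it assumes the pidfile and init script shown above.  It needs to run as
root or as the user that owns sge_qmaster, since "kill -0" fails on
other users' processes and would make a live qmaster look dead.

```shell
#!/bin/sh
# Hypothetical watchdog sketch: run a restart command when the process
# named in a pidfile is no longer alive.  Function names are illustrative.

# Succeed if the pidfile is readable and names a live process.
# Note: kill -0 also fails (EPERM) for processes owned by another user.
qmaster_alive() {
    [ -r "$1" ] || return 1
    pid=$(cat "$1")
    [ -n "$pid" ] || return 1
    kill -0 "$pid" 2>/dev/null
}

# Usage: restart_if_dead PIDFILE COMMAND [ARGS...]
restart_if_dead() {
    pidfile=$1
    shift
    if qmaster_alive "$pidfile"; then
        return 0
    fi
    echo "no live process for $pidfile; running: $*" >&2
    "$@"
}

# Only act when invoked with a pidfile and a command.
if [ "$#" -ge 2 ]; then
    restart_if_dead "$@"
fi
```

Saved under some name of your choosing and called from cron every
minute or so with the pidfile followed by the start command (here
"/opt/sge/lv3/spool/qmaster/qmaster.pid /etc/init.d/sgemaster.lv3
start"), it does roughly what the monit stanza does, minus monit's
logging and alerting.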
  
> The qmaster died with the following error in /var/log/messages in case
> anyone thinks it is relevant:
> qkernel: sge_qmaster[29609] general protection rip:56d24d rsp:48335a90 error:0

If you're running tightly integrated parallel jobs under 6.2u4, 6.2u5,
and possibly earlier, try the patch at
https://arc.liv.ac.uk/trac/SGE/changeset/3511/sge/source/daemons/qmaster/sge_sched_process_events.c