For the mailing list archives:  I have more information about this
problem and a workaround which seems to be helping.

To summarize:

SGE appears to be killing all jobs, logging messages like this in the
qmaster messages file:

  07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports running job (16648.32/master) in queue "[email protected]" that was not supposed to be there - killing

All jobs are killed on all nodes in the cluster.  This occurs about 15
minutes after a node dies.

My relevant settings (from qconf -sconf) are:
  load_report_time             00:00:40
  max_unheard                  00:05:00
  reschedule_unknown           00:15:00
  qmaster_params               ENABLE_RESCHEDULE_KILL=true \
                               ENABLE_RESCHEDULE_SLAVE=true

Other notes:
  Running Sun SGE 6.2u5.
  Compute nodes are diskless and do not mount a shared sge_root.

My partial solution was to restore reschedule_unknown and
qmaster_params to their default values:

  reschedule_unknown           00:00:00
  qmaster_params               none

This seems to have solved my immediate problem.  I changed both
variables and didn't attempt to see which specific setting was causing
the problem.
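
To double-check what the qmaster is actually using after a change like
this, the two settings can be pulled out of the qconf -sconf output.
The following is only a sketch: show_reschedule_settings is a
hypothetical helper name (not part of SGE), and it assumes that
continuation lines end in a backslash, as in the listing above.

```shell
# show_reschedule_settings: hypothetical helper, not an SGE command.
# Reads `qconf -sconf` output on stdin and prints only the
# reschedule_unknown and qmaster_params entries, including any
# backslash-continued lines that belong to them.
show_reschedule_settings() {
    awk '
        /^[ \t]*(reschedule_unknown|qmaster_params)[ \t]/ {
            print
            cont = /\\[ \t]*$/   # remember whether this entry continues
            next
        }
        cont {
            print
            cont = /\\[ \t]*$/
        }
    '
}

# Typical use on the qmaster host (left commented out here):
#   qconf -sconf global | show_reschedule_settings
# The values themselves can be edited interactively with:
#   qconf -mconf global
```

With the defaults restored, this should print reschedule_unknown as
00:00:00 and qmaster_params as none.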

Remaining issue:

What remains is the original problem that led me to set these
variables in the first place.

If a node dies or is rebooted, SGE does not do anything about hung jobs
when the node comes back online.  The jobs continue to appear in
the queue as if they were running.

This may be related to my use of diskless nodes, where the local spool
directory is cleared on reboot.  I will be looking into putting the
execd spool files on a shared directory in the future, which may
address this problem.
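
Until then, stale jobs left behind by a rebooted node can be cleaned
up by hand.  This is only a sketch of one possible approach, not
something I have verified on this cluster: job_ids_from_qstat is a
hypothetical helper name, the node name is a placeholder, and qdel -f
normally needs manager or operator rights.

```shell
# job_ids_from_qstat: hypothetical helper, not an SGE command.
# Reads default `qstat` output on stdin and prints each job ID once.
# Header and separator lines are skipped because their first field
# is not purely numeric.
job_ids_from_qstat() {
    awk '$1 ~ /^[0-9]+$/ { print $1 }' | sort -un
}

# Manual cleanup for one rebooted node (left commented out; review the
# list before deleting anything).  "somenode" is a placeholder name.
#   node=somenode
#   qstat -u '*' -s r -q "*@$node" | job_ids_from_qstat
#   qstat -u '*' -s r -q "*@$node" | job_ids_from_qstat | xargs qdel -f
```

Listing the job IDs first and deleting in a separate step makes it
harder to remove jobs that are still legitimately running elsewhere.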

Stuart Barkley
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
