I'm working on my green support code and am seeing an issue where SGE appears to be killing all jobs, with messages like:

07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports running job (16648.32/master) in queue "[email protected]" that was not supposed to be there - killing

Has anyone seen anything like this?

This seems to be triggered when I power off (several?) compute nodes in a short period of time.

The recent history of the nodes being powered off:

    Node is enabled and running a job
    Job finishes/is killed
    Green code notices extra idle nodes:
        disables queues on the nodes:
            'qmod -d *@node'
            'qconf -mattr exechost complex_values green_state="$timestamp: disabled" $node'
    Green code notices disabled queue and still no jobs:
        Updates state:
            'qconf -mattr exechost complex_values green_state="$timestamp: _off" $node'
        power off nodes
    (Apparently) SGE notices the dead node(s) and kills incorrect jobs

There are (currently) 2 minute delays between each of the steps above. Several nodes change states in each step with a few seconds between each action.

Other Notes:

    Running Sun SGE 6.2u5.
    Only one queue instance on each node.
    Compute nodes are diskless and do not mount a shared sge_root.
    Power off used ipmi power off (no graceful shutdown).
    A complex string variable is used to track state.
    I have settings:
        load_report_time 00:00:40
        max_unheard 00:05:00
        reschedule_unknown 00:15:00

I think I have reproduced the problem manually without the complex variable.

    SGE queue empty, all nodes enabled and powered on
    qmaster restarted
    wait for all nodes to reappear in qstat -f output
    3:17 qsub -t 1-800 do_sge_burnin
    3:17 qstat -u \*
    3:18 qdel 16651 -t 33-800
    qmaster restarted; wait for qstat -f output to reset
    3:25 qmod -d green@bc08\*
    ipmitool power off nodes 80-89 in another window
    3:26 qstat -f -q green@\* | sort -u
    3:26 qstat -u \* | sort
    3:27 qstat -f -q green@\* | sort -u
    3:27 qstat -u \* | sort

The messages log file shows the jobs killed between 03:31:47 and 03:32:25 (approximately one job per second).

Any thoughts or known bugs?
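
For anyone who wants to follow the sequence on a test cluster, the power-down steps listed above can be sketched roughly as the shell function below. This is only an illustration, not the actual green code. The names DRY_RUN, STEP_DELAY, TIMESTAMP, run, stamp and green_power_off are all made up for the sketch, the timestamp format is a guess (the post only shows "$timestamp"), the BMC hostname argument is an assumption, and the ipmitool authentication options are left out. The check that the node is still idle before powering it off is also omitted, and the sketch assumes a single delay between the two stages.

```shell
#!/bin/sh
# Illustrative sketch only, not the real green code.
# Hypothetical names: DRY_RUN, STEP_DELAY, TIMESTAMP, run, stamp, green_power_off.

DRY_RUN=${DRY_RUN:-1}          # 1 = print the commands instead of running them
STEP_DELAY=${STEP_DELAY:-120}  # seconds; the post mentions 2 minute delays

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Timestamp for green_state. The format is an assumption; TIMESTAMP can be
# set to force a fixed value.
stamp() {
    printf '%s' "${TIMESTAMP:-$(date +%s)}"
}

# usage: green_power_off NODE BMC_HOST
green_power_off() {
    node=$1
    bmc=$2

    # Stage 1: disable the queue instances on the node and record the state.
    run qmod -d "*@$node"
    run qconf -mattr exechost complex_values "green_state=$(stamp): disabled" "$node"

    # Wait before the next stage (assumed to be a single delay).
    run sleep "$STEP_DELAY"

    # Stage 2: mark the node as off, then hard power off (no graceful shutdown).
    # The "still no jobs" check from the post is NOT implemented here.
    run qconf -mattr exechost complex_values "green_state=$(stamp): _off" "$node"
    run ipmitool -H "$bmc" chassis power off
}
```

With DRY_RUN left at its default of 1, calling "green_power_off node01 node01-bmc" (both names hypothetical) only prints the five commands in order, so the sequence can be reviewed before anything is actually disabled or powered off.
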

Thanks,
Stuart Barkley
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone
