On 09.07.2011 at 10:20, Stuart Barkley wrote:

> I'm working on my green support code and am seeing an issue where SGE
> appears to be killing all jobs with messages like:
>
> 07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports
> running job (16648.32/master) in queue "[email protected]" that was not
> supposed to be there - killing

SGE thinks the node is reporting something despite the fact that it's switched off?

> Has anyone seen anything like this?
>
> This seems to be triggered when I power off (several?) compute nodes
> in a short period of time.
>
> The recent history of the nodes being powered off:
>   Node is enabled and running a job
>   Job finishes/is killed
>   Green code notices extra idle nodes: disables queues on the nodes:
>     'qmod -d *@node'
>     'qconf -mattr exechost complex_values green_state="$timestamp: disabled" $node'
>   Green code notices disabled queue and still no jobs: Updates state:
>     'qconf -mattr exechost complex_values green_state="$timestamp: _off" $node'
>   power off nodes
>   (Apparently) SGE notices the dead node(s) and kills incorrect jobs

Is there something left over from old jobs in the spool directory of a node, like

  /var/spool/sge/node02/jobs

or

  /var/spool/sge/node02/active_jobs

Do you have this directory on a RAM disk, given that the node is diskless and the directory is not shared?

Also, is the node you switch off possibly part of a parallel job which is in a serial step right now, without an active `qrsh -inherit ...` to this particular node?

-- Reuti

> There are (currently) 2 minute delays between each of the steps above.
> Several nodes change states in each step with a few seconds between
> each action.
>
> Other Notes:
>   Running SUN SGE 6.2u5.
>   Only one queue instance on each node.
>   Compute nodes are diskless and do not mount a shared sge_root.
>   Power off used ipmi power off (no graceful shutdown).
>   A complex string variable is used to track state.
>
> I have settings:
>   load_report_time 00:00:40
>   max_unheard 00:05:00
>   reschedule_unknown 00:15:00
>
> I think I have reproduced the problem manually without the complex
> variable.
>
> SGE queue empty, all nodes enabled and powered on
> qmaster restarted
> wait for all nodes to reappear in qstat -f output
>
> 3:17 qsub -t 1-800 do_sge_burnin
> 3:17 qstat -u \*
> 3:18 qdel 16651 -t 33-800
>
> qmaster restarted; wait for qstat -f output to reset
>
> 3:25 qmod -d green@bc08\*
>
> ipmitool power off nodes 80-89 in another window
>
> 3:26 qstat -f -q green@\* | sort -u
> 3:26 qstat -u \* | sort
> 3:27 qstat -f -q green@\* | sort -u
> 3:27 qstat -u \* | sort
>
> The messages log file shows the jobs killed between 03:31:47 and
> 03:32:25 (approximately one job per second).
>
> Any thoughts or known bugs?
>
> Thanks,
> Stuart Barkley
> --
> I've never been lost; I was once bewildered for three days, but never lost!
>                                         --  Daniel Boone
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
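
Regarding the spool directory question above: a minimal sketch of how one might look for leftover entries on a node before powering it off. The function name `leftover_spool_entries` is hypothetical, and the path /var/spool/sge/node02 is only the example used in the question; the real location depends on the local execd spool directory setting. It uses `find -mindepth`, which is assumed to be available (GNU or BSD find).

```shell
#!/bin/sh
# Hypothetical helper: list whatever is left below the "jobs" and
# "active_jobs" subdirectories of an execd spool directory.
# Usage: leftover_spool_entries SPOOL_DIR
leftover_spool_entries() {
    spool_dir=$1
    for sub in jobs active_jobs; do
        dir="$spool_dir/$sub"
        # A missing subdirectory means there is nothing left over there.
        [ -d "$dir" ] || continue
        # Print every entry below the subdirectory, but not the
        # subdirectory itself.
        find "$dir" -mindepth 1 -print
    done
}
```

Run on the node as, for example, `leftover_spool_entries /var/spool/sge/node02`. Empty output would suggest that no old job data is left in that spool directory; any printed path is a candidate for the kind of stale entry asked about.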
