I'm working on my green support code and am seeing an issue where SGE
appears to be killing all jobs with messages like:

  07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports running job (16648.32/master) in queue "[email protected]" that was not supposed to be there - killing

Has anyone seen anything like this?

This seems to be triggered when I power off (several?) compute nodes
in a short period of time.

The recent history of the nodes being powered off:
  Node is enabled and running a job
  Job finishes/is killed
  Green code notices extra idle nodes: disables queues on the nodes:
    'qmod -d *@node'
    'qconf -mattr exechost complex_values green_state="$timestamp: disabled" $node'
  Green code notices disabled queue and still no jobs: Updates state:
    'qconf -mattr exechost complex_values green_state="$timestamp: _off" $node'
    power off nodes
  (Apparently) SGE notices the dead node(s) and kills incorrect jobs
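
A minimal sketch of the two transitions above, for anyone who wants to
try the same sequence. It only wraps the qmod/qconf commands already
shown. The names run, green_disable, green_off, power_off_node and
DRY_RUN are hypothetical and are not taken from the real green code.
DRY_RUN defaults to 1, so the commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch only. With DRY_RUN=1 (the default) commands are printed, not run.
DRY_RUN=${DRY_RUN:-1}

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical placeholder: replace with the site's IPMI power-off call.
power_off_node() {
    echo "power_off_node: not implemented for $1" >&2
    return 1
}

# Idle node noticed: disable its queue instances and record the state.
green_disable() {
    node=$1
    timestamp=$2
    run qmod -d "*@$node"
    run qconf -mattr exechost complex_values \
        "green_state=$timestamp: disabled" "$node"
}

# Queue still disabled and still no jobs: record the state, then power off.
green_off() {
    node=$1
    timestamp=$2
    run qconf -mattr exechost complex_values \
        "green_state=$timestamp: _off" "$node"
    run power_off_node "$node"
}
```

Calling 'green_disable bc0801 "$(date +%s)"' in dry-run mode prints the
two commands it would issue for that node (bc0801 is just an example
host name).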

There are (currently) 2 minute delays between each of the steps above.
Several nodes change states in each step with a few seconds between
each action.

Other Notes:
  Running SUN SGE 6.2u5.
  Only one queue instance on each node.
  Compute nodes are diskless and do not mount a shared sge_root.
  Nodes are powered off with IPMI power off (no graceful shutdown).
  A complex string variable is used to track state.

I have these settings:
  load_report_time             00:00:40
  max_unheard                  00:05:00
  reschedule_unknown           00:15:00

I think I have reproduced the problem manually without the complex
variable.

   SGE queue empty, all nodes enabled and powered on
   qmaster restarted
   wait for all nodes to reappear in qstat -f output

   3:17    qsub -t 1-800 do_sge_burnin
   3:17    qstat -u \*
   3:18    qdel 16651 -t 33-800

   qmaster restarted; wait for qstat -f output to reset

   3:25    qmod -d green@bc08\*

   ipmitool power off nodes 80-89 in another window

   3:26    qstat -f -q green@\* | sort -u
   3:26    qstat -u \* | sort
   3:27    qstat -f -q green@\* | sort -u
   3:27    qstat -u \* | sort

   The messages log file shows the jobs killed between 03:31:47 and
   03:32:25 (approximately one job per second).
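
   A rough timing check, offered as an observation rather than a
   diagnosis: the sketch below computes the delay between the power-off
   window (3:25 to 3:26) and the first kill at 03:31:47, and compares it
   with max_unheard (00:05:00) and reschedule_unknown (00:15:00). The
   exact power-off second is not known, so both ends of the window are
   used. The helper name to_secs is hypothetical.

```shell
#!/bin/sh
# Convert HH:MM:SS to seconds since midnight (awk reads "03" as 3).
to_secs() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

first_kill=$(to_secs 03:31:47)
max_unheard=$(to_secs 00:05:00)
reschedule_unknown=$(to_secs 00:15:00)

# Delay from each end of the power-off window to the first kill.
delay_max=$(( first_kill - $(to_secs 03:25:00) ))
delay_min=$(( first_kill - $(to_secs 03:26:00) ))

echo "delay between power off and first kill: ${delay_min}s to ${delay_max}s"
echo "max_unheard: ${max_unheard}s, reschedule_unknown: ${reschedule_unknown}s"
```

   The delay comes out at 347 to 407 seconds, which is a little over
   max_unheard (300 seconds) and well under reschedule_unknown (900
   seconds). That would be consistent with the kills starting shortly
   after the powered-off hosts pass max_unheard, but this is an
   inference from the timestamps only.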

Any thoughts or known bugs?

Thanks,
Stuart Barkley
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone