Re: [gridengine users] message: ... reports running job ... that was not supposed to be there - killing

Reuti Sat, 09 Jul 2011 06:09:05 -0700

On 09.07.2011 at 10:20, Stuart Barkley wrote:

> I'm working on my green support code and am seeing an issue where SGE
> appears to be killing all jobs with messages like:
> 
>  07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports 
> running job (16648.32/master) in queue "[email protected]" that was not 
> supposed to be there - killing


So SGE thinks the node is still reporting something even though it is switched 
off?


> Has anyone seen anything like this?
> 
> This seems to be triggered when I power off (several?) compute nodes
> in a short period of time.
> 
> The recent history of the nodes being powered off:
>  Node is enabled and running a job
>  Job finishes/is killed
>  Green code notices extra idle nodes: disables queues on the nodes:
>    'qmod -d *@node'
>    'qconf -mattr exechost complex_values green_state="$timestamp: disabled" $node'
>  Green code notices disabled queue and still no jobs: Updates state:
>    'qconf -mattr exechost complex_values green_state="$timestamp: _off" $node'
>    power off nodes
>  (Apparently) SGE notices the dead node(s) and kills incorrect jobs
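
For reference, the sequence quoted above could be sketched as a dry-run shell script. This is only a sketch: the `qmod` and `qconf` command lines are taken from the description, while the `run` helper, the `DRY_RUN` variable, the function names and the example node name are assumptions made for illustration. The power-off step is left as a placeholder because the exact IPMI invocation is not shown.

```shell
#!/bin/sh
# Dry-run sketch of the green workflow described above.
# Helper and function names are hypothetical; only the qmod/qconf
# command lines come from the original description.

DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Step 1: an idle node was noticed, so disable its queue instances
# and record the state in the green_state complex.
green_disable() {
    node=$1
    timestamp=$2
    run qmod -d "*@$node"
    run qconf -mattr exechost complex_values "green_state=$timestamp: disabled" "$node"
}

# Step 2: the queue is disabled and there are still no jobs, so record
# the "_off" state and power the node off.
green_power_off() {
    node=$1
    timestamp=$2
    run qconf -mattr exechost complex_values "green_state=$timestamp: _off" "$node"
    # Placeholder only: the real IPMI command line is site specific.
    printf 'TODO: power off %s\n' "$node"
}
```

Calling `green_disable node02 "$(date +%s)"` with `DRY_RUN=1` only prints the two commands, which makes it easy to compare the order and timing of the steps with the qmaster messages file.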

Is there anything left over from old jobs in the spool directory of a node, e.g. 
/var/spool/sge/node02/jobs or /var/spool/sge/node02/active_jobs? Is this 
directory on a RAM disk, given that the nodes are diskless and it is not shared?
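
A quick way to look for such leftovers might be the sketch below. The function name is made up for illustration, and the spool path has to be adjusted to the local setup; the two subdirectory names are the ones mentioned above.

```shell
# List leftover entries under a node's execd spool directory.
# list_spool_leftovers is a hypothetical helper name; pass the node's
# spool directory (e.g. /var/spool/sge/node02) as the only argument.
list_spool_leftovers() {
    spool_dir=$1
    for sub in jobs active_jobs; do
        for entry in "$spool_dir/$sub"/*; do
            # An unmatched glob stays literal, so check that the entry exists.
            if [ -e "$entry" ]; then
                printf '%s\n' "$entry"
            fi
        done
    done
}
```

Run on the node itself as `list_spool_leftovers /var/spool/sge/node02`; no output means both subdirectories are empty or missing.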

And the node you switch off is not part of a parallel job that is currently in 
a serial step, i.e. without an active `qrsh -inherit ...` to this particular node?

-- Reuti


> There are (currently) 2 minute delays between each of the steps above.
> Several nodes change states in each step with a few seconds between
> each action.
> 
> Other Notes:
>  Running SUN SGE 6.2u5.
>  Only one queue instance on each node.
>  Compute nodes are diskless and do not mount a shared sge_root.
>  Nodes are powered off via IPMI power off (no graceful shutdown).
>  A complex string variable is used to track state.
> 
> I have settings:
>  load_report_time             00:00:40
>  max_unheard                  00:05:00
>  reschedule_unknown           00:15:00
> 
> I think I have reproduced the problem manually without the complex
> variable.
> 
>   SGE queue empty, all nodes enabled and powered on
>   qmaster restarted
>   wait for all nodes to reappear in qstat -f output
> 
>   3:17    qsub -t 1-800 do_sge_burnin
>   3:17    qstat -u \*
>   3:18    qdel 16651 -t 33-800
> 
>   qmaster restarted; wait for qstat -f output to reset
> 
>   3:25    qmod -d green@bc08\*
> 
>   ipmitool power off nodes 80-89 in another window
> 
>   3:26    qstat -f -q green@\* | sort -u
>   3:26    qstat -u \* | sort
>   3:27    qstat -f -q green@\* | sort -u
>   3:27    qstat -u \* | sort
> 
>   The messages log file shows the jobs killed between 03:31:47 and
>   03:32:25 (approximately one job per second).
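
It may help to compare these timestamps with the configured timeouts. The sketch below converts the quoted times to seconds and prints the delay between the power-off and the first kill. The 03:26:00 power-off time is an approximation read from the "3:25"/"3:26" entries above, and any relation between this delay and `max_unheard` is an assumption to be checked, not a confirmed cause.

```shell
# Convert HH:MM:SS to seconds (awk treats the fields as decimal numbers,
# so leading zeros are harmless).
to_seconds() {
    printf '%s\n' "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

power_off=$(to_seconds 03:26:00)    # approximate, from the "3:26" entry
first_kill=$(to_seconds 03:31:47)   # first kill in the messages file
max_unheard=$(to_seconds 00:05:00)  # configured max_unheard

delay=$(( first_kill - power_off ))
echo "delay after power-off: ${delay}s, max_unheard: ${max_unheard}s"
# prints: delay after power-off: 347s, max_unheard: 300s
```

With these numbers the first kill comes a little under a minute after `max_unheard` has elapsed, so it would be worth checking in the messages file what the qmaster logs about the powered-off hosts around that time.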
> 
> Any thoughts or known bugs?
> 
> Thanks,
> Stuart Barkley
> -- 
> I've never been lost; I was once bewildered for three days, but never lost!
>                                        --  Daniel Boone
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

