On Sat, 9 Jul 2011 at 09:07 -0000, Reuti wrote:

> On 09.07.2011 at 10:20, Stuart Barkley wrote:
>
> > I'm working on my green support code and am seeing an issue where
> > SGE appears to be killing all jobs with a messages like:
> >
> >  07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports 
> > running job (16648.32/master) in queue "[email protected]" that was not 
> > supposed to be there - killing
>
> SGE thinks the node is reporting something despite the fact that
> it's switched off?

SGE is killing jobs on nodes unrelated to the ones being powered off.
It appears to kill every other running job on the cluster.
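
To gauge the scope, the kill messages can be pulled from the qmaster
log; this assumes the default cell and qmaster spooling under
$SGE_ROOT:

    grep -c 'not supposed to be there' \
        $SGE_ROOT/default/spool/qmaster/messages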

> > Has any one seen anything like this?
> >
> > This seems to be triggered when I power off (several?) compute
> > nodes in a short period of time.
> >
> > The recent history of the nodes being powered off:
> >  Node is enabled and running a job
> >  Job finishes/is killed
> >  Green code notices extra idle nodes: disables queues on the nodes:
> >    'qmod -d *@node'
> >    'qconf -mattr exechost complex_values green_state="$timestamp: disabled" $node'
> >  Green code notices disabled queue and still no jobs: Updates state:
> >    'qconf -mattr exechost complex_values green_state="$timestamp: _off" $node'
> >    power off nodes
> >  (Apparently) SGE notices the dead node(s) and kills incorrect jobs
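
For reference, a rough sketch of that sequence for a single node (the
node name, timestamp format and the ipmitool power-off command are
illustrative, not the actual green code):

    node=node02
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')

    # first pass: node is idle, disable its queues so nothing new starts
    qmod -d "*@$node"
    qconf -mattr exechost complex_values "green_state=$timestamp: disabled" $node

    # later pass: still idle, record the state change and cut power
    qconf -mattr exechost complex_values "green_state=$timestamp: _off" $node
    ipmitool -H $node-bmc -U admin -P secret chassis power off
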
>
> Is there something left of old jobs in the spool directory of a node
> like /var/spool/sge/node02/jobs or /var/spool/sge/node02/active_jobs

I'll need to take a look.  It is possible that something was left
behind from earlier.  I haven't rebooted all the other nodes recently.
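
Something like the following should show any leftovers (node names
are illustrative; spool path per your example):

    for n in node01 node02 node03; do
        ssh $n "ls -l /var/spool/sge/$n/jobs /var/spool/sge/$n/active_jobs" 2>/dev/null
    done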

> Do you have this directory in a ram disk when it's diskless and
> non-shared?

Yes, the spool directory is on local ram disk.
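
Roughly like this on the diskless nodes (the mount size here is
illustrative):

    # /etc/fstab
    tmpfs   /var/spool/sge   tmpfs   size=256m,mode=0755   0 0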

Historically, I've not liked shared NFS file systems with lots of R/W
traffic across many systems, and I started my installation testing on
systems without a good shared NFS server.

SGE does seem to have a sensible disk layout, so this shouldn't cause
problems, and my main clusters now have much better shared filesystems
(NetApp and GPFS).

I plan to look at moving sge_root to a shared NFS mount in the near
future.

> The node you shut down is also not part of a parallel job which is
> currently in a serial step, without an active `qrsh -inherit ...` to
> this particular node?

No, these are fully independent jobs.  For my test they were all
individual members of an array job, but the problem has killed
unrelated jobs for other users (also members of array jobs).

My test job requests '-pe thread 8' but runs only a single thread.  I
can try again without the PE.
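
Something along these lines (script name and task range are
illustrative):

    # current test submission
    qsub -pe thread 8 -t 1-100 testjob.sh

    # retry without the PE
    qsub -t 1-100 testjob.sh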

Thanks,
Stuart
-- 
I've never been lost; I was once bewildered for three days, but never lost!
                                        --  Daniel Boone