On 09.07.2011 at 16:19, Stuart Barkley wrote:

> On Sat, 9 Jul 2011 at 09:07 -0000, Reuti wrote:
> 
>> On 09.07.2011 at 10:20, Stuart Barkley wrote:
>> 
>>> I'm working on my green support code and am seeing an issue where
>>> SGE appears to be killing all jobs with messages like:
>>> 
>>> 07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports 
>>> running job (16648.32/master) in queue "[email protected]" that was not 
>>> supposed to be there - killing
>> 
>> SGE thinks the node is reporting something despite the fact that
>> it's switched off?
> 
> SGE is killing jobs on nodes unrelated to the nodes powered off.  It
> appears to actually kill all other jobs on the cluster.

I can't remember seeing this.

All nodes have fixed names and TCP/IP addresses even after a reboot?


>>> Has anyone seen anything like this?
>>> 
>>> This seems to be triggered when I power off (several?) compute
>>> nodes in a short period of time.
>>> 
>>> The recent history of the nodes being powered off:
>>> Node is enabled and running a job
>>> Job finishes/is killed
>>> Green code notices extra idle nodes: disables queues on the nodes:
>>>   'qmod -d *@node'
>>>   'qconf -mattr exechost complex_values green_state="$timestamp: disabled" $node'
>>> Green code notices disabled queue and still no jobs: updates state:
>>>   'qconf -mattr exechost complex_values green_state="$timestamp: _off" $node'
>>>   power off nodes
>>> (Apparently) SGE notices the dead node(s) and kills incorrect jobs
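For reference, the quoted power-off sequence could be sketched as a small shell script. This is only a sketch: `green_state` is a site-specific complex (not a stock SGE attribute), the IPMI call is just one example of a power-off mechanism, and `SGE_CMD` defaults to `echo` so the script dry-runs by default.

```shell
#!/bin/sh
# Sketch of the "green" power-off sequence quoted above.
# SGE_CMD defaults to "echo" (dry run, prints the commands);
# set SGE_CMD="" to actually execute them.
SGE_CMD="${SGE_CMD:-echo}"

green_off() {
    node=$1
    timestamp=$(date +%Y%m%d%H%M%S)

    # 1. Disable all queue instances on the node so no new jobs start there.
    $SGE_CMD qmod -d "*@$node"

    # 2. Record the transition in the site-specific green_state complex.
    $SGE_CMD qconf -mattr exechost complex_values \
        "green_state=$timestamp: disabled" "$node"

    # Later, once the queues are disabled and still idle:
    $SGE_CMD qconf -mattr exechost complex_values \
        "green_state=$timestamp: _off" "$node"

    # 3. Power the node off (site-specific; IPMI shown only as an example).
    $SGE_CMD ipmitool -H "$node-ipmi" chassis power off
}

green_off node01
```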
>> 
>> Is there something left of old jobs in the spool directory of a node
>> like /var/spool/sge/node02/jobs or /var/spool/sge/node02/active_jobs
> 
> I'll need to take a look.  It is possible that something was left
> behind from earlier.  I haven't rebooted all the other nodes recently.

Usually there is something left over from former runs when you reset a 
node. This is how SGE detects former jobs on the node and removes, e.g., the 
scratch directory $TMPDIR it created there. If $TMPDIR is in /tmp on a RAM 
disk anyway, this won't hurt. But...
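A quick way to check for such leftovers is to list the job directories in each node's local spool. A minimal sketch, assuming the common layout `$SPOOL/<node>/{jobs,active_jobs}` (adjust `SPOOL` to your actual `execd_spool_dir`):

```shell
#!/bin/sh
# Sketch: list leftover job state in an exec host's local spool.
# SPOOL defaults to /var/spool/sge; override it for your installation.
SPOOL="${SPOOL:-/var/spool/sge}"

check_spool() {
    node=$1
    for d in "$SPOOL/$node/jobs" "$SPOOL/$node/active_jobs"; do
        # Report the directory only if it exists and is non-empty.
        if [ -d "$d" ] && [ -n "$(ls -A "$d" 2>/dev/null)" ]; then
            echo "leftover job state in $d:"
            ls "$d"
        fi
    done
}

check_spool node02
```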


>> Do you have this directory in a ram disk when it's diskless and
>> non-shared?
> 
> Yes, the spool directory is on local ram disk.

Then it can happen that jobs are still listed in `qstat` although the node 
rebooted.


> Historically, I've not liked shared NFS file systems with lots of R/W
> across many systems, and I started my installation testing with systems
> without a good shared NFS server.

To lower traffic, it's sufficient to have only the spool directory local and 
share everything else.
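For illustration, the spool location is controlled by the `execd_spool_dir` parameter, which can be set per host in the execution host's local configuration (paths here are only examples):

```shell
# Keep $SGE_ROOT shared but spool locally on each exec host (a sketch).
qconf -sconf node02      # show node02's local configuration
qconf -mconf node02      # edit it and set, e.g.:
#   execd_spool_dir   /var/spool/sge
```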

-- Reuti


> SGE does seem to have a good disk layout so things shouldn't have
> problems and my main clusters now have much better shared filesystems
> (NetApp and GPFS).
> 
> I plan to look at moving sge_root to a shared NFS mount in the near
> future.
> 
>> The node you switch down is also not part of a parallel job, which
>> is right now in a serial step without active `qrsh -inherit ...` to
>> this particular node?
> 
> No, these are fully independent jobs.  For my test they were all
> individual members of an array job, but the problem has killed
> unrelated jobs for other users (also members of array jobs).
> 
> My test job has '-pe thread 8' but is only a single thread.  I can try
> again without the PE.
> 
> Thanks,
> Stuart
> -- 
> I've never been lost; I was once bewildered for three days, but never lost!
>                                        --  Daniel Boone
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

