On 09.07.2011 at 16:19, Stuart Barkley wrote:

> On Sat, 9 Jul 2011 at 09:07 -0000, Reuti wrote:
>
>> On 09.07.2011 at 10:20, Stuart Barkley wrote:
>>
>>> I'm working on my green support code and am seeing an issue where
>>> SGE appears to be killing all jobs with messages like:
>>>
>>> 07/09/2011 02:14:07|worker|betsy-qmaster|E|[email protected] reports
>>> running job (16648.32/master) in queue "[email protected]" that was not
>>> supposed to be there - killing
>>
>> SGE thinks the node is reporting something despite the fact that
>> it's switched off?
>
> SGE is killing jobs on nodes unrelated to the nodes powered off. It
> appears to actually kill all other jobs on the cluster.
I can't remember seeing this. All nodes have fixed names and TCP/IP addresses, even after a reboot?

>>> Has anyone seen anything like this?
>>>
>>> This seems to be triggered when I power off (several?) compute
>>> nodes in a short period of time.
>>>
>>> The recent history of the nodes being powered off:
>>> Node is enabled and running a job
>>> Job finishes/is killed
>>> Green code notices extra idle nodes and disables queues on the nodes:
>>>   'qmod -d *@node'
>>>   'qconf -mattr exechost complex_values green_state="$timestamp: disabled" $node'
>>> Green code notices the disabled queue and still no jobs, and updates the state:
>>>   'qconf -mattr exechost complex_values green_state="$timestamp: _off" $node'
>>> Power off nodes
>>> (Apparently) SGE notices the dead node(s) and kills incorrect jobs
>>
>> Is there something left of old jobs in the spool directory of a node,
>> like /var/spool/sge/node02/jobs or /var/spool/sge/node02/active_jobs?
>
> I'll need to take a look. It is possible that something was left
> behind from earlier. I haven't rebooted all the other nodes recently.

Usually there should be something left from former runs in case you reset a node. This is how SGE can detect former jobs thereon and remove e.g. the created scratch directory $TMPDIR on a node. If it is in /tmp in RAM anyway, this won't hurt though. But...

>> Do you have this directory in a RAM disk when it's diskless and
>> non-shared?
>
> Yes, the spool directory is on a local RAM disk.

Then it can happen that jobs are still listed in `qstat` although the node rebooted.

> Historically, I've not liked shared NFS file systems with lots of R/W
> across many systems, and I started my installation testing with systems
> without a good shared NFS server.

To lower traffic, it's sufficient to have the spool directory local, but all other stuff shared.
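The power-down sequence listed above can be sketched as a small script. This is only an illustration under assumptions from the thread: the `green_state` complex is a site-local convention, not part of stock SGE, and `ipmitool` is just one hypothetical way to cut power; the `qmod`/`qconf` invocations mirror the ones quoted above. `DRYRUN=1` (the default here) only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch of the "green" power-down sequence from the thread.
# Assumptions: a site-defined green_state complex exists on the exec
# hosts, and IPMI is reachable at "$NODE-ipmi". Not a stock SGE feature.

NODE=${NODE:-node02}     # hypothetical node name
DRYRUN=${DRYRUN:-1}      # 1 = only print the commands

run() {
    if [ "$DRYRUN" = 1 ]; then
        echo "$@"        # dry run: show what would be executed
    else
        "$@"
    fi
}

timestamp=$(date +%s)

# 1. The node is idle: disable all queue instances on it so no new
#    jobs can start there, and record the state in green_state.
run qmod -d "*@$NODE"
run qconf -mattr exechost complex_values "green_state=$timestamp: disabled" "$NODE"

# 2. Later, once the disabled queue is still empty, mark the node as
#    off and power it down.
run qconf -mattr exechost complex_values "green_state=$timestamp: _off" "$NODE"
run ipmitool -H "$NODE-ipmi" chassis power off
```

As the thread suggests, checking the node's spool directory (e.g. /var/spool/sge/node02/active_jobs) for leftovers before the node comes back is worthwhile when the spool lives on a RAM disk, since those leftovers are what sge_execd uses to reconcile jobs after a reboot.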
-- 
Reuti

> SGE does seem to have a good disk layout, so things shouldn't have
> problems, and my main clusters now have much better shared filesystems
> (NetApp and GPFS).
>
> I plan to look at moving sge_root to a shared NFS mount in the near
> future.
>
>> The node you switch down is also not part of a parallel job which
>> is right now in a serial step without an active `qrsh -inherit ...` to
>> this particular node?
>
> No, these are fully independent jobs. For my test they were all
> individual members of an array job, but the problem has killed
> unrelated jobs for other users (also members of array jobs).
>
> My test job has '-pe thread 8' but is only a single thread. I can try
> again without the PE.
>
> Thanks,
> Stuart
> -- 
> I've never been lost; I was once bewildered for three days, but never lost!
> -- Daniel Boone
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
