Hi guys, I inherited a cluster running SGE 6.2u3, so it's a bit on the old side. I had my storage node crash the other day and after a reboot the filesystem was dirty and wouldn't mount until after I'd run an xfs_repair, although I can't see anything missing as such.
The situation I have now is that whilst all my Cluster Queues as shown in qstat and in qmon (divided up as short, medium and long) are still there, the Queue Instances have disappeared for everything except the long queue. I tried to modify the Cluster Queues for say the short queue and all the hostlists were present as I'd expect. In qmon, it just shows broken queues as all zeros - zero in use, zero avail, zero total, zero in error, CQLOA of -NA-.
I dug about in the filesystem to see if I'd lost files, but the spool/qinstance/medium/nodexx.cluster type files are all present and readable - just seems like GE is ignoring them (although I'm not sure if loss of them would have caused this behaviour).
I found by messing around that if I cloned the short Cluster Queue via qmon to create a short2 queue, it would populate the Queue Instances correctly and I'd have my usual number of total slots and the short2 queue appeared to work fine and dandy.
So my questions:
- Any ideas why my GE lost the Queue Instances?
- Is there an easier way to get them back? (Not that cloning a Cluster Queue is difficult, but if there's a more "correct" way to do it, then I'd rather know.)
- Is there a qconf equivalent of qmon's Clone button?
I'm a bit out of my depth with this and my google-fu seems to be letting me down.
Thanks in advance.
