We recently had some planned downtime on our cluster to allow for testing of our machine room's electrical supply. Unfortunately, when power was restored, fsck found some corruption on the filesystem that holds our grid engine configuration. It was able to correct the problems at the file system level but doesn't appear to have managed to get everything correct.
The main effects appear to be:

1) qconf -sel lists only 63 of the 904 hosts in the cluster.
2) One of our admin hosts no longer appears to be present.
3) Although all our cluster queues and hostgroups appear to be correct, only a single queue instance is displayed by qstat -f.

Although I can correct problems 1 and 2 using qconf, they recur if I softstop and then restart the queue master. In an attempt to fix 3, I reran the scripts that recreate the cluster queues with a cosmetic change, but this had no effect.

While investigating (1), I found that although we appear to have a file in $SGE_ROOT/default/spool/exec_hosts for all 904 hosts, not all of them contained an exec host configuration. Some appeared to be random files from the job spool instead. I removed these and used qconf to redefine the exec hosts, but this made no difference. I'm guessing that somewhere in the grid engine config there is another file, or files, that are not what they are supposed to be, and that this is causing the problem.

While I can restore from backup, or just do a fresh install and rerun the various scripts I used to create our config, I was wondering whether anyone has written a tool to validate the on-disk config.

We use classic spooling on 6.2u3.

William
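The exec_hosts check described above could be scripted roughly as follows. This is only a sketch, not an existing Grid Engine tool. It assumes that a valid classic-spool exec host file contains a `hostname <name>` line whose value matches the file name (compared case-insensitively); that layout is an assumption and should be confirmed against a known-good file first. The function names `check_exec_host_file` and `find_suspect_files` are hypothetical.

```python
import os
import sys


def check_exec_host_file(path):
    """Return None if the file looks like an exec host config, else a reason.

    ASSUMPTION: a valid file has a line "hostname <name>" where <name>
    equals the file name. Verify this against a known-good spool file.
    """
    expected = os.path.basename(path).lower()
    try:
        # errors="replace" lets us scan stray binary files without crashing.
        with open(path, "r", errors="replace") as fh:
            for line in fh:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                parts = line.split(None, 1)
                if parts[0] == "hostname":
                    found = parts[1].strip() if len(parts) > 1 else ""
                    if found.lower() == expected:
                        return None
                    return "hostname %r does not match file name" % found
    except OSError as exc:
        return "unreadable: %s" % exc
    return "no hostname line found"


def find_suspect_files(spool_dir):
    """Map each suspicious file name in spool_dir to the reason it failed."""
    suspects = {}
    for name in sorted(os.listdir(spool_dir)):
        path = os.path.join(spool_dir, name)
        if not os.path.isfile(path):
            continue
        reason = check_exec_host_file(path)
        if reason is not None:
            suspects[name] = reason
    return suspects


# Only runs when a directory is given on the command line.
if __name__ == "__main__" and len(sys.argv) > 1:
    for name, reason in find_suspect_files(sys.argv[1]).items():
        print("%s: %s" % (name, reason))
```

It would be run with the exec_hosts spool directory as its only argument, and it only reads files. The same pattern (pick one key that every valid file of a given type should contain, then flag files that lack it) could be repeated for the other spool directories, but the key to test for in each of them is again something that would need to be checked by hand.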
