One last remark. After consulting the sysadmin who runs our nameserver and NIS, it seems there were several errors, including permission errors on the nameserver files on our authoritative nameserver, which caused the nameserver to fail silently when reading those files. (Sending named a HUP did not send the errors to the console; restarting it did.)
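
For anyone wanting to rule out the same kind of problem, here is a rough sketch of a readability check. It is not what our sysadmin actually ran; the function name check_zone_files is made up for illustration, and you have to supply the file paths yourself since they depend on your named configuration. It only tells you something useful if it is run as the account that named itself runs under.

```shell
#!/bin/sh
# Hypothetical helper (name and output format are illustrative only).
# Reports every given path that is missing or not readable by the
# current user. Returns 0 if all files are readable, 1 otherwise.
check_zone_files() {
    status=0
    for f in "$@"; do
        if [ ! -f "$f" ]; then
            # Path does not exist or is not a regular file.
            printf 'MISSING    %s\n' "$f"
            status=1
        elif [ ! -r "$f" ]; then
            # File exists but the current user cannot read it.
            printf 'UNREADABLE %s\n' "$f"
            status=1
        fi
    done
    return "$status"
}

# Example call (paths are placeholders, substitute your own):
#   check_zone_files /path/to/zonefile1 /path/to/zonefile2
```

A non-zero return with no HUP-time error on the console would be consistent with what we saw, but this is only a permissions check and says nothing about the syntax of the files.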
So overall, it was a string of small errors that were eventually corrected. Sorry for the problems, but thanks very much for the help in tracking them down.

hjm

On Friday 15 July 2011 12:29:34 Harry Mangalam wrote:
> On Friday 15 July 2011 03:06:26 you wrote:
> > You use either DNS or NIS, not both. What is the order in
> > /etc/nsswitch.conf
>
> The current /etc/nsswitch.conf config is:
> passwd: nis files
> shadow: nis files
> group: nis files
> hosts: files nis dns
> bootparams: nisplus [NOTFOUND=return] files
> ethers: files
> netmasks: files
> networks: files
> protocols: files
> rpc: files
> services: files
> netgroup: files nis
> publickey: nisplus
> automount: nis
> aliases: files nisplus
>
> (This brings up a complication that the NIS/YP side of things is
> handled by another admin who is not always available.)
>
> > I usually have all machines hard coded in /etc/hosts on the
> > headnode and read this by NIS on all nodes. No nameserver or DNS
> > involved inside the cluster.
>
> Well, the solution that you allude to (putting the IP/name info
> into the SGE qmaster /etc/hosts) //does// solve the problem. So
> for that problem, this is now solved.
>
> But I still don't understand why some of them booted and some
> didn't before that mod.
>
> But thanks very much for your time, suggestions and pointers.
>
> > -- Reuti
> >
> > > But after restarting the qmaster, the 'bad' nodes can re-join
> > > the qmaster: qhost | grep ^n
> > > n101 lx24-amd64 2 0.00 15.8G 89.2M 7.5G 0.0
> > > n102 lx24-amd64 2 0.00 15.8G 91.1M 7.5G 0.0
> > > (etc - all nodes rejoined)
> > > and why should it work after a qmaster restart..?
> > > (other commands return expected results.)
> > >
> > > > $ ./gethostbyname -all bduc-sched
> > >
> > > Hostname: bduc-sched.nacs.uci.edu
> > > SGE name: bduc-sched.nacs.uci.edu
> > > Aliases: bduc-sched
> > > Host Address(es): 128.200.15.19
> > >
> > > > $ ./gethostbyname -all n103
> > >
> > > Hostname: n103.bduc
> > > SGE name: n103.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.143
> > >
> > > > $ ./gethostname
> > >
> > > Hostname: bduc-sched.nacs.uci.edu
> > > Aliases: bduc-sched
> > > Host Address(es): 128.200.15.19
> > > # following range includes both connectors and non-connectors:
> > > for NAME in n101 n102 n103 n104 n105; do ./gethostbyname -all $NAME; done
> > > Hostname: n101.bduc
> > > SGE name: n101.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.120
> > > Hostname: n102.bduc (connects OK)
> > > SGE name: n102.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.196
> > > Hostname: n103.bduc
> > > SGE name: n103.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.143
> > > Hostname: n104.bduc
> > > SGE name: n104.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.121
> > > Hostname: n105.bduc
> > > SGE name: n105.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.106
> > >
> > > > Match all up for the particular machines? You use NIS or so
> > > > or all are recorded in local files?
> > > >
> > > > Did you enable/disable in SGE to honor the FQDN (recorded in
> > > > $SGE_ROOT/default/common/bootstrap)?
> > > >
> > > > -- Reuti
> > > >
> > > > > from one of the 'connected' nodes:
> > > > > hmangala@n102:~
> > > > > 501 $ qping bduc-sched 536 qmaster 1
> > > > > 07/14/2011 23:12:16 endpoint
> > > > > bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since
> > > > > 191495 seconds
> > > > > 07/14/2011 23:12:17 endpoint
> > > > > bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since
> > > > > 191496 seconds
> > > > > 07/14/2011 23:12:18 endpoint
> > > > > bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since
> > > > > 191497 seconds
> > > > > hmangala@n102:~
> > > > > 502 $ qping -info bduc-sched 536 qmaster 1
> > > > > 07/14/2011 23:12:59:
> > > > > SIRM version: 0.1
> > > > > SIRM message id: 1
> > > > > start time: 07/12/2011 18:00:41 (1310493641)
> > > > > run time [s]: 191538
> > > > > messages in read buffer: 0
> > > > > messages in write buffer: 0
> > > > > nr. of connected clients: 139
> > > > > status: 1
> > > > > info: MAIN: E (191538.46) |
> > > > > signaler000: E (191537.17) | event_master000: E (0.52) |
> > > > > timer000: E (0.52) | worker000: E (1.09) | worker001: E
> > > > > (0.90) | listener000: E (0.90) | listener001: E (1.25) |
> > > > > scheduler000: E (1.28) | WARNING malloc:
> > > > > arena(15892480) |ordblks(3939)
> > > > >
> > > > > | smblks(34) | hblksr(0) | hblhkd(0) usmblks(0) |
> > > > > | fsmblks(1232) uordblks(6483312) | fordblks(9409168) |
> > > > > | keepcost(133688)
> > > > >
> > > > > Monitor: disabled

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Unhappy? Grouchy, pedantic old geezer available to follow you
relentlessly until your current life seems like paradise.
