On 15.07.2011 at 21:29, Harry Mangalam wrote:

> On Friday 15 July 2011 03:06:26 you wrote:
> > You use either DNS or NIS, not both. What is the order in
> > /etc/nsswitch.conf?

Is it really a mixture of NIS and NISPLUS? I thought these need
different daemons.

> The current /etc/nsswitch.conf config is:
> passwd: nis files
> shadow: nis files

Up to now I have only seen these entries there:

passwd: compat
group: compat
shadow: compat

> group: nis files
> hosts: files nis dns

$ ypcat hosts

gives reliable results? Maybe there are false entries in the local
/etc/hosts which you copy to all nodes. If some of the nodes are listed
therein, they would get a name which could differ from the one
delivered by NIS. Can you compare the qmaster's /etc/hosts with the one
on the node and with the output of `ypcat hosts`? Maybe the NIS tables
weren't rebuilt by `make -C /var/yp` since the last change.

-- Reuti

> bootparams: nisplus [NOTFOUND=return] files
> ethers: files
> netmasks: files
> networks: files
> protocols: files
> rpc: files
> services: files
> netgroup: files nis
> publickey: nisplus
> automount: nis
> aliases: files nisplus
> (This brings up a complication that the NIS/YP side of things is handled by
> another admin who is not always available.)
> > I usually have all machines hard coded in /etc/hosts on the
> > headnode and read this by NIS on all nodes. No nameserver or DNS
> > involved inside the cluster.
> Well, the solution that you allude to (putting the IP/name info into the SGE
> qmaster /etc/hosts) //does// solve the problem. So for that problem, this is
> now solved.
> But I still don't understand why some of them booted and some didn't before
> that mod.
> But thanks very much for your time, suggestions and pointers.
> > -- Reuti
> >
> > > But after restarting the qmaster, the 'bad' nodes can re-join the
> > > qmaster: qhost | grep ^n
> > > n101 lx24-amd64 2 0.00 15.8G 89.2M 7.5G 0.0
> > > n102 lx24-amd64 2 0.00 15.8G 91.1M 7.5G 0.0
> > > (etc - all nodes rejoined)
> > > and why should it work after a qmaster restart..?
> > > (other commands return expected results.)
> > >
> > > > $ ./gethostbyname -all bduc-sched
> > >
> > > Hostname: bduc-sched.nacs.uci.edu
> > > SGE name: bduc-sched.nacs.uci.edu
> > > Aliases: bduc-sched
> > > Host Address(es): 128.200.15.19
> > >
> > > > $ ./gethostbyname -all n103
> > >
> > > Hostname: n103.bduc
> > > SGE name: n103.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.143
> > >
> > > > $ ./gethostname
> > >
> > > Hostname: bduc-sched.nacs.uci.edu
> > > Aliases: bduc-sched
> > > Host Address(es): 128.200.15.19
> > > # following range includes both connectors and non-connectors:
> > > for NAME in n101 n102 n103 n104 n105; do ./gethostbyname -all $NAME; done
> > > Hostname: n101.bduc
> > > SGE name: n101.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.120
> > > Hostname: n102.bduc (connects OK)
> > > SGE name: n102.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.196
> > > Hostname: n103.bduc
> > > SGE name: n103.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.143
> > > Hostname: n104.bduc
> > > SGE name: n104.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.121
> > > Hostname: n105.bduc
> > > SGE name: n105.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.106
> > >
> > > > Match all up for the particular machines? You use NIS or so or
> > > > all are recorded in local files?
> > > >
> > > > Did you enable/disable in SGE to honor the FQDN (recorded in
> > > > $SGE_ROOT/default/common/bootstrap)?
> > > >
> > > > -- Reuti
> > > >
> > > > > from one of the 'connected' nodes:
> > > > > hmangala@n102:~
> > > > > 501 $ qping bduc-sched 536 qmaster 1
> > > > > 07/14/2011 23:12:16 endpoint bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since 191495 seconds
> > > > > 07/14/2011 23:12:17 endpoint bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since 191496 seconds
> > > > > 07/14/2011 23:12:18 endpoint bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since 191497 seconds
> > > > > hmangala@n102:~
> > > > > 502 $ qping -info bduc-sched 536 qmaster 1
> > > > > 07/14/2011 23:12:59:
> > > > > SIRM version: 0.1
> > > > > SIRM message id: 1
> > > > > start time: 07/12/2011 18:00:41 (1310493641)
> > > > > run time [s]: 191538
> > > > > messages in read buffer: 0
> > > > > messages in write buffer: 0
> > > > > nr. of connected clients: 139
> > > > > status: 1
> > > > > info: MAIN: E (191538.46) | signaler000: E (191537.17) | event_master000: E (0.52) | timer000: E (0.52) | worker000: E (1.09) | worker001: E (0.90) | listener000: E (0.90) | listener001: E (1.25) | scheduler000: E (1.28) | WARNING
> > > > > malloc: arena(15892480) |ordblks(3939) | smblks(34) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(1232) uordblks(6483312) | fordblks(9409168) | keepcost(133688)
> > > > > Monitor: disabled
> --
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
> --
> Unhappy? Grouchy, pedantic old geezer available to follow you
> relentlessly until your current life seems like paradise.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
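
The comparison suggested in this thread (the qmaster's /etc/hosts against a node's copy and against the output of `ypcat hosts`) could be done by hand, or roughly sketched in Python as below. This is only an illustration under assumptions: the function names `parse_hosts` and `diff_hosts` are made up here, both inputs are assumed to be in the usual hosts format (address first, then names and aliases), and nothing in it is part of SGE or NIS.

```python
def parse_hosts(text):
    """Map each host name or alias (lowercased) to the set of addresses
    it is listed under in hosts-format text."""
    table = {}
    for line in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue  # an address without any name carries no mapping
        address, names = fields[0], fields[1:]
        for name in names:
            table.setdefault(name.lower(), set()).add(address)
    return table


def diff_hosts(left_text, right_text):
    """Return {name: (left_addresses, right_addresses)} for every name
    whose address sets differ between the two sources. A name present
    in only one source is reported with an empty set on the other side."""
    left, right = parse_hosts(left_text), parse_hosts(right_text)
    mismatches = {}
    for name in sorted(set(left) | set(right)):
        l_addrs = left.get(name, set())
        r_addrs = right.get(name, set())
        if l_addrs != r_addrs:
            mismatches[name] = (l_addrs, r_addrs)
    return mismatches
```

To use it, one would read /etc/hosts on the qmaster (or on a node) into one string, capture the text printed by `ypcat hosts` into another, and pass both to `diff_hosts`; an empty result means the two sources agree on every name they list.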
