On 15.07.2011 at 02:19, Harry Mangalam wrote:
> Aha! maybe some progress, thanks to your pointers!
> On Thursday 14 July 2011 16:23:29 Reuti wrote:
> > Then please check on the qmaster and exec machine(s) the output
> > when you use the tools in $SGE_ROOT/utilbin/lx24-amd64 like
> >
> > $ ./gethostbyaddr -all 10.255.78.3
> Hostname: bduc-sched.nacs.uci.edu
> SGE name: bduc-sched.nacs.uci.edu
> Aliases: bduc-sched
> Host Address(es): 128.200.15.19
> BUT!:
> for IP in 120 196 143 121 106 241 184 170 161; do ./gethostbyaddr -all
> 10.255.78.${IP}; done
> # below corresponds to n101-n109
> on bduc-sched>
> $ for IP in 120 196 143 121 106 241 184 170 161; do ./gethostbyaddr -all
> 10.255.78.${IP}; done
> error resolving ip "10.255.78.120": can't resolve ip address (h_errno =
> HOST_NOT_FOUND)
> error resolving ip "10.255.78.196": can't resolve ip address (h_errno =
> HOST_NOT_FOUND)
> error resolving ip "10.255.78.143": can't resolve ip address (h_errno =
> HOST_NOT_FOUND)
> error resolving ip "10.255.78.121": can't resolve ip address (h_errno =
> HOST_NOT_FOUND)
> error resolving ip "10.255.78.106": can't resolve ip address (h_errno =
> HOST_NOT_FOUND)
> error resolving ip "10.255.78.241": can't resolve ip address (h_errno =
> HOST_NOT_FOUND)
> error resolving ip "10.255.78.184": can't resolve ip address (h_errno =
> HOST_NOT_FOUND)
> error resolving ip "10.255.78.170": can't resolve ip address (h_errno =
> HOST_NOT_FOUND)
> error resolving ip "10.255.78.161": can't resolve ip address (h_errno =
> HOST_NOT_FOUND)
> This seems to be the result of an out-of-date /etc/resolv.conf file, which
> pointed to the public IP# of the nameserver (the head node) instead of the
> private IP# (10.255.78.2), which the rest of the cluster uses. (When executed
> on other cluster nodes, it resolves fine.)
> But when that file is corrected, I still get the same results as above, even
> after restarting the local ypserv and ypbind (and the qmaster).
You use either DNS or NIS, not both. What is the order in /etc/nsswitch.conf?
I usually have all machines hard-coded in /etc/hosts on the headnode and serve
this file via NIS to all nodes. No nameserver or DNS involved inside the cluster.
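
To see the order in which the resolver consults its sources, something like
the following can help. It is only a sketch: the function name hosts_order is
made up for illustration, and it assumes the usual "hosts: ..." line format of
nsswitch.conf.

```shell
# Print the lookup sources configured for "hosts" in an nsswitch.conf-style
# file. hosts_order is a hypothetical helper name, not part of SGE or glibc.
# With no argument it reads /etc/nsswitch.conf; pass a path to inspect a copy.
hosts_order() {
    awk '$1 == "hosts:" { $1 = ""; sub(/^ +/, ""); print }' \
        "${1:-/etc/nsswitch.conf}"
}
```

Run hosts_order on the qmaster and on one exec node and compare the output.
"getent hosts 10.255.78.143" then resolves an address through exactly that
order, and "ypcat hosts" shows what the NIS map itself contains.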
-- Reuti
> But after restarting the qmaster, the 'bad' nodes can re-join the qmaster:
> qhost | grep ^n
> n101 lx24-amd64 2 0.00 15.8G 89.2M 7.5G 0.0
> n102 lx24-amd64 2 0.00 15.8G 91.1M 7.5G 0.0
> (etc - all nodes rejoined)
> and why should it work after a qmaster restart..?
> (other commands return expected results.)
> > $ ./gethostbyname -all bduc-sched
> Hostname: bduc-sched.nacs.uci.edu
> SGE name: bduc-sched.nacs.uci.edu
> Aliases: bduc-sched
> Host Address(es): 128.200.15.19
> > $ ./gethostbyname -all n103
> Hostname: n103.bduc
> SGE name: n103.bduc
> Aliases:
> Host Address(es): 10.255.78.143
> > $ ./gethostname
> Hostname: bduc-sched.nacs.uci.edu
> Aliases: bduc-sched
> Host Address(es): 128.200.15.19
> # following range includes both connectors and non-connectors:
> for NAME in n101 n102 n103 n104 n105; do ./gethostbyname -all $NAME; done
> Hostname: n101.bduc
> SGE name: n101.bduc
> Aliases:
> Host Address(es): 10.255.78.120
> Hostname: n102.bduc (connects OK)
> SGE name: n102.bduc
> Aliases:
> Host Address(es): 10.255.78.196
> Hostname: n103.bduc
> SGE name: n103.bduc
> Aliases:
> Host Address(es): 10.255.78.143
> Hostname: n104.bduc
> SGE name: n104.bduc
> Aliases:
> Host Address(es): 10.255.78.121
> Hostname: n105.bduc
> SGE name: n105.bduc
> Aliases:
> Host Address(es): 10.255.78.106
> > Does everything match up for the particular machines? Do you use NIS
> > or similar, or are all hosts recorded in local files?
> >
> > Did you enable or disable honoring the FQDN in SGE (recorded in
> > $SGE_ROOT/default/common/bootstrap)?
> >
> > -- Reuti
> >
> > > from one of the 'connected' nodes:
> > > hmangala@n102:~
> > > 501 $ qping bduc-sched 536 qmaster 1
> > > 07/14/2011 23:12:16 endpoint bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since 191495 seconds
> > > 07/14/2011 23:12:17 endpoint bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since 191496 seconds
> > > 07/14/2011 23:12:18 endpoint bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since 191497 seconds
> > > hmangala@n102:~
> > > 502 $ qping -info bduc-sched 536 qmaster 1
> > > 07/14/2011 23:12:59:
> > > SIRM version: 0.1
> > > SIRM message id: 1
> > > start time: 07/12/2011 18:00:41 (1310493641)
> > > run time [s]: 191538
> > > messages in read buffer: 0
> > > messages in write buffer: 0
> > > nr. of connected clients: 139
> > > status: 1
> > > info: MAIN: E (191538.46) | signaler000: E
> > > (191537.17) | event_master000: E (0.52) | timer000: E (0.52) |
> > > worker000: E (1.09) | worker001: E (0.90) | listener000: E
> > > (0.90) | listener001: E (1.25) | scheduler000: E (1.28) |
> > > WARNING malloc: arena(15892480) |ordblks(3939)
> > > | smblks(34) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(1232)
> > > | uordblks(6483312) | fordblks(9409168) | keepcost(133688)
> > > Monitor: disabled
> --
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
> --
> Unhappy? Grouchy, pedantic old geezer available to follow you
> relentlessly until your current life seems like paradise.
_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users