Aha!  maybe some progress, thanks to your pointers!

On Thursday 14 July 2011 16:23:29 Reuti wrote:
> Then please check on the qmaster and exec machine(s) the output
> when you use the tools in $SGE_ROOT/utilbin/lx24_amd64 like
> 
> $ ./gethostbyaddr -all 10.255.78.3
Hostname: bduc-sched.nacs.uci.edu
SGE name: bduc-sched.nacs.uci.edu
Aliases:  bduc-sched 
Host Address(es): 128.200.15.19 

BUT!:

for IP in 120 196 143 121 106 241 184 170 161; do ./gethostbyaddr -all 10.255.78.${IP}; done

# below corresponds to n101-n109
on bduc-sched>
$ for IP in 120 196 143 121 106 241 184 170 161; do ./gethostbyaddr -all 10.255.78.${IP}; done
error resolving ip "10.255.78.120": can't resolve ip address (h_errno = HOST_NOT_FOUND)
error resolving ip "10.255.78.196": can't resolve ip address (h_errno = HOST_NOT_FOUND)
error resolving ip "10.255.78.143": can't resolve ip address (h_errno = HOST_NOT_FOUND)
error resolving ip "10.255.78.121": can't resolve ip address (h_errno = HOST_NOT_FOUND)
error resolving ip "10.255.78.106": can't resolve ip address (h_errno = HOST_NOT_FOUND)
error resolving ip "10.255.78.241": can't resolve ip address (h_errno = HOST_NOT_FOUND)
error resolving ip "10.255.78.184": can't resolve ip address (h_errno = HOST_NOT_FOUND)
error resolving ip "10.255.78.170": can't resolve ip address (h_errno = HOST_NOT_FOUND)
error resolving ip "10.255.78.161": can't resolve ip address (h_errno = HOST_NOT_FOUND)
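
The failures above are reverse lookups only, so it may help to check each node name forward and then backward in one pass. The following is a minimal sketch, assuming a glibc system where `getent hosts` is available; `check_roundtrip` is a hypothetical helper name, and `getent` follows nsswitch.conf, so its answers may differ from what the SGE utilities report.

```shell
# Forward-then-reverse lookup check using getent (assumes a glibc system).
# check_roundtrip is a hypothetical helper, not part of SGE.
# It prints one status line per host name.
check_roundtrip() {
    name="$1"
    # Forward lookup: first field of the getent line is the address.
    ip="$(getent hosts "$name" | awk '{ print $1; exit }')"
    if [ -z "$ip" ]; then
        echo "$name: forward lookup failed"
        return 1
    fi
    # Reverse lookup: second field of the getent line is the canonical name.
    back="$(getent hosts "$ip" | awk '{ print $2; exit }')"
    if [ -z "$back" ]; then
        echo "$name: $ip has no reverse mapping"
        return 1
    fi
    echo "$name: $ip -> $back"
}
```

It could be run as `for NAME in n101 n102 n103; do check_roundtrip "$NAME"; done` on both the qmaster and an exec host to see where the two directions disagree.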


This seems to be the result of a stale /etc/resolv.conf file, 
which pointed to the public IP# of the nameserver (the head node) 
instead of the private IP# (10.255.78.2) that the rest of the 
cluster uses. (When executed on other cluster nodes, the lookup 
resolves fine.)
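
A small check like the one below could confirm which nameserver a given machine actually lists first. It is only a sketch: `first_nameserver` and `check_resolver` are hypothetical helper names, and it assumes the standard resolv.conf layout of one `nameserver ADDRESS` entry per line.

```shell
# Print the first "nameserver" address in a resolv.conf-style file.
# first_nameserver is a hypothetical helper, not part of SGE.
first_nameserver() {
    awk '$1 == "nameserver" { print $2; exit }' "${1:-/etc/resolv.conf}"
}

# Compare the first nameserver against the address we expect to see.
# Usage: check_resolver EXPECTED_IP [FILE]
check_resolver() {
    expected="$1"
    file="${2:-/etc/resolv.conf}"
    actual="$(first_nameserver "$file")"
    if [ "$actual" = "$expected" ]; then
        echo "OK: $file uses $expected"
    else
        echo "MISMATCH: $file uses ${actual:-<none>}, expected $expected"
        return 1
    fi
}
```

On this cluster the call would be `check_resolver 10.255.78.2`, run on the qmaster and on a node that resolves correctly.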

But when that file is corrected, I still get the same results as 
above, even after restarting the local ypserv and ypbind (and the 
qmaster).

But after restarting the qmaster, the 'bad' nodes can re-join the 
qmaster:
qhost | grep ^n 
n101                    lx24-amd64      2  0.00   15.8G   89.2M    7.5G     0.0
n102                    lx24-amd64      2  0.00   15.8G   91.1M    7.5G     0.0

(etc - all nodes rejoined)
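
To spot nodes that have dropped out again without reading the whole table, the qhost output can be filtered for hosts with no load value. This is a sketch under two assumptions: that qhost prints "-" in the LOAD column for hosts that are not reporting, and that the column order matches the output above (HOSTNAME ARCH NCPU LOAD ...). `hosts_without_load` is a hypothetical helper name.

```shell
# Read qhost output on stdin and print hosts whose LOAD column is "-".
# Assumes LOAD is the 4th column; the "global" pseudo-host is skipped
# because it always shows "-" in every column.
hosts_without_load() {
    awk '$1 != "global" && $4 == "-" { print $1 }'
}
```

Usage would be `qhost | hosts_without_load`; an empty result suggests every exec host is reporting.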

And why should it work after a qmaster restart?

(other commands return expected results.)

> $ ./gethostbyname -all bduc-sched

Hostname: bduc-sched.nacs.uci.edu
SGE name: bduc-sched.nacs.uci.edu
Aliases:  bduc-sched 
Host Address(es): 128.200.15.19 


> $ ./gethostbyname -all n103

Hostname: n103.bduc
SGE name: n103.bduc
Aliases:  
Host Address(es): 10.255.78.143 


> $ ./gethostname
Hostname: bduc-sched.nacs.uci.edu
Aliases:  bduc-sched 
Host Address(es): 128.200.15.19 


# the following range includes both connectors and non-connectors:
for NAME in n101 n102 n103 n104 n105; do ./gethostbyname -all $NAME; done

Hostname: n101.bduc
SGE name: n101.bduc
Aliases:  
Host Address(es): 10.255.78.120 

Hostname: n102.bduc (connects OK)
SGE name: n102.bduc
Aliases:  
Host Address(es): 10.255.78.196 

Hostname: n103.bduc
SGE name: n103.bduc
Aliases:  
Host Address(es): 10.255.78.143 

Hostname: n104.bduc
SGE name: n104.bduc
Aliases:  
Host Address(es): 10.255.78.121 

Hostname: n105.bduc
SGE name: n105.bduc
Aliases:  
Host Address(es): 10.255.78.106 





> Match all up for the particular machines? You use NIS or so or all
> are recorded in local files?
> 
> Did you enable/disable in SGE to honor the FQDN (recorded in
> $SGE_ROOT/default/common/bootstrap)?
> 
> -- Reuti
> 
> > from one of the 'connected' nodes:
> > hmangala@n102:~
> > 501 $  qping bduc-sched 536 qmaster 1
> > 07/14/2011 23:12:16 endpoint bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since 191495 seconds
> > 07/14/2011 23:12:17 endpoint bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since 191496 seconds
> > 07/14/2011 23:12:18 endpoint bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since 191497 seconds
> > hmangala@n102:~
> > 502 $ qping -info bduc-sched 536 qmaster 1
> > 07/14/2011 23:12:59:
> > SIRM version:             0.1
> > SIRM message id:          1
> > start time:               07/12/2011 18:00:41 (1310493641)
> > run time [s]:             191538
> > messages in read buffer:  0
> > messages in write buffer: 0
> > nr. of connected clients: 139
> > status:                   1
> > info:                     MAIN: E (191538.46) | signaler000: E (191537.17) | event_master000: E (0.52) | timer000: E (0.52) | worker000: E (1.09) | worker001: E (0.90) | listener000: E (0.90) | listener001: E (1.25) | scheduler000: E (1.28) | WARNING
> > malloc:                   arena(15892480) |ordblks(3939) | smblks(34) | hblksr(0) | hblhkd(0) usmblks(0) | fsmblks(1232) | uordblks(6483312) | fordblks(9409168) | keepcost(133688)
> > Monitor:                  disabled

-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697  Google Voice Multiplexer: (949) 478-4487 
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Unhappy? Grouchy, pedantic old geezer available to follow you 
relentlessly until your current life seems like paradise.