Re: [gridengine users] Perceus-booted nodes can't start sgeexecd

Harry Mangalam Fri, 15 Jul 2011 16:17:50 -0700

One last remark.  

After consulting with the sysadmin who runs our nameserver and NIS, it 
seems there were several errors, including permission errors, on the 
nameserver files on our authoritative nameserver, which caused the 
nameserver to silently fail to read those files. (HUP'ing named 
didn't send the errors to the console; restarting it did.)
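
For anyone hitting the same thing, here is a minimal sketch of how such a silent failure might be caught before a restart. It assumes BIND's named-checkconf is installed; the config path and the helper name list_unreadable are my own illustrative assumptions, not details of our setup.

```shell
#!/bin/sh
# list_unreadable: print each path given as an argument that the
# current user cannot read (hypothetical helper, not part of BIND).
list_unreadable() {
    for f in "$@"; do
        [ -r "$f" ] || printf '%s\n' "$f"
    done
}

# Assumed config location; adjust for the local install.
CONF=${CONF:-/etc/named.conf}

if command -v named-checkconf >/dev/null 2>&1 && [ -r "$CONF" ]; then
    # -z also test-loads the master zones named in the config, so
    # problems are printed here rather than surfacing only when the
    # daemon is restarted.
    named-checkconf -z "$CONF" || echo "named-checkconf reported errors" >&2
fi
```

Running this as the account the daemon runs under (often, but not always, a user called named) and passing the zone files to list_unreadable should show any file that account cannot read.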


So overall, a string of small errors that were eventually corrected.

Sorry for the problems but thanks very much for the help in tracking 
them down.

hjm

On Friday 15 July 2011 12:29:34 Harry Mangalam wrote:
> On Friday 15 July 2011 03:06:26 you wrote:
> > You use either DNS or NIS, not both. What is the order in
> > /etc/nsswitch.conf
> 
> The current /etc/nsswitch.conf config is:
> passwd:     nis files
> shadow:     nis files
> group:      nis files
> hosts:      files nis dns
> bootparams: nisplus [NOTFOUND=return] files
> ethers:     files
> netmasks:   files
> networks:   files
> protocols:  files
> rpc:        files
> services:   files
> netgroup:   files nis
> publickey:  nisplus
> automount:  nis
> aliases:    files nisplus
> 
> (This brings up a complication: the NIS/YP side of things is
> handled by another admin who is not always available.)
> 
> > I usually have all machines hard coded in /etc/hosts on the
> > headnode and read this by NIS on all nodes. No nameserver or DNS
> > involved inside the cluster.
> 
> Well, the solution that you allude to (putting the IP/name info
> into the SGE qmaster /etc/hosts) //does// solve the problem.  So
> for that problem, this is now solved.
> 
> But I still don't understand why some of them booted and some
> didn't before that mod.
> 
> But thanks very much for your time, suggestions and pointers.
> 
> > -- Reuti
> > 
> > > > But after restarting the qmaster, the 'bad' nodes can re-join
> > > > the qmaster:
> > > > qhost | grep ^n
> > > > n101                    lx24-amd64      2  0.00   15.8G   89.2M    7.5G     0.0
> > > > n102                    lx24-amd64      2  0.00   15.8G   91.1M    7.5G     0.0
> > > > (etc - all nodes rejoined)
> > > > and why should it work after a qmaster restart..?
> > > > (other commands return expected results.)
> > > 
> > > > $ ./gethostbyname -all bduc-sched
> > > 
> > > Hostname: bduc-sched.nacs.uci.edu
> > > SGE name: bduc-sched.nacs.uci.edu
> > > Aliases:  bduc-sched
> > > Host Address(es): 128.200.15.19
> > > 
> > > > $ ./gethostbyname -all n103
> > > 
> > > Hostname: n103.bduc
> > > SGE name: n103.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.143
> > > 
> > > > $ ./gethostname
> > > 
> > > Hostname: bduc-sched.nacs.uci.edu
> > > Aliases:  bduc-sched
> > > Host Address(es): 128.200.15.19
> > > > # following range includes both connectors and non-connectors:
> > > > for NAME in n101 n102 n103 n104 n105; do ./gethostbyname -all $NAME; done
> > > > Hostname: n101.bduc
> > > SGE name: n101.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.120
> > > Hostname: n102.bduc (connects OK)
> > > SGE name: n102.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.196
> > > Hostname: n103.bduc
> > > SGE name: n103.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.143
> > > Hostname: n104.bduc
> > > SGE name: n104.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.121
> > > Hostname: n105.bduc
> > > SGE name: n105.bduc
> > > Aliases:
> > > Host Address(es): 10.255.78.106
> > > 
> > > > Match all up for the particular machines? You use NIS or so
> > > > or all are recorded in local files?
> > > > 
> > > > Did you enable/disable in SGE to honor the FQDN (recorded in
> > > > $SGE_ROOT/default/common/bootstrap)?
> > > > 
> > > > -- Reuti
> > > > 
> > > > > from one of the 'connected' nodes:
> > > > > hmangala@n102:~
> > > > > 501 $  qping bduc-sched 536 qmaster 1
> > > > > 07/14/2011 23:12:16 endpoint
> > > > > bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since
> > > > > 191495 seconds 07/14/2011 23:12:17 endpoint
> > > > > bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since
> > > > > 191496 seconds 07/14/2011 23:12:18 endpoint
> > > > > bduc-sched.nacs.uci.edu/qmaster/1 at port 536 is up since
> > > > > 191497 seconds hmangala@n102:~
> > > > > 502 $ qping -info bduc-sched 536 qmaster 1
> > > > > 07/14/2011 23:12:59:
> > > > > SIRM version:             0.1
> > > > > SIRM message id:          1
> > > > > start time:               07/12/2011 18:00:41 (1310493641)
> > > > > run time [s]:             191538
> > > > > messages in read buffer:  0
> > > > > messages in write buffer: 0
> > > > > nr. of connected clients: 139
> > > > > status:                   1
> > > > > info:                     MAIN: E (191538.46) |
> > > > > signaler000: E (191537.17) | event_master000: E (0.52) |
> > > > > timer000: E (0.52) | worker000: E (1.09) | worker001: E
> > > > > (0.90) | listener000: E (0.90) | listener001: E (1.25) |
> > > > > > scheduler000: E (1.28) | WARNING malloc:
> > > > > > arena(15892480) |ordblks(3939) | smblks(34) | hblksr(0) |
> > > > > > hblhkd(0) usmblks(0) | fsmblks(1232) uordblks(6483312) |
> > > > > > fordblks(9409168) | keepcost(133688)
> > > > > 
> > > > > Monitor:                  disabled
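
The fix discussed above (listing every node in /etc/hosts on the qmaster) only helps if no node is left out. A minimal sketch of a check for that, assuming a standard hosts-format file; the helper name missing_from_hosts is hypothetical:

```shell
#!/bin/sh
# missing_from_hosts FILE NAME...
# Print each NAME that has no entry (canonical name or alias) in the
# hosts-format FILE. Hypothetical helper for illustration only.
missing_from_hosts() {
    file=$1
    shift
    for name in "$@"; do
        awk -v n="$name" '
            { sub(/#.*/, "") }    # drop comments before matching
            { for (i = 2; i <= NF; i++) if ($i == n) found = 1 }
            END { exit !found }
        ' "$file" || printf '%s\n' "$name"
    done
}
```

For the nodes in this thread it could be called as: missing_from_hosts /etc/hosts n101 n102 n103 n104 n105. It only inspects the file, so it says nothing about what NIS or DNS would return for the same names.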

-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697  Google Voice Multiplexer: (949) 478-4487 
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Unhappy? Grouchy, pedantic old geezer available to follow you 
relentlessly until your current life seems like paradise.
