On Thursday 14 July 2011 15:31:25 Reuti wrote:
> Hi,
> 
> On 14.07.2011 at 23:50, Harry Mangalam wrote:
> > Hi All,
> > We're adding a Perceus-provisioned sub-cluster of 25 nodes to an
> > existing heterogeneous (94%CentOS & 6%Ubuntu) but otherwise
> > stable and well-functioning cluster (~190 nodes), running SGE
> > 6.2.  The Perceus VNFS provides a small Debian-derived RAM-only
> > OS (called gravityOS).  The Perceus-provisioned nodes have some
> > odd issues with SGE.  When we boot the perceus nodes, they will
> > execute the sgeexecd startup script but only ~8/25 will actually
> > be recognized by the qmaster.  It is usually the same nodes that
> > fail on reboot, but not always.  8 nodes always seem to restart
> > correctly; 2-4 are variable, and the rest never start up
> > normally. In the /tmp/execd_messages.* log, from the ones that
> > don't get picked up, we get the std exception error:
> > =========================================
> > 07/11/2011 22:04:07|  main|n101|E|can't connect to service
> 
> On one of the nodes that does not connect: can you telnet to the
> qmaster on port 6444, or whichever port you defined for the qmaster?

Hi Reuti - thanks for the timely note.

Our SGE environment is fairly standard, so the ports are qmaster: 536, execd: 537

SGE_EXECD_PORT=537
SGE_QMASTER_PORT=536
SGE_ROOT=/sge62

and yes, I can telnet to ports 537 and 536 on at least several of the 
failing nodes.

hmangala@n101:~
503 $ telnet bduc-sched 536
Trying 10.255.78.3...
Connected to bduc-sched.nacs.uci.edu (10.255.78.3).
Escape character is '^]'.

...
hmangala@n103:~
504 $ telnet bduc-sched 536
Trying 10.255.78.3...
Connected to bduc-sched.nacs.uci.edu (10.255.78.3).
Escape character is '^]'.
...
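
For checking many nodes, the telnet test above can be scripted. What follows is a minimal sketch, not part of the original exchange: the function name can_connect and the 3-second timeout are assumptions chosen for illustration. It only shows that a TCP connection to the port can be opened, which is all the telnet test shows; it says nothing about whether the SGE handshake itself succeeds.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port opens within timeout.

    Hypothetical helper: mirrors what 'telnet host port' demonstrates,
    i.e. that the port is reachable, and nothing more.
    """
    try:
        # create_connection resolves the name and connects; the 'with'
        # block closes the socket again as soon as the check is done.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
```

Run on a failing node, can_connect("bduc-sched", 536) would be the equivalent of the telnet sessions shown above.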

> 
> Are all nodes always getting the same TCP/IP address?

Yes, via an 'ethers' file - once the MAC is detected by Perceus, 
they'll always get the same IP # unless we explicitly delete them. I've 
verified this by rebooting them repeatedly, and they always come up 
with the same IP #s.  Here's a short result from 3 reboots:

node     reboot1        reboot2         reboot3
n101 [10.255.78.120] [10.255.78.120] [10.255.78.120]
n102 [10.255.78.196] [10.255.78.196] [10.255.78.196]
n103 [10.255.78.143] [10.255.78.143] [10.255.78.143]
n104 [10.255.78.121] [10.255.78.121] [10.255.78.121]
n105 [10.255.78.106] [10.255.78.106] [10.255.78.106]
n106 [10.255.78.241] [10.255.78.241] [10.255.78.241]
n107 [10.255.78.184] [10.255.78.184] [10.255.78.184]
n108 [10.255.78.170] [10.255.78.170] [10.255.78.170]
n109 [10.255.78.161] [10.255.78.161] [10.255.78.161]
...
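
The same comparison can be automated once the addresses are collected per reboot. This is a small sketch under assumptions: the function name unstable_nodes and the dictionary layout are invented for illustration, and only the first three rows of the table above are included as sample data.

```python
def unstable_nodes(ips_by_node):
    """Return the sorted names of nodes whose IP changed between reboots.

    ips_by_node maps a node name to the list of addresses it came up
    with, one entry per reboot (hypothetical layout for this sketch).
    """
    return sorted(
        node for node, ips in ips_by_node.items() if len(set(ips)) > 1
    )


# First three rows of the reboot table above.
observed = {
    "n101": ["10.255.78.120", "10.255.78.120", "10.255.78.120"],
    "n102": ["10.255.78.196", "10.255.78.196", "10.255.78.196"],
    "n103": ["10.255.78.143", "10.255.78.143", "10.255.78.143"],
}

print(unstable_nodes(observed))  # prints [] because no address changed
```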

> All nodes are connected to the same switch?

Yes - that rack has all of them going thru the same Gb switch.

> 
> -- Reuti

-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697  Google Voice Multiplexer: (949) 478-4487 
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Unhappy? Grouchy, pedantic old geezer available to follow you 
relentlessly until your current life seems like paradise.