On Thursday 14 July 2011 15:31:25 Reuti wrote:
> Hi,
>
> On 14.07.2011 at 23:50, Harry Mangalam wrote:
> > Hi All,
> >
> > We're adding a Perceus-provisioned sub-cluster of 25 nodes to an
> > existing heterogeneous (94% CentOS & 6% Ubuntu) but otherwise
> > stable and well-functioning cluster (~190 nodes), running SGE
> > 6.2. The Perceus VNFS provides a small Debian-derived RAM-only
> > OS (called gravityOS). The Perceus-provisioned nodes have some
> > odd issues with SGE. When we boot the Perceus nodes, they will
> > execute the sgeexecd startup script but only ~8/25 will actually
> > be recognized by the qmaster. It is usually the same nodes that
> > fail on reboot, but not always. 8 nodes always seem to restart
> > correctly; 2-4 are variable, and the rest never start up
> > normally. In the /tmp/execd_messages.* log, from the ones that
> > don't get picked up, we get the std exception error:
> > =========================================
> > 07/11/2011 22:04:07| main|n101|E|can't connect to service
>
> On one of the nodes that is not connecting: can you telnet to the
> qmaster on port 6444 or the port you defined for the qmaster?

Hi Reuti - thanks for the timely note.

Our SGE env is fairly std, so the ports are qmaster: 536, execd: 537

SGE_EXECD_PORT=537
SGE_QMASTER_PORT=536
SGE_ROOT=/sge62

and yes, I can telnet to ports 537 and 536 on at least several of the
failing nodes.

hmangala@n101:~ 503 $ telnet bduc-sched 536
Trying 10.255.78.3...
Connected to bduc-sched.nacs.uci.edu (10.255.78.3).
Escape character is '^]'.
...

hmangala@n103:~ 504 $ telnet bduc-sched 536
Trying 10.255.78.3...
Connected to bduc-sched.nacs.uci.edu (10.255.78.3).
Escape character is '^]'.
...

>
> Are all nodes always getting the same TCP/IP address?

Yes, via an 'ethers' file - once the MAC is detected by Perceus,
they'll always get the same IP # unless we explicitly delete them.
I've verified this by rebooting them repeatedly and they always come up
with the same IP #s. Here's a short result from 3 reboots:

node  reboot1          reboot2          reboot3
n101  [10.255.78.120]  [10.255.78.120]  [10.255.78.120]
n102  [10.255.78.196]  [10.255.78.196]  [10.255.78.196]
n103  [10.255.78.143]  [10.255.78.143]  [10.255.78.143]
n104  [10.255.78.121]  [10.255.78.121]  [10.255.78.121]
n105  [10.255.78.106]  [10.255.78.106]  [10.255.78.106]
n106  [10.255.78.241]  [10.255.78.241]  [10.255.78.241]
n107  [10.255.78.184]  [10.255.78.184]  [10.255.78.184]
n108  [10.255.78.170]  [10.255.78.170]  [10.255.78.170]
n109  [10.255.78.161]  [10.255.78.161]  [10.255.78.161]
...

> All nodes are connected to the same switch?

Yes - that rack has all of them going thru the same Gb switch.

>
> -- Reuti

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Unhappy? Grouchy, pedantic old geezer available to follow you
relentlessly until your current life seems like paradise.
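
The manual telnet test against the qmaster port could also be scripted so it can be repeated on every node after each reboot. The following is only a minimal sketch, not part of SGE or Perceus: the function name can_connect and the 3-second timeout are assumptions made for illustration, and it only checks that a TCP connection can be opened, exactly like the telnet test, not that the execd can actually register with the qmaster.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within `timeout`.

    Hypothetical helper: it mirrors the manual `telnet host port` check
    and says nothing about whether SGE itself is healthy.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Run on a failing node, can_connect("bduc-sched", 536) would correspond to the "telnet bduc-sched 536" sessions shown above, using the SGE_QMASTER_PORT value from this thread.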
