Hi,

On 14.07.2011 at 23:50, Harry Mangalam wrote:
> Hi All,
>
> We're adding a Perceus-provisioned sub-cluster of 25 nodes to an existing
> heterogeneous (94% CentOS & 6% Ubuntu) but otherwise stable and
> well-functioning cluster (~190 nodes), running SGE 6.2. The Perceus VNFS
> provides a small Debian-derived RAM-only OS (called gravityOS).
>
> The Perceus-provisioned nodes have some odd issues with SGE. When we boot
> the perceus nodes, they will execute the sgeexecd startup script but only
> ~8/25 will actually be recognized by the qmaster. It is usually the same
> nodes that fail on reboot, but not always. 8 nodes always seem to restart
> correctly; 2-4 are variable, and the rest never start up normally.
>
> In the /tmp/execd_messages.* log, from the ones that don't get picked up, we
> get the std exception error:
> =========================================
> 07/11/2011 22:04:07| main|n101|E|can't connect to service

On one of the non-connecting nodes: can you telnet to the qmaster on port 6444, or the port you defined for the qmaster? Are all nodes always getting the same TCP/IP address? Are all nodes connected to the same switch?

-- Reuti

> 07/11/2011 22:04:07| main|n101|E|can't get configuration from qmaster --
> backgrounding
> =========================================
>
> On reboot, the nodes come up and are available for ssh within 4 min, but even
> if we delay the sgeexecd script for another minute, they don't get picked up.
> This is particularly odd since the nodes are booting the SAME image.
>
> Running sgeexecd later manually does /not/ correct the problem either. At
> that point the only thing that allows them to be integrated into the cluster
> is restarting the qmaster and THEN starting sgeexecd on the previously
> excluded nodes. After this, everything gets included and things run as
> normal.
>
> We do not have this problem on either the older CentOS or Ubuntu nodes.
>
> So while we can do the above action to bring all of them into the SGE system,
> it's an oddity that we'd like to resolve.
> Anyone have insight into this?
>
> --
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> [ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
> --
> Unhappy? Grouchy, pedantic old geezer available to follow you
> relentlessly until your current life seems like paradise.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users
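
The two checks suggested in the reply (can the node open a TCP connection to the qmaster port, and which address does a hostname resolve to) could be scripted roughly as below instead of running telnet by hand on each node. This is only a sketch: port 6444 is the one mentioned in the thread, while the function names `can_connect` and `forward_addresses` are made up for illustration and are not part of SGE.

```python
import socket

QMASTER_PORT = 6444  # port named in the thread; change it if your qmaster uses another


def can_connect(host, port=QMASTER_PORT, timeout=5.0):
    """Return True if a TCP connection to host:port can be opened.

    Roughly what 'telnet host port' tells you: refused, timed-out, and
    unresolvable hosts all come back as False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refusal, timeout, and name-resolution failure
        return False


def forward_addresses(hostname):
    """Return the sorted IPv4 addresses that hostname resolves to.

    Comparing this across reboots shows whether a node keeps the same address.
    """
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})
```

Run on a node that fails to register, `can_connect("qmaster.example")` (a placeholder hostname) returning False would point at a network or port problem rather than at sgeexecd itself, and `forward_addresses(socket.gethostname())` can be logged at each boot to see whether the address changes.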
