Hi All,

We're adding a Perceus-provisioned sub-cluster of 25 nodes to an existing heterogeneous (94% CentOS and 6% Ubuntu) but otherwise stable and well-functioning cluster (~190 nodes), running SGE 6.2. The Perceus VNFS provides a small Debian-derived RAM-only OS (called gravityOS).

The Perceus-provisioned nodes have some odd issues with SGE. When we boot the Perceus nodes, they all execute the sgeexecd startup script, but only ~8 of the 25 are actually recognized by the qmaster. It is usually the same nodes that fail on reboot, but not always: 8 nodes always seem to restart correctly, 2-4 are variable, and the rest never start up normally.

In the /tmp/execd_messages.* log on the nodes that don't get picked up, we get the standard error:

=========================================
07/11/2011 22:04:07| main|n101|E|can't connect to service
07/11/2011 22:04:07| main|n101|E|can't get configuration from qmaster -- backgrounding
=========================================

On reboot, the nodes come up and are available for ssh within 4 min, but even if we delay the sgeexecd script for another minute, they don't get picked up. This is particularly odd since the nodes are all booting the SAME image.

Running sgeexecd manually later does /not/ correct the problem either. At that point the only thing that allows them to be integrated into the cluster is restarting the qmaster and THEN starting sgeexecd on the previously excluded nodes. After this, everything gets included and things run as normal.

We do not have this problem on either the older CentOS or Ubuntu nodes. So while we can use the above workaround to bring all of the nodes into the SGE system, it's an oddity that we'd like to resolve.

Does anyone have insight into this?

--
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697 Google Voice Multiplexer: (949) 478-4487
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Unhappy? Grouchy, pedantic old geezer available to follow you relentlessly until your current life seems like paradise.
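
P.S. Since the log complains that the node "can't connect to service", one thing that could be compared between a working and a failing node is plain name resolution and TCP reachability of the qmaster. Below is a minimal sketch of such a check, not something taken from SGE itself. It assumes the qmaster listens on port 6444 (the conventional sge_qmaster port; the real value may be set via SGE_QMASTER_PORT or /etc/services), and the function names are made up for illustration. The qmaster host name has to be supplied on the command line.

```python
import socket
import sys

# Assumption: 6444 is the conventional sge_qmaster port. Check
# $SGE_QMASTER_PORT or /etc/services for the value used on your cluster.
QMASTER_PORT = 6444


def check_qmaster(host, port=QMASTER_PORT, timeout=5.0):
    """Resolve host and try a TCP connect; return (ip, reachable, detail)."""
    try:
        ip = socket.gethostbyname(host)
    except OSError as exc:
        return None, False, "name resolution failed: %s" % exc
    try:
        # create_connection raises OSError on refusal or timeout.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True, "connected"
    except OSError as exc:
        return ip, False, "connect failed: %s" % exc


def check_self_resolution():
    """Forward- and reverse-resolve this node's own host name.

    Returns (hostname, ip, reverse_name); ip or reverse_name is None
    when the corresponding lookup fails.
    """
    name = socket.gethostname()
    try:
        ip = socket.gethostbyname(name)
    except OSError:
        return name, None, None
    try:
        reverse = socket.gethostbyaddr(ip)[0]
    except OSError:
        reverse = None
    return name, ip, reverse


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.exit("usage: check_qmaster.py QMASTER_HOST [PORT]")
    port = int(sys.argv[2]) if len(sys.argv) > 2 else QMASTER_PORT
    print("self:", check_self_resolution())
    print("qmaster:", check_qmaster(sys.argv[1], port))
```

Running this on one node that registers and one that does not, and diffing the output, would at least show whether the two nodes see the qmaster (and themselves) under the same names and addresses. It does not test anything about SGE's own protocol.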
