Hi All,

We're adding a Perceus-provisioned sub-cluster of 25 nodes to an 
existing heterogeneous (94% CentOS and 6% Ubuntu) but otherwise stable 
and well-functioning cluster (~190 nodes), running SGE 6.2.  The Perceus 
VNFS provides a small Debian-derived RAM-only OS (called gravityOS).

The Perceus-provisioned nodes have some odd issues with SGE.  When we 
boot the Perceus nodes, they all execute the sgeexecd startup script, 
but only ~8 of the 25 are actually recognized by the qmaster.  It is 
usually the same nodes that fail on reboot, but not always.  8 nodes 
always seem to restart correctly; 2-4 are variable, and the rest 
never start up normally.

In the /tmp/execd_messages.* logs on the nodes that don't get picked 
up, we get the standard connection error:
=========================================
07/11/2011 22:04:07|  main|n101|E|can't connect to service
07/11/2011 22:04:07|  main|n101|E|can't get configuration from qmaster -- backgrounding
=========================================
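
To compare the failing nodes against each other, the messages files can be summarized per host. The following is only a minimal sketch, assuming every line uses the pipe-separated layout shown above (timestamp, thread, host, level, message); the function names are made up for illustration and are not part of SGE.

```python
from collections import Counter


def parse_execd_line(line):
    """Split one messages line into its five pipe-separated fields.

    Assumed layout (taken from the excerpt above):
    timestamp|thread|host|level|message
    Returns None for lines that do not match that layout.
    """
    parts = line.rstrip("\n").split("|", 4)
    if len(parts) != 5:
        return None
    timestamp, thread, host, level, message = (p.strip() for p in parts)
    return {
        "timestamp": timestamp,
        "thread": thread,
        "host": host,
        "level": level,
        "message": message,
    }


def hosts_with_errors(lines):
    """Count error-level ("E") entries per host."""
    counts = Counter()
    for line in lines:
        record = parse_execd_line(line)
        if record is not None and record["level"] == "E":
            counts[record["host"]] += 1
    return counts


# The two lines from the excerpt above, with the wrapped line rejoined.
sample = [
    "07/11/2011 22:04:07|  main|n101|E|can't connect to service",
    "07/11/2011 22:04:07|  main|n101|E|can't get configuration from qmaster -- backgrounding",
]

error_counts = hosts_with_errors(sample)
```

Feeding it the concatenated /tmp/execd_messages.* files collected from all 25 nodes would show at a glance which hosts log the connection error and how often.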

On reboot, the nodes come up and are available for ssh within 4 min, 
but even if we delay the sgeexecd script for another minute, they 
don't get picked up.  This is particularly odd since the nodes are 
booting the SAME image.
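
One way to separate a plain network timing problem from something on the qmaster side is to test, from a failing node, whether a TCP connection to the qmaster port can be opened at all before sgeexecd runs. The sketch below is hedged: the host name in the usage comment is a placeholder, and port 6444 is only the commonly used sge_qmaster default, so check SGE_QMASTER_PORT or /etc/services for the real value on your installation. A successful connect only shows the port is reachable; it says nothing about whether the qmaster will accept the execd.

```python
import socket
import time


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


def wait_for_service(host, port, attempts=5, delay=1.0, timeout=3.0):
    """Retry the connection a few times.

    Returns the number of the first successful attempt (starting at 1),
    or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        if can_connect(host, port, timeout=timeout):
            return attempt
        if attempt < attempts:
            time.sleep(delay)
    return None


# Hypothetical usage on a freshly booted node (placeholder host, assumed port):
#   result = wait_for_service("qmaster.example.org", 6444, attempts=12, delay=5.0)
#   print("reachable on attempt", result if result else "never")
```

If the port is reachable on the first attempt and sgeexecd still fails, the boot-time delay is probably not the issue and the problem lies elsewhere.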

Running sgeexecd later manually does /not/ correct the problem either.  
At that point the only thing that allows them to be integrated into 
the cluster is restarting the qmaster and THEN starting sgeexecd on 
the previously excluded nodes.  After this, everything gets included 
and things run as normal.

We do not have this problem on either the older CentOS or Ubuntu nodes.

So while we can do the above action to bring all of them into the SGE 
system, it's an oddity that we'd like to resolve.  Anyone have insight 
into this?

-- 
Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
[ZOT 2225] / 92697  Google Voice Multiplexer: (949) 478-4487 
MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
--
Unhappy? Grouchy, pedantic old geezer available to follow you 
relentlessly until your current life seems like paradise.