Hi,

On 14.07.2011 at 23:50, Harry Mangalam wrote:

> Hi All,
> We're adding a Perceus-provisioned sub-cluster of 25 nodes to an existing 
> heterogeneous (94%CentOS & 6%Ubuntu) but otherwise stable and 
> well-functioning cluster (~190 nodes), running SGE 6.2.  The Perceus VNFS 
> provides a small Debian-derived RAM-only OS (called gravityOS).
> The Perceus-provisioned nodes have some odd issues with SGE.  When we boot 
> the perceus nodes, they will execute the sgeexecd startup script but only 
> ~8/25 will actually be recognized by the qmaster.  It is usually the same 
> nodes that fail on reboot, but not always.  8 nodes always seem to restart 
> correctly; 2-4 are variable, and the rest never start up normally.
> In the /tmp/execd_messages.* log, from the ones that don't get picked up, we 
> get the std exception error:
> =========================================
> 07/11/2011 22:04:07|  main|n101|E|can't connect to service

On one of the nodes that fails to connect: can you telnet to the qmaster on
port 6444, or whichever port you defined for the qmaster?
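
If telnet is not installed in the RAM-only image, the same reachability check can be sketched in Python. This is only an illustration under assumptions: the function name `can_reach` and the hostname `qmaster.example.org` are placeholders that are not from this thread, and 6444 should be replaced if a different qmaster port was configured.

```python
import socket


def can_reach(host, port=6444, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout.

    Roughly equivalent to "telnet host port" connecting successfully.
    Any failure (refused, timed out, name not resolvable) returns False.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and DNS lookup errors.
        return False
```

Calling `can_reach("qmaster.example.org")` on a failing node and on a working node should show whether the problem is at the TCP level or above it.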

Does each node always get the same TCP/IP address?
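
One way to compare addresses across reboots is to record what each node's name resolves to right after boot. The sketch below is a suggestion, not something taken from this thread: the helper name `resolution_report` is made up, and it assumes IPv4 and that the reverse lookup is also of interest.

```python
import socket


def resolution_report(hostname):
    """Resolve hostname to an IPv4 address and look that address up again.

    Returns a dict with the name queried, the address found, and the
    name the reverse lookup gives (None if the reverse lookup fails).
    """
    address = socket.gethostbyname(hostname)
    try:
        reverse_name = socket.gethostbyaddr(address)[0]
    except OSError:
        # No reverse entry for this address.
        reverse_name = None
    return {"hostname": hostname, "address": address, "reverse": reverse_name}
```

Running `resolution_report(socket.gethostname())` on a node after each reboot and keeping the output would show whether the address or the reverse name changes between boots.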

Are all nodes connected to the same switch?

-- Reuti


> 07/11/2011 22:04:07|  main|n101|E|can't get configuration from qmaster -- 
> backgrounding
> =========================================
> On reboot, the nodes come up and are available for ssh within 4 min, but even 
> if we delay the sgeexecd script for another minute, they don't get picked up. 
>  This is particularly odd since the nodes are booting the SAME image.
> Running sgeexecd later manually does /not/ correct the problem either.  At 
> that point the only thing that allows them to be integrated into the cluster 
> is restarting the qmaster and THEN starting sgeexecd on the previously 
> excluded nodes.  After this, everything gets included and things run as 
> normal.
> We do not have this problem on the older CentOS or Ubuntu nodes.
> So while we can do the above action to bring all of them into the SGE system, 
> it's an oddity that we'd like to resolve.  Anyone have insight into this?
> -- 
> Harry Mangalam - Research Computing, OIT, Rm 225 MSTB, UC Irvine
> [ZOT 2225] / 92697  Google Voice Multiplexer: (949) 478-4487 
> MSTB Lat/Long: (33.642025,-117.844414) (paste into Google Maps)
> --
> Unhappy? Grouchy, pedantic old geezer available to follow you 
> relentlessly until your current life seems like paradise.
> _______________________________________________
> users mailing list
> [email protected]
> https://gridengine.org/mailman/listinfo/users

