Chad, 
 
That would be interesting to understand as well. What was the nature of
the fault in that case? An active-standby controller heart-beat timeout, or
an intra-node component health-check timeout? Knowing the nature of the
fault may help in understanding the problem further.
 
IMHO, the latter case may be considered OK, depending upon a node's
design. A CPU-bound infinite loop can significantly starve other
user-level processes on the node. Such starvation (leading to intra-node
health-check timeouts, say) can actually be considered a fault.
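
To make the intra-node case concrete, here is a minimal sketch in Python of
how a monitor would see a starved component. It is only an illustration and
not OpenSAF code: the function name, the timestamps and the 2.5 second
timeout are all assumptions made up for the example.

```python
def find_timeouts(reply_times, timeout):
    """Return (previous, current) reply timestamps whose gap exceeds timeout.

    reply_times: ascending timestamps (seconds) at which a monitored
    component answered its health check. Hypothetical data, not OpenSAF's.
    """
    late = []
    for prev, cur in zip(reply_times, reply_times[1:]):
        if cur - prev > timeout:
            late.append((prev, cur))
    return late


# The component answers once a second, then gets no CPU between t=3 and t=9
# (for example because another process is spinning in an infinite loop).
replies = [0.0, 1.0, 2.0, 3.0, 9.0, 10.0]
print(find_timeouts(replies, timeout=2.5))  # prints [(3.0, 9.0)]
```

The monitor cannot tell why the reply was late. A starved but otherwise
healthy process looks the same as a hung one, which is why treating it as a
fault can be a reasonable design choice.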
 
Phani

________________________________

From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of Chad Tindel
Sent: Friday, December 21, 2007 7:55 PM
To: [email protected]
Subject: Re: [Users] Instable cluster with CPU load



        Some of the OpenSAF threads are indeed run in real-time mode (that
        is, if you are running OpenSAF with "root" privileges). However,
        that may not be enough to handle stress on the following:

        (1)  Memory: I see that the "java" process has a huge size. What
        do the top few lines of your "top" dump show?

        (2)  Network traffic: Heart-beats may be getting queued (and
        delayed) behind ordinary traffic if network traffic is very high.
        This may (just may :-)) be the case if your CPU load is associated
        with a lot of network traffic.

        It may be useful to check the above two possibilities to help
        isolate the root cause.
        


Others in HP have reported the same behavior with just a normal-priority
process stuck in a CPU-bound infinite loop.

Chad


_______________________________________________
Users mailing list
[email protected]
http://list.opensaf.org/maillist/listinfo/users
