Chad,

That would be interesting to understand as well. What was the nature of the fault in that case: an active-standby controller heart-beat timeout, or an intra-node component health-check timeout? Information on the nature of the fault may be useful in understanding the problem further.

IMHO, the latter case may be considered OK, depending upon a node's design. A CPU-bound infinite loop can significantly starve other user-level processes in the node. Such starvation of other user-level processes (leading to intra-node health-check timeouts, say) can actually be considered a fault.

Phani
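To make the distinction between the two kinds of timeout concrete, here is a minimal sketch in Python. It is only an illustration: the function names, the source names ("peer-controller", "local-component") and the timeout value are all hypothetical and do not come from OpenSAF.

```python
# Hypothetical sketch: none of these names or values come from OpenSAF.

def expired_sources(last_seen, now, timeout):
    """Return the sources whose most recent heart-beat is older than `timeout` seconds."""
    return sorted(src for src, seen_at in last_seen.items() if now - seen_at > timeout)


def classify_fault(last_seen, now, timeout, peer="peer-controller"):
    """Label a timeout as inter-node (peer controller) or intra-node (local component)."""
    expired = expired_sources(last_seen, now, timeout)
    if not expired:
        return "healthy"
    if expired == [peer]:
        return "controller heart-beat timeout"
    if peer not in expired:
        return "intra-node health-check timeout"
    return "both"


# A starved local component misses its health check while the peer is still heard from.
last_seen = {"peer-controller": 10.0, "local-component": 4.0}
print(classify_fault(last_seen, now=10.5, timeout=3.0))
```

If only the local component is late, the cause is more likely starvation inside the node than a lost or delayed heart-beat on the network.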
________________________________
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On Behalf Of Chad Tindel
Sent: Friday, December 21, 2007 7:55 PM
To: [email protected]
Subject: Re: [Users] Instable cluster with CPU load

Some of the Opensaf threads are indeed run in real-time mode (that is, if you are running Opensaf with "root" privileges). However, that may not be enough to handle stress on the following:

(1) Memory: I see that the "java" process has a huge size. What do the top few lines of your "top" dump show?

(2) Network Traffic: Heart-beats may be getting queued (and delayed) behind ordinary traffic if network traffic is very high. This may (just may :-)) be the case if your CPU load is associated with a lot of network traffic.

It may be useful to check the above two possibilities to help isolate the root cause.

Others in HP have reported the same behavior just with a normal priority process stuck in a CPU-bound infinite loop.

Chad
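On the point about real-time mode, one way to check which scheduling policy a process is actually running under is sketched below in Python. It assumes a Linux-like system where `os.sched_getscheduler` is available. The helper name `scheduling_policy_name` is made up for this example, and nothing here is specific to Opensaf.

```python
import os

# Real-time policies exposed by Python's os module on platforms that support them.
REALTIME_POLICIES = {
    name: getattr(os, name)
    for name in ("SCHED_FIFO", "SCHED_RR")
    if hasattr(os, name)
}


def scheduling_policy_name(pid=0):
    """Return 'SCHED_FIFO', 'SCHED_RR', 'non-realtime', or 'unknown' for pid (0 = this process)."""
    if not hasattr(os, "sched_getscheduler"):
        return "unknown"  # platform does not expose the scheduler API
    policy = os.sched_getscheduler(pid)
    # Strip the reset-on-fork flag, which Linux may OR into the returned policy.
    policy &= ~getattr(os, "SCHED_RESET_ON_FORK", 0)
    for name, value in REALTIME_POLICIES.items():
        if policy == value:
            return name
    return "non-realtime"


print(scheduling_policy_name())
```

A process started without "root" privileges would normally report "non-realtime" here. Since only some threads are said to run in real-time mode, a per-thread check would be needed for a complete picture.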
