We have seen a problem that looks pretty bad: AMF reports health check timeouts for a couple of components simultaneously. Since there is probably nothing wrong with the components themselves, the possible reasons could be:

- the missing patch (soon to be integrated)
- AMF stops sending health checks
- an MDS/TIPC hang-up

Syslog excerpt:

Jan 10 11:12:52 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot -safComp=CompT_MQD,safSu=SuT_NCS_CNTLR,safNode=SC_2_1 faulted due to 6 -rcvr=9
Jan 10 11:12:52 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot -safComp=CompT_GLD,safSu=SuT_NCS_CNTLR,safNode=SC_2_1 faulted due to 6 -rcvr=9
Jan 10 11:12:57 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot - Some one has reset this card
Jan 10 11:12:58 SC_2_1 shutdown[12802]: shutting down for system reboot
Jan 10 11:13:04 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot -safComp=CompT_MAS,safSu=SuT_NCS_CNTLR,safNode=SC_2_1 faulted due to 6 -rcvr=9
Jan 10 11:13:06 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot -safComp=CompT_EDS,safSu=SuT_EDS,safNode=SC_2_1 faulted due to 6 -rcvr=9
Jan 10 11:13:39 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot -safComp=CompT_DTS,safSu=SuT_NCS_CNTLR,safNode=SC_2_1 faulted due to 6 -rcvr=9
Jan 10 10:13:53 SC_2_1 init: Switching to runlevel: 6
Jan 10 11:13:54 SC_2_1 shutdown: THE SYSTEM IS SHUTTING DOWN

There were no core dumps and nothing more interesting than this in the logs.

The problem has been seen once, maybe twice. Our application was running on the payloads, using checkpoints and events as mentioned before. The processor load was probably 50-60% on all processors (controllers and payloads). To be able to run at 60% load, we doubled rcHbInt to 6 s in BOM.xml.

I will try to generate debug info in the /etc/opt/opensaf/reboot script that OpenSAF calls (changing it from a symlink to a real script). This would be helpful if the problem is seen again.

What is your opinion?

Regards,
Hans
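
P.S. In case it helps when comparing against other occurrences, here is a minimal Python sketch for pulling the component, SU, node and the two numeric values out of fault lines like the ones in the syslog excerpt above. The regular expression is derived only from the lines shown there, and the field names (component, su, node, code, recovery) are labels I picked for illustration; the sketch does not try to interpret what "6" or "rcvr=9" mean.

```python
import re

# Pattern built only from the fault lines in the excerpt; it may need
# adjusting for other OpenSAF versions or log formats (assumption).
# Group names are illustrative labels, not official OpenSAF terms.
FAULT_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+ \d{2}:\d{2}:\d{2}) (?P<host>\S+) "
    r"ncs_scap: NCS_AvSv: Card going for reboot "
    r"-safComp=(?P<component>[^,]+),safSu=(?P<su>[^,]+),"
    r"safNode=(?P<node>\S+) faulted due to (?P<code>\d+) "
    r"-rcvr=(?P<recovery>\d+)$"
)


def parse_faults(lines):
    """Return one dict per component fault line; other lines are skipped."""
    faults = []
    for line in lines:
        match = FAULT_RE.match(line.strip())
        if match:
            faults.append(match.groupdict())
    return faults


# Three lines copied from the excerpt: two fault lines and one line
# that should be ignored by the parser.
SAMPLE = [
    "Jan 10 11:12:52 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot "
    "-safComp=CompT_MQD,safSu=SuT_NCS_CNTLR,safNode=SC_2_1 "
    "faulted due to 6 -rcvr=9",
    "Jan 10 11:12:57 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot "
    "- Some one has reset this card",
    "Jan 10 11:13:06 SC_2_1 ncs_scap: NCS_AvSv: Card going for reboot "
    "-safComp=CompT_EDS,safSu=SuT_EDS,safNode=SC_2_1 "
    "faulted due to 6 -rcvr=9",
]

faults = parse_faults(SAMPLE)
for fault in faults:
    # One line per faulted component: time, component, SU, raw values.
    print(fault["stamp"], fault["component"], fault["su"],
          fault["code"], fault["recovery"])
```

Feeding the whole of /var/log/messages through parse_faults() (the log path is an assumption about the installation) would give a quick list of which components were reported and how far apart in time the reports were.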
