We are trying to run a cluster with a CPU load of approximately 40%. The test
application uses CKPT and EVT.
The active controller logs that it has missed heartbeats from other nodes in
the cluster. When it misses the heartbeat from the standby controller, it
orders the standby to reboot.
The heartbeat settings in BOM.xml are the defaults delivered with OpenSAF:
<sndHbInt>1000</sndHbInt>
<rcvHbInt>3000</rcvHbInt>
Irrespective of these configuration values, I thought the system was designed
with real-time threads for managing critical protocols?
And it seems to be:
> SC_2_2# ps -eLfc | grep scap
> root 1371 1 1371 14 TS 27 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1372 14 RR 125 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1373 14 RR 130 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1374 14 RR 126 Dec17 ? 00:02:44
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1375 14 TS 29 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1376 14 TS 29 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1377 14 TS 29 Dec17 ? 00:00:02
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1378 14 TS 29 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1379 14 TS 29 Dec17 ? 00:00:04
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1380 14 TS 29 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1381 14 TS 29 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1382 14 TS 29 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1384 14 RR 126 Dec17 ? 00:01:07
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
> root 1371 1 1390 14 TS 29 Dec17 ? 00:00:00
> /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
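To summarize a listing like the one above, the scheduling classes can be tallied per thread. The following is only a minimal sketch: it assumes the usual `ps -eLfc` column order (UID PID PPID LWP NLWP CLS PRI STIME TTY TIME CMD) and that each thread is on a single, unwrapped line. The function name and the sample lines (three threads taken from the listing) are for illustration only.

```python
from collections import Counter


def count_sched_classes(ps_output: str) -> Counter:
    """Count threads per scheduling class (CLS column) in `ps -eLfc` output.

    Assumes the column order UID PID PPID LWP NLWP CLS PRI STIME TTY TIME CMD.
    """
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        # Skip the header line and anything too short to be a thread entry.
        if len(fields) < 11 or not fields[1].isdigit():
            continue
        counts[fields[5]] += 1  # sixth column is CLS (TS, RR, FF, ...)
    return counts


# Three threads copied from the listing above, joined onto single lines.
SAMPLE = """\
UID PID PPID LWP NLWP CLS PRI STIME TTY TIME CMD
root 1371 1 1371 14 TS 27 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
root 1371 1 1372 14 RR 125 Dec17 ? 00:00:00 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
root 1371 1 1374 14 RR 126 Dec17 ? 00:02:44 /opt/opensaf/controller/bin/ncs_scap ROLE=1 NID_SVC_ID=21
"""

if __name__ == "__main__":
    for cls, n in sorted(count_sched_classes(SAMPLE).items()):
        print(f"{cls}: {n} thread(s)")
```

In the full listing, 4 of the 14 ncs_scap threads run in the RR class and the remaining 10 run as TS.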
The process load on the active controller looks something like:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
2458 root 23 4 1389m 236m 8916 S 55 5.8 6:40.88 java
1635 root 20 4 1004m 8072 2228 S 15 0.2 1:54.67 rssServer
1394 root 26 4 52284 2076 1440 S 3 0.1 0:28.54 ncs_cpd
1324 root 20 4 54928 3692 1744 S 1 0.1 0:03.60 ncs_dts
1350 root 20 4 64244 13m 1428 S 0 0.3 0:01.82 ncs_eds
1398 root 23 4 52468 2716 2020 S 0 0.1 0:04.49 ncs_cpnd
Could there be some error in the AVD-AVD/AVND heartbeat design?
In the NCS-AVSV-MIB I see different default heartbeat values: 300 ms and
2000 ms, respectively.
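To make the difference between the two sets of defaults concrete, here is a rough sketch of how many consecutive heartbeats could be lost before the receiver gives up. It rests on assumptions that I have not verified against the OpenSAF source: that the receive timer restarts on every heartbeat received, that heartbeats are sent at perfectly regular intervals, that a heartbeat arriving exactly at the timeout counts as too late, and that the MIB values 300 ms and 2000 ms are the send and receive intervals in that order. The function name is hypothetical.

```python
def missed_heartbeats_tolerated(send_ms: int, recv_ms: int) -> int:
    """Consecutive lost heartbeats the receiver survives (assumed semantics).

    Assumes the receive timer restarts at each received heartbeat and that a
    heartbeat must arrive strictly before recv_ms has elapsed.
    """
    if send_ms <= 0 or recv_ms <= 0:
        raise ValueError("intervals must be positive")
    # Heartbeats are sent at send_ms, 2*send_ms, ... after the last one
    # received; count how many of those fall strictly inside the window.
    sends_within_window = (recv_ms - 1) // send_ms
    # All but the last of them may be lost; the last one resets the timer.
    return max(sends_within_window - 1, 0)


if __name__ == "__main__":
    for label, send_ms, recv_ms in [
        ("BOM.xml defaults", 1000, 3000),
        ("NCS-AVSV-MIB defaults (assumed order)", 300, 2000),
    ]:
        lost = missed_heartbeats_tolerated(send_ms, recv_ms)
        print(f"{label}: tolerates {lost} consecutive lost heartbeat(s)")
```

Under these assumptions the BOM.xml values leave much less margin per timeout window than the MIB values, which matters if a loaded node delays a heartbeat by more than a send interval.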
Regards,
Hans