Hi guys. I've just started working with Pacemaker and have a problem with a monitored service.
I've already configured three active/standby clusters with Pacemaker, running these resources: IPaddr2, the Asterisk daemon, bacula-fd and the SNMP daemon.

The first and second clusters work fine - I haven't noticed any failures. But the third cluster fails too frequently. The layout is the same for all three:

    Master (active) <-corosync-> Slave (standby)

There is no difference between the three clusters apart from server load: the Asterisk running on the third cluster has 500+ customers and processes far more calls than the others. As a result, the cluster periodically decides that Asterisk is not running:

    * ASTERISK_monitor_2000 on node1-Master 'not running' (7): call=85, status=complete, exitreason='none', last-rc-change='Tue Nov 7 15:06:16 2017', queued=0ms, exec=0ms

and restarts it (because of the on-fail=restart parameter on the Asterisk primitive). But Asterisk is in fact working fine; nothing happens to it. I went through the Asterisk full log and found nothing that could explain Pacemaker's behavior.

All machines are virtual (not containers, but Proxmox VMs). They have enough resources - each has 8 cores at 3 GHz and 8 GB RAM. I tried doubling the resources, but it changed nothing, and machine resources don't seem to be the root of the problem anyway: resource monitoring shows the cores are never loaded above 10%.

Configurations.

Corosync config:

totem {
    version: 2
    cluster_name: asterisk
    token: 1000
    token_retransmit: 31
    hold: 31
    token_retransmits_before_loss_const: 0
    clear_node_high_bit: yes
    crypto_cipher: none
    crypto_hash: none
    rrp_mode: active
    transport: udpu
    interface {
        member {
            memberaddr: 10.100.1.1
        }
        member {
            memberaddr: 10.100.1.2
        }
        ringnumber: 0
        bindnetaddr: 10.100.1.1
        mcastport: 5405
        ttl: 1
    }
}

quorum {
    provider: corosync_votequorum
    expected_votes: 2
}

(logging block skipped)
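A side note on the '(7)' in the failure message earlier, as I understand it: Pacemaker monitor operations return OCF codes (0 = running, 7 = not running), and for a systemd: resource the monitor essentially asks systemd for the unit's state, with anything other than "active" being reported as "not running". A rough sketch of that mapping (the helper name is mine, and the exact state handling inside Pacemaker's systemd class is my assumption):

```shell
# Rough sketch of how a systemd: resource monitor result maps to an OCF
# return code. The helper name is hypothetical, just to illustrate why a
# transient non-"active" unit state shows up as rc 7 in the cluster logs.
ocf_rc_for_state() {
    case "$1" in
        active) echo 0 ;;   # OCF_SUCCESS - resource is running
        *)      echo 7 ;;   # OCF_NOT_RUNNING - what I keep seeing
    esac
}

ocf_rc_for_state active       # → 0
ocf_rc_for_state activating   # → 7
```

So even a brief moment in which systemd does not report the unit as "active" would look to Pacemaker exactly like the failures above.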
Pacemaker config:

node 178676749: node1-Master
node 178676750: node2-Slave
primitive ASTERISK systemd:asterisk \
        op monitor interval=2s timeout=30s on-fail=restart \
        op start interval=0 timeout=30s \
        op stop interval=0 timeout=30s \
        meta migration-threshold=2 failure-timeout=1800s target-role=Started
primitive BACULA systemd:bacula-fd \
        op monitor interval=30s timeout=60s on-fail=restart \
        op start interval=0 timeout=30s \
        op stop interval=0 timeout=30s \
        meta migration-threshold=2 failure-timeout=1800s
primitive IPSHARED IPaddr2 \
        params ip=here.my.real.ip.address nic=ens18 cidr_netmask=29 \
        meta migration-threshold=2 target-role=Started \
        op monitor interval=20 timeout=60 on-fail=restart
primitive SNMP systemd:snmpd \
        op monitor interval=30s timeout=60s on-fail=restart \
        op start interval=0 timeout=30s \
        op stop interval=0 timeout=30s \
        meta migration-threshold=2 failure-timeout=1800s target-role=Started
order ASTERISK_AFTER_IPSHARED Mandatory: IPSHARED ASTERISK SNMP
colocation ASTERISK_WITH_IPSHARED inf: ASTERISK IPSHARED
location PREFER_BACULA BACULA 100: node1-Master
location PREFER_MASTER ASTERISK 100: node1-Master
location PREFER_SNMP SNMP 100: node1-Master
property cib-bootstrap-options: \
        cluster-recheck-interval=5s \
        start-failure-is-fatal=false \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        have-watchdog=false \
        dc-version=1.1.16-94ff4df \
        cluster-infrastructure=corosync \
        cluster-name=virtual2

Asterisk systemd unit:

[Unit]
Description=Asterisk

[Service]
ExecStart=/etc/init.d/asterisk start
ExecStop=/etc/init.d/asterisk stop
PIDFile=/var/run/asterisk/asterisk.pid

Corosync log:

Nov 07 15:06:16 [3958] node1-Master crmd: info: process_lrm_event: Result of monitor operation for ASTERISK on node1-Master: 7 (not running) | call=85 key=ASTERISK_monitor_2000 confirmed=false cib-update=106
Nov 07 15:06:16 [3953] node1-Master cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/106)
Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: Diff: --- 0.38.37 2
Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: Diff: +++ 0.38.38 (null)
Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: + /cib: @num_updates=38
Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: + /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/lrm_resource[@id='ASTERISK']/lrm_rsc_op[@id='ASTERISK_last_failure_0']: @transition-key=2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @transition-magic=0:7;2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @call-id=85, @last-rc-change=1510059976
Nov 07 15:06:16 [3953] node1-Master cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node1-Master/crmd/106, version=0.38.38)
Nov 07 15:06:16 [3956] node1-Master attrd: info: attrd_peer_update: Setting fail-count-ASTERISK[node1-Master]: 6 -> 7 from node2-Slave
Nov 07 15:06:16 [3956] node1-Master attrd: info: attrd_peer_update: Setting last-failure-ASTERISK[node1-Master]: 1510059507 -> 1510059976 from node2-Slave
Nov 07 15:06:16 [3955] node1-Master lrmd: info: cancel_recurring_action: Cancelling systemd operation SNMP_status_30000
Nov 07 15:06:16 [3958] node1-Master crmd: info: do_lrm_rsc_op: Performing key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=SNMP_stop_0
Nov 07 15:06:16 [3955] node1-Master lrmd: info: log_execute: executing - rsc:SNMP action:stop call_id:89
Nov 07 15:06:16 [3958] node1-Master crmd: info: process_lrm_event: Result of monitor operation for SNMP on node1-Master: Cancelled | call=87 key=SNMP_monitor_30000 confirmed=true
Nov 07 15:06:16 [3955] node1-Master lrmd: info: systemd_exec_result: Call to stop passed: /org/freedesktop/systemd1/job/10916
Nov 07 15:06:18 [3958] node1-Master crmd: notice: process_lrm_event: Result of stop operation for SNMP on node1-Master: 0 (ok) | call=89 key=SNMP_stop_0 confirmed=true cib-update=107
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/107)
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: Diff: --- 0.38.38 2
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: Diff: +++ 0.38.39 (null)
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: + /cib: @num_updates=39
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: + /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/lrm_resource[@id='SNMP']/lrm_rsc_op[@id='SNMP_last_0']: @operation_key=SNMP_stop_0, @operation=stop, @transition-key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @transition-magic=0:0;10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @call-id=89, @last-run=1510059976, @last-rc-change=1510059976, @exec-time=2047
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node1-Master/crmd/107, version=0.38.39)
Nov 07 15:06:18 [3955] node1-Master lrmd: info: cancel_recurring_action: Cancelling systemd operation ASTERISK_status_2000
Nov 07 15:06:18 [3958] node1-Master crmd: info: do_lrm_rsc_op: Performing key=3:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=ASTERISK_stop_0
Nov 07 15:06:18 [3955] node1-Master lrmd: info: log_execute: executing - rsc:ASTERISK action:stop call_id:91
Nov 07 15:06:18 [3958] node1-Master crmd: info: process_lrm_event: Result of monitor operation for ASTERISK on node1-Master: Cancelled | call=85 key=ASTERISK_monitor_2000 confirmed=true
Nov 07 15:06:18 [3955] node1-Master lrmd: info: systemd_exec_result: Call to stop passed: /org/freedesktop/systemd1/job/10917

Asterisk with the same configuration works fine on a regular (non-clustered) virtual machine with the same resource allocation. So I think the problem lies in the interaction between Pacemaker's monitor operation and the Asterisk daemon - maybe delays or something similar.
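One thing I'm now wondering about is my unit file itself: it runs the SysV init script in ExecStart, but there is no Type= line, so systemd treats the unit as Type=simple and may consider it inactive once the script exits, even though Asterisk itself keeps running in the background. If that's what's happening, something like the following might describe the daemon more accurately. This is only a sketch of what I'd try, not a tested config - it assumes the init script daemonizes Asterisk and writes the pidfile shown:

```ini
# Sketch of a revised asterisk.service - untested, assumes the init
# script forks Asterisk into the background and writes the pidfile.
[Unit]
Description=Asterisk
After=network.target

[Service]
# Tell systemd the started process forks, and which pidfile identifies
# the main process; otherwise the unit's state may not reflect whether
# Asterisk is actually running, and the monitor would see "not running".
Type=forking
ExecStart=/etc/init.d/asterisk start
ExecStop=/etc/init.d/asterisk stop
PIDFile=/var/run/asterisk/asterisk.pid

[Install]
WantedBy=multi-user.target
```

Does that theory make sense, or is the monitor interval=2s simply too aggressive for a loaded systemd resource?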
Thanks in advance for answers/hints.

--
BR, Donat Zenichev
Wnet VoIP team
Tel: +380(44) 5-900-808
http://wnet.ua
_______________________________________________
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org