Hi! It may be worth the time to find out which command is used to monitor the status of asterisk. Then maybe run that command repeatedly in a shell loop to find out what it returns.
I guess some thing is slow to respond; maybe jsut increase the timeout... Regards, Ulrich >>> Donat Zenichev <donat.zenic...@gmail.com> schrieb am 07.11.2017 um 17:07 in Nachricht <canlwqcnvpppm7s7kzh0p42wxyqrqh62wm3vnzf8gea0yuh0...@mail.gmail.com>: > Hi guys. > I've just started to work with pacemaker and have a problem with monitored > service. > > I've already configured three Active/Stand-by clusters with pacemaker. > Running resources: > IPaddr2 > asterisk daemon > bacula fd > snmp daemon > > First and second cluster are working fine - I didn't notice any failures. > But the third cluster fails too frequently. > > Schema is similar for all clusters: > Master (active) <-corosync-> Slave(Stand-by) > > There is no difference between three cluster, apart server loading. > Asterisk that is running on third cluster has 500+ customers and processes > much more calls than others. > > So that, cluster periodically thinks that asterisk is not running: > * ASTERISK_monitor_2000 on node1-Master 'not running' (7): call=85, > status=complete, exitreason='none', > last-rc-change='Tue Nov 7 15:06:16 2017', queued=0ms, exec=0ms > > And restarts it (because on-fail=restart parameter for asterisk primitive). > But indeed asterisk is working fine and nothing happens with him. > I parsed asterisk full log and found nothing, that can explain the behavior > of pacemaker. > > All machines are virtual (not containers, but proxmox VMs). They have > enough resources, each has - 8 cores 3GHz, 8GB ram. > I tried to increase resources on machines - I doubled them up, but it > changed nothing. > And it seemed to be that machine resources are not the root of the problem, > resources monitoring showed that cores are not loaded more than 10%. > > Configurations. > > Corosync config: > totem { > version: 2 > cluster_name: asterisk > > token: 1000 > token_retransmit: 31 > hold: 31 > > token_retransmits_before_loss_const: 0 > > clear_node_high_bit: yes > > crypto_cipher: none > > crypto_hash: none > rrp_mode: active > transport: udpu > > interface { > member { > memberaddr: 10.100.1.1 > } > member { > memberaddr: 10.100.1.2 > } > > ringnumber: 0 > bindnetaddr: 10.100.1.1 > > mcastport: 5405 > ttl: 1 > } > } > > quorum { > provider: corosync_votequorum > expected_votes: 2 > } > > logging block is skipped. > > > > Pacemaker config: > > node 178676749: node1-Master > node 178676750: node2-Slave > primitive ASTERISK systemd:asterisk \ > op monitor interval=2s timeout=30s on-fail=restart \ > op start interval=0 timeout=30s \ > op stop interval=0 timeout=30s \ > meta migration-threshold=2 failure-timeout=1800s target-role=Started > primitive BACULA systemd:bacula-fd \ > op monitor interval=30s timeout=60s on-fail=restart \ > op start interval=0 timeout=30s \ > op stop interval=0 timeout=30s \ > meta migration-threshold=2 failure-timeout=1800s > primitive IPSHARED IPaddr2 \ > params ip=here.my.real.ip.address nic=ens18 cidr_netmask=29 \ > meta migration-threshold=2 target-role=Started \ > op monitor interval=20 timeout=60 on-fail=restart > primitive SNMP systemd:snmpd \ > op monitor interval=30s timeout=60s on-fail=restart \ > op start interval=0 timeout=30s \ > op stop interval=0 timeout=30s \ > meta migration-threshold=2 failure-timeout=1800s target-role=Started > order ASTERISK_AFTER_IPSHARED Mandatory: IPSHARED ASTERISK SNMP > colocation ASTERISK_WITH_IPSHARED inf: ASTERISK IPSHARED > location PREFER_BACULA BACULA 100: node1-Master > location PREFER_MASTER ASTERISK 100: node1-Master > location PREFER_SNMP SNMP 100: node1-Master > property cib-bootstrap-options: \ > cluster-recheck-interval=5s \ > start-failure-is-fatal=false \ > stonith-enabled=false \ > no-quorum-policy=ignore \ > have-watchdog=false \ > dc-version=1.1.16-94ff4df \ > cluster-infrastructure=corosync \ > cluster-name=virtual2 > > > Asterisk systemd config: > > [Unit] > Description=Asterisk > > [Service] > ExecStart=/etc/init.d/asterisk start > ExecStop=/etc/init.d/asterisk stop > PIDFile=/var/run/asterisk/asterisk.pid > > > > Corosync log: > > Nov 07 15:06:16 [3958] node1-Master crmd: info: > process_lrm_event: Result > of monitor operation for ASTERISK on node1-Master: 7 (not running) | > call=85 key=ASTERISK_monitor_2000 confirmed=false cib-update=106 > Nov 07 15:06:16 [3953] node1-Master cib: info: > cib_process_request: Forwarding cib_modify operation for section status to > all (origin=local/crmd/106) > Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: > Diff: > --- 0.38.37 2 > Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: > Diff: > +++ 0.38.38 (null) > Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: + > /cib: @num_updates=38 > Nov 07 15:06:16 [3953] node1-Master cib: info: cib_perform_op: + > /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/l > rm_resource[@id='ASTERISK']/lrm_rsc_op[@id='ASTERISK_last_failure_0']: > @transition-key=2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, > @transition-magic=0:7;2:29672:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, > @call-id=85, @last-rc-change=1510059976 > Nov 07 15:06:16 [3953] node1-Master cib: info: > cib_process_request: Completed cib_modify operation for section status: OK > (rc=0, origin=node1-Master/crmd/106, version=0.38.38) > Nov 07 15:06:16 [3956] node1-Master attrd: info: > attrd_peer_update: Setting > fail-count-ASTERISK[node1-Master]: 6 -> 7 from node2-Slave > Nov 07 15:06:16 [3956] node1-Master attrd: info: > attrd_peer_update: Setting > last-failure-ASTERISK[node1-Master]: 1510059507 -> 1510059976 from > node2-Slave > Nov 07 15:06:16 [3955] node1-Master lrmd: info: > cancel_recurring_action: Cancelling systemd operation SNMP_status_30000 > Nov 07 15:06:16 [3958] node1-Master crmd: info: > do_lrm_rsc_op: Performing > key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=SNMP_stop_0 > Nov 07 15:06:16 [3955] node1-Master lrmd: info: log_execute: > executing > - rsc:SNMP action:stop call_id:89 > Nov 07 15:06:16 [3958] node1-Master crmd: info: > process_lrm_event: Result > of monitor operation for SNMP on node1-Master: Cancelled | call=87 > key=SNMP_monitor_30000 confirmed=true > Nov 07 15:06:16 [3955] node1-Master lrmd: info: > systemd_exec_result: Call to stop passed: > /org/freedesktop/systemd1/job/10916 > Nov 07 15:06:18 [3958] node1-Master crmd: notice: > process_lrm_event: Result > of stop operation for SNMP on node1-Master: 0 (ok) | call=89 > key=SNMP_stop_0 confirmed=true cib-update=107 > Nov 07 15:06:18 [3953] node1-Master cib: info: > cib_process_request: Forwarding cib_modify operation for section status to > all (origin=local/crmd/107) > Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: > Diff: > --- 0.38.38 2 > Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: > Diff: > +++ 0.38.39 (null) > Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: + > /cib: @num_updates=39 > Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: + > /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/l > rm_resource[@id='SNMP']/lrm_rsc_op[@id='SNMP_last_0']: > @operation_key=SNMP_stop_0, @operation=stop, > @transition-key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, > @transition-magic=0:0;10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, > @call-id=89, @last-run=1510059976, @last-rc-change=1510059976, > @exec-time=2047 > Nov 07 15:06:18 [3953] node1-Master cib: info: > cib_process_request: Completed cib_modify operation for section status: OK > (rc=0, origin=node1-Master/crmd/107, version=0.38.39) > Nov 07 15:06:18 [3955] node1-Master lrmd: info: > cancel_recurring_action: Cancelling systemd operation ASTERISK_status_2000 > Nov 07 15:06:18 [3958] node1-Master crmd: info: > do_lrm_rsc_op: Performing > key=3:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=ASTERISK_stop_0 > Nov 07 15:06:18 [3955] node1-Master lrmd: info: log_execute: > executing > - rsc:ASTERISK action:stop call_id:91 > Nov 07 15:06:18 [3958] node1-Master crmd: info: > process_lrm_event: Result > of monitor operation for ASTERISK on node1-Master: Cancelled | call=85 > key=ASTERISK_monitor_2000 confirmed=true > Nov 07 15:06:18 [3955] node1-Master lrmd: info: > systemd_exec_result: Call to stop passed: > /org/freedesktop/systemd1/job/10917 > > > > Asterisk with the same asterisk's configurations works fine on regular > virtual machine (not cluster), with the same resource parameters. > So I think, the problem consists of interaction between asterisk monitor > (pacemaker) function and asterisk daemon. May be delays or something like. > > Thanks in advance for answers/hints. > > > -- > -- > BR, Donat Zenichev > Wnet VoIP team > Tel: +380(44) 5-900-808 > http://wnet.ua _______________________________________________ Users mailing list: Users@clusterlabs.org http://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org