On 08/30/2016 05:18 AM, Dejan Muhamedagic wrote:
> Hi,
>
> On Thu, Aug 18, 2016 at 09:00:24AM -0500, Ken Gaillot wrote:
>> On 08/17/2016 08:17 PM, TEG AMJG wrote:
>>> Hi
>>>
>>> I am having a problem with a simple active/passive cluster, which
>>> consists of the following configuration:
>>>
>>> Cluster Name: kamcluster
>>> Corosync Nodes:
>>>  kam1vs3 kam2vs3
>>> Pacemaker Nodes:
>>>  kam1vs3 kam2vs3
>>>
>>> Resources:
>>>  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
>>>   Attributes: ip=10.0.1.206 cidr_netmask=32
>>>   Operations: start interval=0s timeout=20s (ClusterIP-start-interval-0s)
>>>               stop interval=0s timeout=20s (ClusterIP-stop-interval-0s)
>>>               monitor interval=10s (ClusterIP-monitor-interval-10s)
>>>  Resource: ClusterIP2 (class=ocf provider=heartbeat type=IPaddr2)
>>>   Attributes: ip=10.0.1.207 cidr_netmask=32
>>>   Operations: start interval=0s timeout=20s (ClusterIP2-start-interval-0s)
>>>               stop interval=0s timeout=20s (ClusterIP2-stop-interval-0s)
>>>               monitor interval=10s (ClusterIP2-monitor-interval-10s)
>>>  Resource: rtpproxycluster (class=systemd type=rtpproxy)
>>>   Operations: monitor interval=10s (rtpproxycluster-monitor-interval-10s)
>>>               stop interval=0s on-fail=block (rtpproxycluster-stop-interval-0s)
>>>  Resource: kamailioetcfs (class=ocf provider=heartbeat type=Filesystem)
>>>   Attributes: device=/dev/drbd1 directory=/etc/kamailio fstype=ext4
>>>   Operations: start interval=0s timeout=60 (kamailioetcfs-start-interval-0s)
>>>               monitor interval=10s on-fail=fence (kamailioetcfs-monitor-interval-10s)
>>>               stop interval=0s on-fail=fence (kamailioetcfs-stop-interval-0s)
>>>  Clone: fence_kam2_xvm-clone
>>>   Meta Attrs: interleave=true clone-max=2 clone-node-max=1
>>>   Resource: fence_kam2_xvm (class=stonith type=fence_xvm)
>>>    Attributes: port=tegamjg_kam2 pcmk_host_list=kam2vs3
>>>    Operations: monitor interval=60s (fence_kam2_xvm-monitor-interval-60s)
>>>  Master: kamailioetcclone
>>>   Meta Attrs: master-max=1 master-node-max=1 clone-max=2 clone-node-max=1
>>>               notify=true on-fail=fence
>>>   Resource: kamailioetc (class=ocf provider=linbit type=drbd)
>>>    Attributes: drbd_resource=kamailioetc
>>>    Operations: start interval=0s timeout=240 (kamailioetc-start-interval-0s)
>>>                promote interval=0s on-fail=fence (kamailioetc-promote-interval-0s)
>>>                demote interval=0s on-fail=fence (kamailioetc-demote-interval-0s)
>>>                stop interval=0s on-fail=fence (kamailioetc-stop-interval-0s)
>>>                monitor interval=10s (kamailioetc-monitor-interval-10s)
>>>  Clone: fence_kam1_xvm-clone
>>>   Meta Attrs: interleave=true clone-max=2 clone-node-max=1
>>>   Resource: fence_kam1_xvm (class=stonith type=fence_xvm)
>>>    Attributes: port=tegamjg_kam1 pcmk_host_list=kam1vs3
>>>    Operations: monitor interval=60s (fence_kam1_xvm-monitor-interval-60s)
>>>  Resource: kamailiocluster (class=ocf provider=heartbeat type=kamailio)
>>>   Attributes: listen_address=10.0.1.206 conffile=/etc/kamailio/kamailio.cfg
>>>               pidfile=/var/run/kamailio.pid monitoring_ip=10.0.1.206
>>>               monitoring_ip2=10.0.1.207 port=5060 proto=udp
>>>               kamctlrc=/etc/kamailio/kamctlrc shmem=128 pkg=8
>>>   Meta Attrs: target-role=Stopped
>>>   Operations: start interval=0s timeout=60 (kamailiocluster-start-interval-0s)
>>>               stop interval=0s timeout=30 (kamailiocluster-stop-interval-0s)
>>>               monitor interval=5s (kamailiocluster-monitor-interval-5s)
>>>
>>> Stonith Devices:
>>> Fencing Levels:
>>>
>>> Location Constraints:
>>> Ordering Constraints:
>>>   start fence_kam1_xvm-clone then start fence_kam2_xvm-clone
>>>     (kind:Mandatory) (id:order-fence_kam1_xvm-clone-fence_kam2_xvm-clone-mandatory)
>>>   start fence_kam2_xvm-clone then promote kamailioetcclone
>>>     (kind:Mandatory) (id:order-fence_kam2_xvm-clone-kamailioetcclone-mandatory)
>>>   promote kamailioetcclone then start kamailioetcfs
>>>     (kind:Optional) (id:order-kamailioetcclone-kamailioetcfs-Optional)
>>>   Resource Sets:
>>>     set kamailioetcfs sequential=true (id:pcs_rsc_set_kamailioetcfs)
>>>     set ClusterIP ClusterIP2 sequential=false (id:pcs_rsc_set_ClusterIP_ClusterIP2)
>>>     set rtpproxycluster kamailiocluster sequential=true (id:pcs_rsc_set_rtpproxycluster_kamailiocluster)
>>>     (id:pcs_rsc_order_set_kamailioetcfs_set_ClusterIP_ClusterIP2_set_rtpproxycluster_kamailiocluster)
>>> Colocation Constraints:
>>>   rtpproxycluster with ClusterIP2 (score:INFINITY) (id:colocation-rtpproxycluster-ClusterIP2-INFINITY)
>>>   ClusterIP2 with ClusterIP (score:INFINITY) (id:colocation-ClusterIP2-ClusterIP-INFINITY)
>>>   ClusterIP with kamailioetcfs (score:INFINITY) (id:colocation-ClusterIP-kamailioetcfs-INFINITY)
>>>   kamailioetcfs with kamailioetcclone (score:INFINITY) (with-rsc-role:Master) (id:colocation-kamailioetcfs-kamailioetcclone-INFINITY)
>>>   fence_kam2_xvm-clone with fence_kam1_xvm-clone (score:INFINITY) (id:colocation-fence_kam2_xvm-clone-fence_kam1_xvm-clone-INFINITY)
>>>   kamailioetcclone with fence_kam2_xvm-clone (score:INFINITY) (id:colocation-kamailioetcclone-fence_kam2_xvm-clone-INFINITY)
>>>   kamailiocluster with rtpproxycluster (score:INFINITY) (id:colocation-kamailiocluster-rtpproxycluster-INFINITY)
>>>
>>> Resources Defaults:
>>>  migration-threshold: 2
>>>  failure-timeout: 10m
>>>  resource-stickiness: 200
>>> Operations Defaults:
>>>  No defaults set
>>>
>>> Cluster Properties:
>>>  cluster-infrastructure: corosync
>>>  cluster-name: kamcluster
>>>  dc-version: 1.1.13-10.el7_2.2-44eb2dd
>>>  have-watchdog: false
>>>  last-lrm-refresh: 1471479940
>>>  no-quorum-policy: ignore
>>>  start-failure-is-fatal: false
>>>  stonith-action: reboot
>>>  stonith-enabled: true
>>>
>>> Now my problem is that when the rtpproxy systemd resource starts/stops,
>>> I am experiencing a 2-second delay between when the start/stop request
>>> is sent to systemd and when the action is confirmed and reported to the
>>> crmd. I am pretty sure that the initiation of this service doesn't
>>
>> The problem is that systemd does not have a way to say "start this
>> service, and don't tell me it's done until it's completely started".
>> Systemd reports successful completion as soon as the start has been
>> initiated.
>
> The same holds for LSB init scripts. However, pacemaker didn't
> make any effort to check whether the service (typically some kind
> of daemon) was actually started/stopped.
>
>> By contrast, Pacemaker can't proceed with dependencies until the service
>> is completely up. So, Pacemaker's workaround is to initiate the start,
>> then poll the status every 2 seconds until systemd reports the service
>> is completely up.
>>
>> The 2 seconds is hardcoded, so there's no way to avoid that at the
>> moment, other than using an OCF agent.
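
To make that sequence concrete, here is a minimal sketch of the
start-then-poll pattern in Python. It is illustrative only -- the real
logic is C code inside Pacemaker's systemd support, which talks to
systemd over DBus rather than shelling out to systemctl, and none of the
names below come from the actual source:

import subprocess
import time

POLL_INTERVAL = 2  # seconds; the hardcoded interval discussed above

def start_and_confirm(unit, timeout=60):
    # systemd acknowledges this as soon as the start job is queued,
    # not when the daemon is actually fully up.
    subprocess.run(["systemctl", "start", unit], check=True)

    deadline = time.time() + timeout
    while time.time() < deadline:
        # Wait one poll interval, then re-check the unit's state.
        time.sleep(POLL_INTERVAL)
        state = subprocess.run(["systemctl", "is-active", "--quiet", unit])
        if state.returncode == 0:
            return True  # only now is "start OK" reported to the crmd
    return False

Because success is only ever observed on a poll boundary, even a service
that comes up almost instantly is not confirmed for roughly 2 seconds,
which matches the exec-time=2116 and exec-time=2068 values in the logs
below.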

> So, is this only for systemd resources or also LSB? Do you know
> when the 2 second delay got introduced?

Only systemd and nagios. It was introduced in commit c78b4b8, released
with 1.1.12.

Coincidentally, we're currently looking at using systemd's DBus
signaling, rather than the 2-second polling, to determine when the
action has taken effect. That's still in the early stages, but
hopefully it will be a better approach.
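
The rough idea looks like this in Python with dbus-python (just a
sketch of the concept, not the actual patch under discussion; the
rtpproxy unit from this thread is used as the example). systemd emits a
JobRemoved signal on its Manager interface when a queued job finishes,
so the executor can wait for that event instead of polling:

import dbus
import dbus.mainloop.glib
from gi.repository import GLib

# Route DBus signals through a GLib main loop.
dbus.mainloop.glib.DBusGMainLoop(set_as_default=True)
bus = dbus.SystemBus()
manager = dbus.Interface(
    bus.get_object("org.freedesktop.systemd1", "/org/freedesktop/systemd1"),
    "org.freedesktop.systemd1.Manager")

loop = GLib.MainLoop()

def on_job_removed(job_id, job_path, unit, result):
    # Called when systemd finishes a job -- no 2-second polling involved.
    if unit == "rtpproxy.service":
        print("start of %s finished: %s" % (unit, result))
        loop.quit()

bus.add_signal_receiver(on_job_removed,
                        signal_name="JobRemoved",
                        dbus_interface="org.freedesktop.systemd1.Manager")
manager.Subscribe()  # systemd only emits signals to subscribed clients
manager.StartUnit("rtpproxy.service", "replace")
loop.run()

With something like that, the confirmation arrives as soon as systemd
considers the job complete, so the artificial 2-second floor should go
away.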

> Thanks,
>
> Dejan
>
>>> take even half of that time to start/stop. I did the same thing with a
>>> dummy service as a systemd resource, and it also takes 2 seconds to
>>> start/stop the resource. Also, this is not happening with any of the
>>> OCF resources. Here is what I see in the logs about this behaviour.
>>>
>>> This is after doing "pcs resource restart rtpproxycluster":
>>>
>>> Aug 17 20:59:18 [13187] kam1 crmd: notice: te_rsc_command: Initiating action 14: stop rtpproxycluster_stop_0 on kam1vs3 (local)
>>> Aug 17 20:59:18 [13184] kam1 lrmd: info: cancel_recurring_action: Cancelling systemd operation rtpproxycluster_status_10000
>>> Aug 17 20:59:18 [13187] kam1 crmd: info: do_lrm_rsc_op: Performing key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b op=rtpproxycluster_stop_0
>>> Aug 17 20:59:18 [13184] kam1 lrmd: info: log_execute: executing - rsc:rtpproxycluster action:stop call_id:106
>>> Aug 17 20:59:18 [13187] kam1 crmd: info: process_lrm_event: Operation rtpproxycluster_monitor_10000: Cancelled (node=kam1vs3, call=104, confirmed=true)
>>> Aug 17 20:59:18 [13184] kam1 lrmd: info: systemd_exec_result: Call to stop passed: /org/freedesktop/systemd1/job/8302
>>> Aug 17 20:59:20 [13187] kam1 crmd: notice: process_lrm_event: Operation rtpproxycluster_stop_0: ok (node=kam1vs3, call=106, rc=0, cib-update=134, confirmed=true)
>>> Aug 17 20:59:20 [13182] kam1 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']: @operation_key=rtpproxycluster_stop_0, @operation=stop, @transition-key=14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b, @transition-magic=0:0;14:30:0:8a202722-ece2-4617-b26e-8d4aa5f3522b, @call-id=106, @last-run=1471481958, @last-rc-change=1471481958, @exec-time=2116
>>>
>>> Aug 17 20:59:20 [13186] kam1 pengine: info: RecurringOp: Start recurring monitor (10s) for rtpproxycluster on kam1vs3
>>> Aug 17 20:59:20 [13186] kam1 pengine: notice: LogActions: Start rtpproxycluster (kam1vs3)
>>> Aug 17 20:59:20 [13187] kam1 crmd: notice: te_rsc_command: Initiating action 13: start rtpproxycluster_start_0 on kam1vs3 (local)
>>> Aug 17 20:59:20 [13187] kam1 crmd: info: do_lrm_rsc_op: Performing key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b op=rtpproxycluster_start_0
>>> Aug 17 20:59:20 [13184] kam1 lrmd: info: log_execute: executing - rsc:rtpproxycluster action:start call_id:107
>>> Aug 17 20:59:21 [13184] kam1 lrmd: info: systemd_exec_result: Call to start passed: /org/freedesktop/systemd1/job/8303
>>> Aug 17 20:59:23 [13187] kam1 crmd: notice: process_lrm_event: Operation rtpproxycluster_start_0: ok (node=kam1vs3, call=107, rc=0, cib-update=136, confirmed=true)
>>> Aug 17 20:59:23 [13182] kam1 cib: info: cib_perform_op: + /cib/status/node_state[@id='1']/lrm[@id='1']/lrm_resources/lrm_resource[@id='rtpproxycluster']/lrm_rsc_op[@id='rtpproxycluster_last_0']: @operation_key=rtpproxycluster_start_0, @operation=start, @transition-key=13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b, @transition-magic=0:0;13:31:0:8a202722-ece2-4617-b26e-8d4aa5f3522b, @call-id=107, @last-run=1471481960, @last-rc-change=1471481960, @exec-time=2068
>>>
>>> Why is this happening? Many of my resources depend on each other, so
>>> the delay this adds to a failover is considerable (4 seconds of delay
>>> just to stop and start that one systemd resource).
>>>
>>> Is this normal? Is there any way to solve it? Do I really need to
>>> write an OCF resource agent for that service (which would take some
>>> time that I don't really have at this moment)?
>>>
>>> Regards
>>>
>>> Alejandro

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org