Re: [ClusterLabs] clearing failed actions
Hi Ken,

> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, May 30, 2017 4:32 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
>
> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > Hi,
> >
> > Shouldn't the
> >
> >     cluster-recheck-interval="2m"
> >
> > property instruct pacemaker to recheck the cluster every 2 minutes and
> > clean the failcounts?
>
> It instructs pacemaker to recalculate whether any actions need to be
> taken (including expiring any failcounts appropriately).
>
> > At the primitive level I also have a
> >
> >     migration-threshold="30" failure-timeout="2m"
> >
> > but whenever I have a failure, it remains there forever.
> >
> > What could be causing this?
> >
> > thanks,
> > Attila
>
> Is it a single old failure, or a recurring failure? The failure timeout
> works in a somewhat nonintuitive way. Old failures are not individually
> expired. Instead, all failures of a resource are simultaneously cleared
> if all of them are older than the failure-timeout. So if something keeps
> failing repeatedly (more frequently than the failure-timeout), none of
> the failures will be cleared.
>
> If it's not a repeating failure, something odd is going on.

It is not a repeating failure. Let's say that a resource fails for whatever
action; it will remain in the failed actions (crm_mon -Af) until I issue a
"crm resource cleanup ". It stays there even after days or weeks, even
though I see in the logs that the cluster is rechecked every 120 seconds.

How could I troubleshoot this issue?

thanks!
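A few commands that may help narrow it down (a sketch only; <resource> and
<node> are placeholders for the real names, and exact option spellings can
vary slightly between Pacemaker 1.1.x releases):

    # show failcounts as crm_mon tracks them
    crm_mon -1 --failcounts

    # inspect the raw fail-count / last-failure transient attributes in
    # the status section of the CIB
    crm_attribute --type status --node <node> --name fail-count-<resource> --query
    crm_attribute --type status --node <node> --name last-failure-<resource> --query

If last-failure is old but the failcount never expires, comparing its
timestamp against the recheck messages in the log should at least show
whether the policy engine is considering the failure at all.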
[ClusterLabs] Cloned IP not moving back after node restart or standby
Hi. I'm trying to set up a 2-node corosync+pacemaker cluster to function as
an active-active setup for nginx with a shared IP.

I've discovered (much to my disappointment) that every time I restart one
node or put it in standby, the second instance of the cloned IP gets moved
to the first node and doesn't go back once the second node is available,
even though I have set stickiness to 0.

[upr@webdemo3 ~]$ sudo pcs status
Cluster name: webdemo_cluster2
Stack: corosync
Current DC: webdemo3 (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Tue May 30 18:40:18 2017
Last change: Tue May 30 17:56:24 2017 by hacluster via crmd on webdemo4

2 nodes and 4 resources configured

Online: [ webdemo3 webdemo4 ]

Full list of resources:

 Clone Set: ha-ip-clone [ha-ip] (unique)
     ha-ip:0    (ocf::heartbeat:IPaddr2):       Started webdemo3
     ha-ip:1    (ocf::heartbeat:IPaddr2):       Started webdemo3
 Clone Set: ha-nginx-clone [ha-nginx] (unique)
     ha-nginx:0 (ocf::heartbeat:nginx): Started webdemo3
     ha-nginx:1 (ocf::heartbeat:nginx): Started webdemo4

Failed Actions:
* ha-nginx:0_monitor_2 on webdemo3 'not running' (7): call=108, status=complete,
  exitreason='none', last-rc-change='Tue May 30 17:56:46 2017', queued=0ms, exec=0ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[upr@webdemo3 ~]$ sudo pcs config --full
Cluster Name: webdemo_cluster2
Corosync Nodes:
 webdemo3 webdemo4
Pacemaker Nodes:
 webdemo3 webdemo4

Resources:
 Clone: ha-ip-clone
  Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true *stickiness=0*
  Resource: ha-ip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.75.39.235 cidr_netmask=24 clusterip_hash=sourceip
   Operations: start interval=0s timeout=20s (ha-ip-start-interval-0s)
               stop interval=0s timeout=20s (ha-ip-stop-interval-0s)
               monitor interval=10s timeout=20s (ha-ip-monitor-interval-10s)
 Clone: ha-nginx-clone
  Meta Attrs: globally-unique=true clone-node-max=1
  Resource: ha-nginx (class=ocf provider=heartbeat type=nginx)
   Operations: start interval=0s timeout=60s (ha-nginx-start-interval-0s)
               stop interval=0s timeout=60s (ha-nginx-stop-interval-0s)
               monitor interval=20s timeout=30s (ha-nginx-monitor-interval-20s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:
  ha-ip-clone with ha-nginx-clone (score:INFINITY) (id:colocation-ha-ip-ha-nginx-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: webdemo_cluster2
 dc-version: 1.1.15-11.el7_3.4-e174ec8
 have-watchdog: false
 last-lrm-refresh: 1496159785
 no-quorum-policy: ignore
 stonith-enabled: false

Quorum:
  Options:

Am I doing something incorrectly?

Additionally, I'd like to know the difference between these commands:

sudo pcs resource update ha-ip-clone stickiness=0
sudo pcs resource meta ha-ip-clone resource-stickiness=0

They seem to set the same thing, but there might be a subtle difference.

--
Best Regards
Przemysław Kulczycki
System administrator
Avaleo
Email: u...@avaleo.net
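For what it's worth, judging from the "Meta Attrs" line in the pcs config
above, the "pcs resource update ha-ip-clone stickiness=0" form appears to
have stored a clone meta attribute literally named "stickiness", which
Pacemaker does not evaluate; the attribute Pacemaker actually reads is
"resource-stickiness". A sketch of setting and verifying the recognized
name (assuming pcs 0.9.x behaviour, where an empty value removes an
attribute):

    # set the meta attribute Pacemaker actually evaluates
    sudo pcs resource meta ha-ip-clone resource-stickiness=0

    # drop the unrecognized "stickiness" entry (empty value removes it)
    sudo pcs resource meta ha-ip-clone stickiness=

    # check what ended up in the CIB
    sudo pcs resource show ha-ip-clone

Whether that alone makes the second ha-ip instance move back I can't say
for certain, but at least the stickiness value in effect will then be the
one Pacemaker considers rather than the 100 from the resource defaults.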
Re: [ClusterLabs] clearing failed actions
On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> Hi,
>
> Shouldn’t the
>
>     cluster-recheck-interval="2m"
>
> property instruct pacemaker to recheck the cluster every 2 minutes and
> clean the failcounts?

It instructs pacemaker to recalculate whether any actions need to be
taken (including expiring any failcounts appropriately).

> At the primitive level I also have a
>
>     migration-threshold="30" failure-timeout="2m"
>
> but whenever I have a failure, it remains there forever.
>
> What could be causing this?
>
> thanks,
> Attila

Is it a single old failure, or a recurring failure? The failure timeout
works in a somewhat nonintuitive way. Old failures are not individually
expired. Instead, all failures of a resource are simultaneously cleared
if all of them are older than the failure-timeout. So if something keeps
failing repeatedly (more frequently than the failure-timeout), none of
the failures will be cleared.

If it's not a repeating failure, something odd is going on.
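For reference, a minimal sketch of where those two knobs sit (the resource
name and agent below are placeholders, not taken from the original
configuration):

    # cluster-wide: how often the policy engine re-evaluates on a timer
    crm configure property cluster-recheck-interval="2m"

    # per-resource: move away after 30 failures, and expire the failure
    # history only once *all* recorded failures are older than 2 minutes
    crm configure primitive p_example ocf:pacemaker:Dummy \
        meta migration-threshold="30" failure-timeout="2m" \
        op monitor interval="10s"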
[ClusterLabs] clearing failed actions
Hi,

Shouldn't the

    cluster-recheck-interval="2m"

property instruct pacemaker to recheck the cluster every 2 minutes and
clean the failcounts?

At the primitive level I also have a

    migration-threshold="30" failure-timeout="2m"

but whenever I have a failure, it remains there forever.

What could be causing this?

thanks,
Attila
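One quick sanity check is to confirm those values are actually present in
the live CIB (a sketch; <resource> is a placeholder for the real primitive
name):

    # the cluster property, as the policy engine sees it
    crm_attribute --type crm_config --name cluster-recheck-interval --query

    # the meta attributes on the primitive itself
    crm_resource --resource <resource> --meta --get-parameter failure-timeout
    crm_resource --resource <resource> --meta --get-parameter migration-threshold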
[ClusterLabs] pacemaker crm config iSCSITarget lio config
Hello everybody,

I was switching from tgtd to lio as iscsi target and hitting some issues.
I have to manually run the following targetcli command after the pacemaker
resources are started successfully:

/iscsi/iqn.20iscsi0/tpgt1> luns/ create /backstores/iblock/iscsi0_lun0

to get the mappings right... What is wrong with my pacemaker config? See my
attachment or http://paste.debian.net/hidden/d5523f15/

Also, changing lun="1" to lun="0" in the iSCSILogicalUnit did not seem to
change the workings.

Kind regards,

Jelle de Jong

root@godfrey:~# crm configure show
node finley \
        attributes standby="on"
node godfrey \
        attributes standby="off"
primitive drbd_r0 ocf:linbit:drbd \
        params drbd_resource="r0" \
        op start interval="0" timeout="240s" \
        op stop interval="0" timeout="100s" \
        op monitor interval="15s" timeout="30s"
primitive ip_virtual0 ocf:heartbeat:IPaddr2 \
        params nic="bond0" ip="192.168.24.40" cidr_netmask="32" \
        op monitor interval="10s"
primitive ip_virtual1 ocf:heartbeat:IPaddr2 \
        params nic="br0" ip="192.168.35.3" cidr_netmask="32" \
        op monitor interval="10s"
primitive iscsi0_lun0 ocf:heartbeat:iSCSILogicalUnit \
        params implementation="lio" target_iqn="iqn.2011-04.nl.powercraft:storage.iscsi0" lun="0" path="/dev/drbd0" \
        op stop interval="0" timeout="90s" \
        op monitor interval="10s"
primitive iscsi0_target ocf:heartbeat:iSCSITarget \
        params implementation="lio" iqn="iqn.2011-04.nl.powercraft:storage.iscsi0" allowed_initiators="iqn.1993-08.org.debian:01:551e12e22568 iqn.1993-08.org.debian:01:9753b4d3302c iqn.1993-08.org.debian:01:7786972a49ce" \
        op stop interval="0" timeout="120s" \
        op monitor interval="10s"
primitive ping_nodes ocf:pacemaker:ping \
        params host_list="192.168.24.1 192.168.24.17 192.168.24.18" multiplier="10" attempts="1" dampen="1" timeout="1" \
        op start interval="0" timeout="60" \
        op monitor interval="1" timeout="60"
group rg_iscsi iscsi0_target iscsi0_lun0 ip_virtual0 ip_virtual1
ms ms_drbd_r0 drbd_r0 \
        meta clone-max="2" clone-node-max="1" master-max="1" master-node-max="1" notify="true" target-role="Master"
clone ping_clone ping_nodes
location drbd-fence-by-handler-r0-ms_drbd_r0 ms_drbd_r0 \
        rule $id="drbd-fence-by-handler-r0-rule-ms_drbd_r0" $role="Master" -inf: #uname ne godfrey
# location drbd_r0-not-on-ebony ms_drbd_r0 rule -inf: #uname eq ebony
# location drbd_r1-not-on-ebony ms_drbd_r1 rule -inf: #uname eq ebony
# location drbd_r2-not-on-ebony ms_drbd_r2 rule -inf: #uname eq ebony
location drbd_r0-master-on-active-network ms_drbd_r0 \
        rule $id="drbd_r0-master-on-active-network-rule" $role="Master" -inf: not_defined pingd or pingd lte 0
location iscsi-on-active-network rg_iscsi \
        rule $id="iscsi-on-active-network-rule" -inf: not_defined pingd or pingd lte 0
colocation iscsi-with-drbd_r0-master inf: rg_iscsi ms_drbd_r0:Master
order iscsi-after-drbd_r0-promote inf: ms_drbd_r0:promote rg_iscsi:start
property $id="cib-bootstrap-options" \
        no-quorum-policy="ignore" \
        stonith-enabled="false" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        last-lrm-refresh="1496135114"
rsc_defaults $id="rsc-options" \
        resource-stickiness="200"

root@godfrey:~# targetcli
Welcome to the targetcli shell:

 Copyright (c) 2011 by RisingTide Systems LLC.
 Visit us at http://www.risingtidesystems.com.

Using loopback fabric module.
Using iscsi fabric module.
Using qla2xxx fabric module.
Using ib_srpt fabric module.
Using tcm_fc fabric module.
/> ls
o- / ................................................. [...]
  o- backstores ...................................... [...]
  | o- fileio ........................... [0 Storage Object]
  | o- iblock ........................... [1 Storage Object]
  | | o- iscsi0_lun0 ................ [/dev/drbd0 activated]
  | o- pscsi ............................ [0 Storage Object]
  | o- rd_dr ............................ [0 Storage Object]
  | o- rd_mcp ........................... [0 Storage Object]
  o- ib_srpt .................................... [0 Target]
  o- iscsi ...................................... [1 Target]
  | o-
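Two things that can make this easier to debug (a sketch; the IQN and device
path are copied from the configuration above, while the tpgt number 1 and
the agent path are assumptions): run the iSCSILogicalUnit agent by hand with
the same parameters Pacemaker passes, and then look at what LIO actually has
in configfs.

    # run the resource agent directly, outside pacemaker
    OCF_ROOT=/usr/lib/ocf \
    OCF_RESKEY_implementation="lio" \
    OCF_RESKEY_target_iqn="iqn.2011-04.nl.powercraft:storage.iscsi0" \
    OCF_RESKEY_lun="0" \
    OCF_RESKEY_path="/dev/drbd0" \
    /usr/lib/ocf/resource.d/heartbeat/iSCSILogicalUnit start; echo "exit: $?"

    # see which LUNs LIO has mapped into the target portal group
    ls /sys/kernel/config/target/iscsi/iqn.2011-04.nl.powercraft:storage.iscsi0/tpgt_1/lun/

If the lun directory exists but nothing shows up under the per-initiator
acls/ directories, the LUN was created but never mapped to the initiator
ACLs, which would match the symptom of having to run "luns/ create" by hand.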