Re: [ClusterLabs] clearing failed actions

2017-05-30 Thread Attila Megyeri
Hi Ken,


> -----Original Message-----
> From: Ken Gaillot [mailto:kgail...@redhat.com]
> Sent: Tuesday, May 30, 2017 4:32 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] clearing failed actions
> 
> On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> > Hi,
> >
> >
> >
> > Shouldn't the
> >
> >
> >
> > cluster-recheck-interval="2m"
> >
> >
> >
> > property instruct pacemaker to recheck the cluster every 2 minutes and
> > clean the failcounts?
> 
> It instructs pacemaker to recalculate whether any actions need to be
> taken (including expiring any failcounts appropriately).
> 
> > At the primitive level I also have a
> >
> >
> >
> > migration-threshold="30" failure-timeout="2m"
> >
> >
> >
> > but whenever I have a failure, it remains there forever.
> >
> >
> >
> >
> >
> > What could be causing this?
> >
> >
> >
> > thanks,
> >
> > Attila
> Is it a single old failure, or a recurring failure? The failure timeout
> works in a somewhat nonintuitive way. Old failures are not individually
> expired. Instead, all failures of a resource are simultaneously cleared
> if all of them are older than the failure-timeout. So if something keeps
> failing repeatedly (more frequently than the failure-timeout), none of
> the failures will be cleared.
> 
> If it's not a repeating failure, something odd is going on.

It is not a repeating failure. Let's say a resource fails for whatever 
action; it will remain in the failed actions list (crm_mon -Af) until I issue a "crm 
resource cleanup ". It stays there even after days or weeks, even though I see 
in the logs that the cluster is rechecked every 120 seconds.

How could I troubleshoot this issue?
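
In case it is useful, these are the kinds of commands I can run to dig further;
treat them as a sketch, and the resource name below is only a placeholder:

  # fail counts plus the time of the last failure, per resource
  crm_mon -Af

  # raw fail-count-* / last-failure-* node attributes in the status section
  cibadmin -Q -o status | grep -E 'fail-count|last-failure'

  # manual cleanup, which does clear the entry
  crm_resource --cleanup --resource <resource>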

thanks!





[ClusterLabs] Cloned IP not moving back after node restart or standby

2017-05-30 Thread Przemyslaw Kulczycki
Hi.
I'm trying to set up a 2-node corosync+pacemaker cluster to function as an
active-active configuration for nginx with a shared IP.

I've discovered (much to my disappointment) that every time I restart one
node or put it in standby, the second instance of the cloned IP gets moved
to the first node and doesn't go back once the second node is available,
even though I have set stickiness to 0.
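
In case it helps, a check along these lines should show the allocation scores
the policy engine is actually working from, including whatever stickiness it
thinks applies (just a sketch; run on any live node):

  # print the scores for every resource/node pair against the live CIB
  sudo crm_simulate --show-scores --live-check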

[upr@webdemo3 ~]$ sudo pcs status
Cluster name: webdemo_cluster2
Stack: corosync
Current DC: webdemo3 (version 1.1.15-11.el7_3.4-e174ec8) - partition with
quorum
Last updated: Tue May 30 18:40:18 2017          Last change: Tue May 30 17:56:24 2017 by hacluster via crmd on webdemo4

2 nodes and 4 resources configured

Online: [ webdemo3 webdemo4 ]

Full list of resources:

 Clone Set: ha-ip-clone [ha-ip] (unique)
 ha-ip:0(ocf::heartbeat:IPaddr2):   Started webdemo3
 ha-ip:1(ocf::heartbeat:IPaddr2):   Started webdemo3
 Clone Set: ha-nginx-clone [ha-nginx] (unique)
 ha-nginx:0 (ocf::heartbeat:nginx): Started webdemo3
 ha-nginx:1 (ocf::heartbeat:nginx): Started webdemo4

Failed Actions:
* ha-nginx:0_monitor_2 on webdemo3 'not running' (7): call=108,
status=complete, exitreason='none',
last-rc-change='Tue May 30 17:56:46 2017', queued=0ms, exec=0ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

[upr@webdemo3 ~]$ sudo pcs config --full
Cluster Name: webdemo_cluster2
Corosync Nodes:
 webdemo3 webdemo4
Pacemaker Nodes:
 webdemo3 webdemo4

Resources:
 Clone: ha-ip-clone
  Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true *stickiness=0*
  Resource: ha-ip (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.75.39.235 cidr_netmask=24 clusterip_hash=sourceip
   Operations: start interval=0s timeout=20s (ha-ip-start-interval-0s)
               stop interval=0s timeout=20s (ha-ip-stop-interval-0s)
               monitor interval=10s timeout=20s (ha-ip-monitor-interval-10s)
 Clone: ha-nginx-clone
  Meta Attrs: globally-unique=true clone-node-max=1
  Resource: ha-nginx (class=ocf provider=heartbeat type=nginx)
   Operations: start interval=0s timeout=60s (ha-nginx-start-interval-0s)
               stop interval=0s timeout=60s (ha-nginx-stop-interval-0s)
               monitor interval=20s timeout=30s (ha-nginx-monitor-interval-20s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
Colocation Constraints:
  ha-ip-clone with ha-nginx-clone (score:INFINITY) (id:colocation-ha-ip-ha-nginx-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 resource-stickiness: 100
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: webdemo_cluster2
 dc-version: 1.1.15-11.el7_3.4-e174ec8
 have-watchdog: false
 last-lrm-refresh: 1496159785
 no-quorum-policy: ignore
 stonith-enabled: false

Quorum:
  Options:

Am I doing something incorrectly?

Additionally, I'd like to know what the difference is between these commands:

sudo pcs resource update ha-ip-clone stickiness=0

sudo pcs resource meta ha-ip-clone resource-stickiness=0

They seem to set the same thing, but there might be a subtle difference.
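
For comparison, one way to see what each command actually writes is to inspect
the result afterwards (a sketch; the resource id is the one from the config
above):

  # pcs view of the clone and its attributes
  sudo pcs resource show ha-ip-clone

  # raw CIB XML, where the meta_attributes vs. instance_attributes
  # distinction is visible
  sudo cibadmin -Q -o resources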

-- 
Best Regards

Przemysław Kulczycki
System administrator
Avaleo

Email: u...@avaleo.net


Re: [ClusterLabs] clearing failed actions

2017-05-30 Thread Ken Gaillot
On 05/30/2017 09:13 AM, Attila Megyeri wrote:
> Hi,
> 
>  
> 
> Shouldn’t the 
> 
>  
> 
> cluster-recheck-interval="2m"
> 
>  
> 
> property instruct pacemaker to recheck the cluster every 2 minutes and
> clean the failcounts?

It instructs pacemaker to recalculate whether any actions need to be
taken (including expiring any failcounts appropriately).
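
For reference, a minimal sketch of how the two settings fit together in crm
shell syntax (the primitive name and agent here are placeholders, not taken
from your configuration):

  # cluster-wide: how often the policy engine re-evaluates on its own
  crm configure property cluster-recheck-interval="2m"

  # per-resource: when accumulated failures are allowed to expire
  crm configure primitive my_rsc ocf:heartbeat:Dummy \
      meta migration-threshold="30" failure-timeout="2m" \
      op monitor interval="10s"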

> At the primitive level I also have a
> 
>  
> 
> migration-threshold="30" failure-timeout="2m"
> 
>  
> 
> but whenever I have a failure, it remains there forever.
> 
>  
> 
>  
> 
> What could be causing this?
> 
>  
> 
> thanks,
> 
> Attila
Is it a single old failure, or a recurring failure? The failure timeout
works in a somewhat nonintuitive way. Old failures are not individually
expired. Instead, all failures of a resource are simultaneously cleared
if all of them are older than the failure-timeout. So if something keeps
failing repeatedly (more frequently than the failure-timeout), none of
the failures will be cleared.

If it's not a repeating failure, something odd is going on.
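
A quick way to tell which case applies (a sketch; adjust the resource name):

  # per-resource fail counts and the time of the most recent failure
  crm_mon -Af

  # last-failure-* is bumped on every new failure; if its timestamp keeps
  # moving forward, the failure-timeout window never gets a chance to elapse
  cibadmin -Q -o status | grep last-failure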



[ClusterLabs] clearing failed actions

2017-05-30 Thread Attila Megyeri
Hi,

Shouldn't the

cluster-recheck-interval="2m"

property instruct pacemaker to recheck the cluster every 2 minutes and clean 
the failcounts?

At the primitive level I also have a

migration-threshold="30" failure-timeout="2m"

but whenever I have a failure, it remains there forever.


What could be causing this?

thanks,
Attila


[ClusterLabs] pacemaker crm config iSCSITarget lio config

2017-05-30 Thread Jelle de Jong

Hello everybody,

I was switching from tgtd to lio as the iSCSI target and I am hitting some issues.

I have to manually run the targetcli command after the pacemaker 
resources are started successfully:


/iscsi/iqn.20iscsi0/tpgt1> luns/ create /backstores/iblock/iscsi0_lun0

to get the mappings right... what is wrong with my pacemaker config?

see my attachment or http://paste.debian.net/hidden/d5523f15/

Also, changing lun="1" to lun="0" in the iSCSILogicalUnit did not seem to 
change the behaviour.
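
For reference, a check along these lines in the targetcli shell shows what the
resource agents actually created before I run the manual command above (a
sketch; the full iqn is the one from the crm config below):

  root@godfrey:~# targetcli
  /> ls /iscsi/iqn.2011-04.nl.powercraft:storage.iscsi0/tpgt1

The luns/ and acls/ branches under the tpgt should show whether the LUN mapping
and the initiator ACLs were set up by the resource agents.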


Kind regards,

Jelle de Jong


root@godfrey:~# crm configure show
node finley \
    attributes standby="on"
node godfrey \
    attributes standby="off"
primitive drbd_r0 ocf:linbit:drbd \
    params drbd_resource="r0" \
    op start interval="0" timeout="240s" \
    op stop interval="0" timeout="100s" \
    op monitor interval="15s" timeout="30s"
primitive ip_virtual0 ocf:heartbeat:IPaddr2 \
    params nic="bond0" ip="192.168.24.40" cidr_netmask="32" \
    op monitor interval="10s"
primitive ip_virtual1 ocf:heartbeat:IPaddr2 \
    params nic="br0" ip="192.168.35.3" cidr_netmask="32" \
    op monitor interval="10s"
primitive iscsi0_lun0 ocf:heartbeat:iSCSILogicalUnit \
    params implementation="lio" target_iqn="iqn.2011-04.nl.powercraft:storage.iscsi0" lun="0" path="/dev/drbd0" \
    op stop interval="0" timeout="90s" \
    op monitor interval="10s"
primitive iscsi0_target ocf:heartbeat:iSCSITarget \
    params implementation="lio" iqn="iqn.2011-04.nl.powercraft:storage.iscsi0" allowed_initiators="iqn.1993-08.org.debian:01:551e12e22568 iqn.1993-08.org.debian:01:9753b4d3302c iqn.1993-08.org.debian:01:7786972a49ce" \
    op stop interval="0" timeout="120s" \
    op monitor interval="10s"
primitive ping_nodes ocf:pacemaker:ping \
    params host_list="192.168.24.1 192.168.24.17 192.168.24.18" multiplier="10" attempts="1" dampen="1" timeout="1" \
    op start interval="0" timeout="60" \
    op monitor interval="1" timeout="60"
group rg_iscsi iscsi0_target iscsi0_lun0 ip_virtual0 ip_virtual1
ms ms_drbd_r0 drbd_r0 \
    meta clone-max="2" clone-node-max="1" master-max="1" master-node-max="1" notify="true" target-role="Master"
clone ping_clone ping_nodes
location drbd-fence-by-handler-r0-ms_drbd_r0 ms_drbd_r0 \
    rule $id="drbd-fence-by-handler-r0-rule-ms_drbd_r0" $role="Master" -inf: #uname ne godfrey
# location drbd_r0-not-on-ebony ms_drbd_r0 rule -inf: #uname eq ebony
# location drbd_r1-not-on-ebony ms_drbd_r1 rule -inf: #uname eq ebony
# location drbd_r2-not-on-ebony ms_drbd_r2 rule -inf: #uname eq ebony
location drbd_r0-master-on-active-network ms_drbd_r0 \
    rule $id="drbd_r0-master-on-active-network-rule" $role="Master" -inf: not_defined pingd or pingd lte 0
location iscsi-on-active-network rg_iscsi \
    rule $id="iscsi-on-active-network-rule" -inf: not_defined pingd or pingd lte 0
colocation iscsi-with-drbd_r0-master inf: rg_iscsi ms_drbd_r0:Master
order iscsi-after-drbd_r0-promote inf: ms_drbd_r0:promote rg_iscsi:start
property $id="cib-bootstrap-options" \
    no-quorum-policy="ignore" \
    stonith-enabled="false" \
    dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
    cluster-infrastructure="openais" \
    expected-quorum-votes="2" \
    last-lrm-refresh="1496135114"
rsc_defaults $id="rsc-options" \
    resource-stickiness="200"



root@godfrey:~# targetcli
Welcome to the targetcli shell:

 Copyright (c) 2011 by RisingTide Systems LLC.

Visit us at http://www.risingtidesystems.com.

Using loopback fabric module.
Using iscsi fabric module.
Using qla2xxx fabric module.
Using ib_srpt fabric module.
Using tcm_fc fabric module.
/> ls
o- / ........................................................................ [...]
  o- backstores ............................................................. [...]
  | o- fileio ................................................. [0 Storage Object]
  | o- iblock ................................................. [1 Storage Object]
  | | o- iscsi0_lun0 ..................................... [/dev/drbd0 activated]
  | o- pscsi .................................................. [0 Storage Object]
  | o- rd_dr .................................................. [0 Storage Object]
  | o- rd_mcp ................................................. [0 Storage Object]
  o- ib_srpt .......................................................... [0 Target]
  o- iscsi ............................................................ [1 Target]
  | o-