Re: [ClusterLabs] Occasionally IPaddr2 resource fails to start

2019-10-21 Thread Donat Zenichev
Hello, and sorry for such a late response; I somehow missed your answer.

Sure, let me share some useful details on this.
First of all, the system specifics are:
- Hypervisor: VMware vSphere
- VM OS: Ubuntu 18.04 LTS
- Pacemaker version: 1.1.18-0ubuntu1.1

And yes, it's iproute2, version 4.15.0-2ubuntu1.

Worth mentioning: since I moved to another way of handling this (by
setting failure-timeout) I haven't seen any errors so far; the on-fail
action still remains "restart".
But obviously failure-timeout just clears the fail counters for me, so I
simply don't see the failures anymore.
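
For completeness, the failure-timeout tweak itself is just a meta
attribute on the primitive; in crmsh I set it roughly like this (a syntax
sketch from memory, please verify against your crmsh version):

```
crm resource meta IPSHARED set failure-timeout 900
```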

Another thing worth mentioning: the monitor operation for the IPaddr2
resource was failing in past years as well, I just didn't pay much
attention to it.
At that time the VMs under my control ran Ubuntu 14.04 and the hypervisor
was Proxmox of the 5.x branch (I cannot remember the exact version,
perhaps 5.4+).

This can indeed be a critical case, since a temporarily missing IP
address (e.g. for a DB serving hundreds of thousands of SQL requests) can
lead to a huge outage.
I don't have the first idea how to investigate this further, but I have a
staging setup where my hands are not tied, so let me know if we can
research something.
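
If it helps the research: the failure window could probably be caught
from outside the cluster by polling what IPaddr2's monitor essentially
verifies, i.e. whether the address is still on the interface (a rough
sketch, iproute2 assumed; interface and address are the ones from my
config quoted later in the thread):

```shell
#!/bin/sh
# Rough equivalent of IPaddr2's monitor check: is the VIP still
# configured on the given interface?
vip_present() {
    ip -o addr show dev "$1" 2>/dev/null | grep -q " inet $2/"
}

# e.g. poll during a snapshot to catch the moment the address vanishes:
#   while vip_present eth0 10.10.10.5; do sleep 1; done; date
```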

And have a nice day!

On Mon, Oct 7, 2019 at 7:21 PM Jan Pokorný  wrote:

> Donat,
>
> On 07/10/19 09:24 -0500, Ken Gaillot wrote:
> > If this always happens when the VM is being snapshotted, you can put
> > the cluster in maintenance mode (or even unmanage just the IP
> > resource) while the snapshotting is happening. I don't know of any
> > reason why snapshotting would affect only an IP, though.
>
> it might be interesting if you could share the details to grow the
> shared knowledge and experience in case there are some instances of
> these problems reported in the future.
>
> In particular, it'd be interesting to hear:
>
> - hypervisor
>
> - VM OS + if plain oblivious to running virtualized,
>   or "the optimal arrangement" (e.g., specialized drivers, virtio,
>   "guest additions", etc.)
>
> (I think IPaddr2 is iproute2-only, hence in turn, VM OS must be Linux)
>
> Of course, there might be more specific things to look at if anyone
> here is an expert with particular hypervisor technology and the way
> the networking works with it (no, not me at all).
>
> --
> Poki



-- 

Best regards,
Donat Zenichev
___
Manage your subscription:
https://lists.clusterlabs.org/mailman/listinfo/users

ClusterLabs home: https://www.clusterlabs.org/

Re: [ClusterLabs] Antw: Occasionally IPaddr2 resource fails to start

2019-10-07 Thread Donat Zenichev
Hello and thank you for your answer!

So should I just disable the "monitor" operation altogether? In my case
I'd rather delete the whole "op" line:
"op monitor interval=20 timeout=60 on-fail=restart"

Am I correct?
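
If I went that way, I suppose the primitive would reduce to something
like this (an untested sketch based on my config quoted further down):

```
primitive IPSHARED IPaddr2 \
params ip=10.10.10.5 nic=eth0 cidr_netmask=24 \
meta migration-threshold=2 failure-timeout=900 target-role=Started
```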

On Mon, Oct 7, 2019 at 2:36 PM Ulrich Windl <
ulrich.wi...@rz.uni-regensburg.de> wrote:

> Hi!
>
> I can't remember the exact reason, but it was probably exactly this that
> made us remove any monitor operation from IPaddr2 (back in 2011). So far no
> problems doing so ;-)
>
>
> Regards,
> Ulrich
> P.S.: Of course it would be nice if the real issue could be found and
> fixed.
>
> >>> Donat Zenichev  wrote on 20.09.2019 at 14:43 in message:
> > Hi there!
> >
> > I've got a tricky case where my IPaddr2 resource fails to start for
> > literally no reason:
> > "IPSHARED_monitor_2 on my-master-1 'not running' (7): call=11,
> > status=complete, exitreason='',
> >last-rc-change='Wed Sep 4 06:08:07 2019', queued=0ms, exec=0ms"
> >
> > The IPaddr2 resource then managed to recover by itself and continued
> > working properly.
> >
> > What I did afterwards was set 'failure-timeout=900' seconds on my
> > IPaddr2 resource, to keep the resource off a node where it fails. I
> > also set 'migration-threshold=2', so IPaddr2 may fail only 2 times
> > before it moves to the Slave side; meanwhile the Master gets banned
> > for 900 seconds.
> >
> > After 900 seconds the cluster tries to start IPaddr2 on the Master
> > again, and if it's OK the fail counter gets cleared.
> > That's how I avoid the error I mentioned above.
> >
> > I have tried hard to figure out why this can happen, but still have
> > no idea. Any clue how to find the reason?
> > And another question: can snapshotting of the VMs have any impact on
> > this?
> >
> > And my configuration:
> > ---
> > node 01: my-master-1
> > node 02: my-master-2
> >
> > primitive IPSHARED IPaddr2 \
> > params ip=10.10.10.5 nic=eth0 cidr_netmask=24 \
> > meta migration-threshold=2 failure-timeout=900 target-role=Started \
> > op monitor interval=20 timeout=60 on-fail=restart
> >
> > location PREFER_MASTER IPSHARED 100: my-master-1
> >
> > property cib-bootstrap-options: \
> > have-watchdog=false \
> > dc-version=1.1.18-2b07d5c5a9 \
> > cluster-infrastructure=corosync \
> > cluster-name=wall \
> > cluster-recheck-interval=5s \
> > start-failure-is-fatal=false \
> > stonith-enabled=false \
> > no-quorum-policy=ignore \
> > last-lrm-refresh=1554982967
> > ---
> >
> > Thanks in advance!
> >
> > --
> > --
> > BR, Donat Zenichev
>
>
>
>


-- 

Best regards,
Donat Zenichev

[ClusterLabs] Occasionally IPaddr2 resource fails to start

2019-09-20 Thread Donat Zenichev
Hi there!

I've got a tricky case where my IPaddr2 resource fails to start for
literally no reason:
"IPSHARED_monitor_2 on my-master-1 'not running' (7): call=11,
status=complete, exitreason='',
   last-rc-change='Wed Sep 4 06:08:07 2019', queued=0ms, exec=0ms"

The IPaddr2 resource then managed to recover by itself and continued
working properly.

What I did afterwards was set 'failure-timeout=900' seconds on my IPaddr2
resource, to keep the resource off a node where it fails. I also set
'migration-threshold=2', so IPaddr2 may fail only 2 times before it moves
to the Slave side; meanwhile the Master gets banned for 900 seconds.

After 900 seconds the cluster tries to start IPaddr2 on the Master again,
and if it's OK the fail counter gets cleared.
That's how I avoid the error I mentioned above.

I have tried hard to figure out why this can happen, but still have no
idea. Any clue how to find the reason?
And another question: can snapshotting of the VMs have any impact on this?

And my configuration:
---
node 01: my-master-1
node 02: my-master-2

primitive IPSHARED IPaddr2 \
params ip=10.10.10.5 nic=eth0 cidr_netmask=24 \
meta migration-threshold=2 failure-timeout=900 target-role=Started \
op monitor interval=20 timeout=60 on-fail=restart

location PREFER_MASTER IPSHARED 100: my-master-1

property cib-bootstrap-options: \
have-watchdog=false \
dc-version=1.1.18-2b07d5c5a9 \
cluster-infrastructure=corosync \
cluster-name=wall \
cluster-recheck-interval=5s \
start-failure-is-fatal=false \
stonith-enabled=false \
no-quorum-policy=ignore \
last-lrm-refresh=1554982967
---

Thanks in advance!

-- 
-- 
BR, Donat Zenichev

[ClusterLabs] monitor operation for ASTERISK on node_name: 7 (not running)

2017-11-08 Thread Donat Zenichev
  info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node1-Master/crmd/106, version=0.38.38)
Nov 07 15:06:16 [3956] node1-Master attrd: info: attrd_peer_update: Setting fail-count-ASTERISK[node1-Master]: 6 -> 7 from node2-Slave
Nov 07 15:06:16 [3956] node1-Master attrd: info: attrd_peer_update: Setting last-failure-ASTERISK[node1-Master]: 1510059507 -> 1510059976 from node2-Slave
Nov 07 15:06:16 [3955] node1-Master lrmd: info: cancel_recurring_action: Cancelling systemd operation SNMP_status_3
Nov 07 15:06:16 [3958] node1-Master crmd: info: do_lrm_rsc_op: Performing key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=SNMP_stop_0
Nov 07 15:06:16 [3955] node1-Master lrmd: info: log_execute: executing - rsc:SNMP action:stop call_id:89
Nov 07 15:06:16 [3958] node1-Master crmd: info: process_lrm_event: Result of monitor operation for SNMP on node1-Master: Cancelled | call=87 key=SNMP_monitor_3 confirmed=true
Nov 07 15:06:16 [3955] node1-Master lrmd: info: systemd_exec_result: Call to stop passed: /org/freedesktop/systemd1/job/10916
Nov 07 15:06:18 [3958] node1-Master crmd: notice: process_lrm_event: Result of stop operation for SNMP on node1-Master: 0 (ok) | call=89 key=SNMP_stop_0 confirmed=true cib-update=107
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_process_request: Forwarding cib_modify operation for section status to all (origin=local/crmd/107)
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: Diff: --- 0.38.38 2
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: Diff: +++ 0.38.39 (null)
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: + /cib:  @num_updates=39
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_perform_op: + /cib/status/node_state[@id='178676749']/lrm[@id='178676749']/lrm_resources/lrm_resource[@id='SNMP']/lrm_rsc_op[@id='SNMP_last_0']: @operation_key=SNMP_stop_0, @operation=stop, @transition-key=10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @transition-magic=0:0;10:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934, @call-id=89, @last-run=1510059976, @last-rc-change=1510059976, @exec-time=2047
Nov 07 15:06:18 [3953] node1-Master cib: info: cib_process_request: Completed cib_modify operation for section status: OK (rc=0, origin=node1-Master/crmd/107, version=0.38.39)
Nov 07 15:06:18 [3955] node1-Master lrmd: info: cancel_recurring_action: Cancelling systemd operation ASTERISK_status_2000
Nov 07 15:06:18 [3958] node1-Master crmd: info: do_lrm_rsc_op: Performing key=3:29764:0:d96266b4-0e4d-4718-8af5-7b6e2edf4934 op=ASTERISK_stop_0
Nov 07 15:06:18 [3955] node1-Master lrmd: info: log_execute: executing - rsc:ASTERISK action:stop call_id:91
Nov 07 15:06:18 [3958] node1-Master crmd: info: process_lrm_event: Result of monitor operation for ASTERISK on node1-Master: Cancelled | call=85 key=ASTERISK_monitor_2000 confirmed=true
Nov 07 15:06:18 [3955] node1-Master lrmd: info: systemd_exec_result: Call to stop passed: /org/freedesktop/systemd1/job/10917



Asterisk with the same configuration works fine on a regular virtual
machine (not clustered), with the same resource parameters.
So I think the problem lies in the interaction between the Pacemaker
monitor operation and the Asterisk daemon, maybe delays or something similar.
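
To poke at that interaction by hand, one can approximate what a
systemd-class monitor effectively does: query the unit state and map it
to an OCF-style return code (a simplification; Pacemaker's actual mapping
is more involved):

```shell
#!/bin/sh
# Approximate a systemd-class monitor: query the unit's active state and
# map it to OCF-style codes (0 = running, 7 = not running). SYSTEMCTL is
# overridable purely so the sketch can be exercised without systemd.
monitor_unit() {
    state=$("${SYSTEMCTL:-systemctl}" is-active "$1" 2>/dev/null)
    case "$state" in
        active) return 0 ;;  # OCF_SUCCESS
        *)      return 7 ;;  # OCF_NOT_RUNNING
    esac
}

# e.g. run it in a loop next to the Asterisk daemon and watch for blips:
#   monitor_unit asterisk; echo "monitor rc=$?"
```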

Thanks in advance for answers/hints.


-- 
-- 
BR, Donat Zenichev
Wnet VoIP team
Tel:  +380(44) 5-900-808
http://wnet.ua
___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


[ClusterLabs] ClusterMon mail notification - does not work

2017-10-10 Thread Donat Zenichev
Hi all.

I've just tried to add an e-mail notification for my cluster built with
Pacemaker, but ClusterMon doesn't seem to work as expected.

I have ssmtp configured on Ubuntu 16.04.

I configured the cluster monitor via crmsh:
primitive cluster_mon ocf:pacemaker:ClusterMon \
params pidfile="/var/run/crm/crm_mon.pid" htmlfile="/var/tmp/crm_mon.html"
extra_options="--mail-from=nore...@domain.ua --mail-to=do...@domain.ua
 --mail-host=mail.domain.ua:465" \
op start interval=0 timeout=90s \
op stop interval=0 timeout=100s \
op monitor interval=10s timeout=20s \
meta target-role=Started

The real domain is changed in this example.

So when I stop a certain resource:
crm resource stop IPSHARED

the resource does stop, but nothing arrives at the e-mail destination.
Where did I go wrong?
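
One thing I'm unsure about is whether crm_mon's built-in mail support
(which ClusterMon wraps) speaks SMTPS on port 465 at all. As a possible
workaround, crm_mon also accepts an external agent via
extra_options="-E /path/to/agent.sh"; the agent receives the event in
CRM_notify_* environment variables. A minimal handler might look like
this (the ssmtp path and recipient address are assumptions, not from my
setup; MAILER is overridable only so the formatting can be checked by
hand):

```shell
#!/bin/sh
# Hypothetical external agent for crm_mon's -E option: crm_mon exports
# event details as CRM_notify_* environment variables; format them into
# one line and pipe it to a mailer.
notify_event() {
    printf '%s %s on %s: %s (rc=%s)\n' \
        "${CRM_notify_task:-?}" "${CRM_notify_rsc:-?}" \
        "${CRM_notify_node:-?}" "${CRM_notify_desc:-?}" \
        "${CRM_notify_rc:-?}" |
        ${MAILER:-/usr/sbin/ssmtp admin@example.com}
}

# When installed as the agent, the script would simply call: notify_event
```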

-- 
-- 
BR, Donat Zenichev
Wnet VoIP team
Tel:  +380(44) 5-900-808
http://wnet.ua