Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Dmitri Maziuk

On 2016-09-20 09:53, Ken Gaillot wrote:


I do think ifdown is not quite the best failure simulation, since there
aren't that many real-world situations that merely take an interface
down. To simulate network loss (without pulling the cable), I think
maybe using the firewall to block all traffic to and from the interface
might be better.


Or unloading the driver module to simulate NIC hardware failure.

Depending on how closely you look at the interface, it may or may not
matter that pulling the cable (or the other side going down) results in
NO-CARRIER, whereas firewalling it off does not.
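Both suggestions can be sketched as commands (a hedged sketch: the interface name bond0 and the driver name are assumptions taken from this thread, every command needs root, and the firewall rules will cut the node off on that interface until removed):

```shell
# Simulate network loss without touching link state: drop all traffic
# on the interface (no NO-CARRIER; the interface stays UP/RUNNING).
iptables -A INPUT  -i bond0 -j DROP
iptables -A OUTPUT -o bond0 -j DROP

# Simulate NIC hardware failure: unload the driver module so the
# interface vanishes entirely ("bonding" is an assumption here; use the
# actual NIC driver, e.g. e1000e, for a physical card).
modprobe -r bonding

# Undo the firewall part of the test afterwards:
iptables -D INPUT  -i bond0 -j DROP
iptables -D OUTPUT -o bond0 -j DROP
```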


Dima


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Dejan Muhamedagic
On Tue, Sep 20, 2016 at 01:13:23PM +, Auer, Jens wrote:
> Hi,
> 
> >> I've decided to create two answers for the two problems. The cluster
> >> still fails to relocate the resource after unloading the modules even
> >> with resource-agents 3.9.7
> > From the point of view of the resource agent,
> > you configured it to use a non-existing network.
> > Which it considers to be a configuration error,
> > which is treated by pacemaker as
> > "don't try to restart anywhere
> > but let someone else configure it properly, first".
> > Still, I have yet to see what scenario you are trying to test here.
> > To me, this still looks like "scenario evil admin".  If so, I'd not even
> > try, at least not on the pacemaker configuration level.
> It's not an evil admin, as that would not make sense. I am trying to find a way 
> to force a failover condition, e.g. by simulating a network card defect or 
> network outage, without running to the server room every time. 

Better to use iptables. Bringing the interface down is not the same
as the network card going bad.

Thanks,

Dejan

> > CONFIDENTIALITY NOTICE:
> > Oh please :-/
> > This is a public mailing list.
> Sorry, this is a standard disclaimer I usually remove. We are forced to add 
> this to e-mails, but I think this is fairly common for commercial companies.
> 
> >> Also the netmask and the ip address are wrong. I have configured the
> >> device to 192.168.120.10 with netmask 192.168.120.10. How does IpAddr2
> >> get the wrong configuration? I have no idea.
> >A netmask of "192.168.120.10" is nonsense.
> >That is the address, not a mask.
> Oops, my fault when writing the e-mail. Obviously this is the address. The 
> configured netmask for the device is 255.255.255.0, but after IPaddr2 brings 
> it up again it is 255.255.255.255 which is not what I configured in the 
> betwork configuration. 
> 
> > Also, according to some posts back,
> > you have configured it in pacemaker with
> > cidr_netmask=32, which is not particularly useful either.
> Thanks for pointing this out. I copied the parameters from the 
> manual/tutorial, but did not think about the values.
> 
> > Again: the IPaddr2 resource agent is supposed to control the assignment
> > of an IP address, hence the name.
> > It is not supposed to create or destroy network interfaces,
> > or configure bonding, or bridges, or anything like that.
> > In fact, it is not even supposed to bring up or down the interfaces,
> > even though for "convenience" it seems to do "ip link set up".
> This is what made me wonder in the beginning. When I bring down the device, 
> this leads to a failure of the resource agent which is exactly what I 
> expected. I did not expect it to bring the device up again, and definitely 
> not ignoring the default network configuration.
> 
> > Monitoring connectivity, or dealing with removed interface drivers,
> > or unplugged devices, or whatnot, has to be dealt with elsewhere.
> I am using a ping daemon for that. 
> 
> > What you did is: down the bond, remove all slave assignments, even
> > remove the driver, and expect the resource agent to "heal" things that
> > it does not know about. It can not.
> I am not expecting the RA to heal anything. How could it? And why would I 
> expect it? In fact I am expecting the opposite, that is, a consistent failure 
> when the device is down. This may also be wrong because you can assign IP 
> addresses to downed devices.
> 
> My initial expectation was that the resource cannot be started when the 
> device is down and is then relocated. I think this is more or less the core 
> functionality of the cluster. I can see a reason why it does not switch to 
> another node when there is a configuration error in the cluster because it is 
> fair to assume that the configuration is identical (wrong) on all nodes. But 
> what happens if the network device is broken? The server would start, fail to 
> assign the ip address and then prevent the whole cluster from working? What 
> happens if the network card breaks while the cluster is running? 
> 
> Best wishes,
>   Jens
> 



Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Auer, Jens
Hi,

>> I've decided to create two answers for the two problems. The cluster
>> still fails to relocate the resource after unloading the modules even
>> with resource-agents 3.9.7
> From the point of view of the resource agent,
> you configured it to use a non-existing network.
> Which it considers to be a configuration error,
> which is treated by pacemaker as
> "don't try to restart anywhere
> but let someone else configure it properly, first".
> Still, I have yet to see what scenario you are trying to test here.
> To me, this still looks like "scenario evil admin".  If so, I'd not even
> try, at least not on the pacemaker configuration level.
It's not an evil admin, as that would not make sense. I am trying to find a way to 
force a failover condition, e.g. by simulating a network card defect or network 
outage, without running to the server room every time. 

> CONFIDENTIALITY NOTICE:
> Oh please :-/
> This is a public mailing list.
Sorry, this is a standard disclaimer I usually remove. We are forced to add 
this to e-mails, but I think this is fairly common for commercial companies.

>> Also the netmask and the ip address are wrong. I have configured the
>> device to 192.168.120.10 with netmask 192.168.120.10. How does IpAddr2
>> get the wrong configuration? I have no idea.
>A netmask of "192.168.120.10" is nonsense.
>That is the address, not a mask.
Oops, my fault when writing the e-mail. Obviously this is the address. The 
configured netmask for the device is 255.255.255.0, but after IPaddr2 brings it 
up again it is 255.255.255.255, which is not what I configured in the network 
configuration. 

> Also, according to some posts back,
> you have configured it in pacemaker with
> cidr_netmask=32, which is not particularly useful either.
Thanks for pointing this out. I copied the parameters from the manual/tutorial, 
but did not think about the values.

> Again: the IPaddr2 resource agent is supposed to control the assignment
> of an IP address, hence the name.
> It is not supposed to create or destroy network interfaces,
> or configure bonding, or bridges, or anything like that.
> In fact, it is not even supposed to bring up or down the interfaces,
> even though for "convenience" it seems to do "ip link set up".
This is what made me wonder in the beginning. When I bring down the device, 
this leads to a failure of the resource agent which is exactly what I expected. 
I did not expect it to bring the device up again, and definitely not ignoring 
the default network configuration.

> Monitoring connectivity, or dealing with removed interface drivers,
> or unplugged devices, or whatnot, has to be dealt with elsewhere.
I am using a ping daemon for that. 
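For reference, such a ping daemon is typically the ocf:pacemaker:ping agent cloned across the nodes, plus a location rule keyed on the pingd attribute (a hedged sketch: the resource name ping-gw and the gateway address 192.168.120.1 are assumptions, not taken from this thread):

```shell
# Clone a ping resource on all nodes; it records reachability of the
# targets in the node attribute "pingd".
pcs resource create ping-gw ocf:pacemaker:ping \
    host_list=192.168.120.1 dampen=5s op monitor interval=10s clone

# Keep mda-ip off any node that cannot reach the ping target.
pcs constraint location mda-ip rule score=-INFINITY \
    pingd lt 1 or not_defined pingd
```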

> What you did is: down the bond, remove all slave assignments, even
> remove the driver, and expect the resource agent to "heal" things that
> it does not know about. It can not.
I am not expecting the RA to heal anything. How could it? And why would I 
expect it? In fact I am expecting the opposite, that is, a consistent failure 
when the device is down. This may also be wrong because you can assign IP 
addresses to downed devices.

My initial expectation was that the resource cannot be started when the device 
is down and is then relocated. I think this is more or less the core functionality 
of the cluster. I can see a reason why it does not switch to another node when 
there is a configuration error in the cluster because it is fair to assume that 
the configuration is identical (wrong) on all nodes. But what happens if the 
network device is broken? The server would start, fail to assign the ip address 
and then prevent the whole cluster from working? What happens if the network 
card breaks while the cluster is running? 

Best wishes,
  Jens



Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Lars Ellenberg
On Tue, Sep 20, 2016 at 11:44:58AM +, Auer, Jens wrote:
> Hi,
> 
> I've decided to create two answers for the two problems. The cluster
> still fails to relocate the resource after unloading the modules even
> with resource-agents 3.9.7

From the point of view of the resource agent,
you configured it to use a non-existing network.
Which it considers to be a configuration error,
which is treated by pacemaker as
"don't try to restart anywhere
but let someone else configure it properly, first".

I think the OCF_ERR_CONFIGURED is good, though, otherwise 
configuration errors might go unnoticed for quite some time.
A network interface is not supposed to "vanish".

You may disagree with that choice,
in which case you could edit the resource agent to treat it not as
configuration error, but as "required component not installed"
(OCF_ERR_CONFIGURED vs OCF_ERR_INSTALLED), and pacemaker will
"try to find some other node with required components available",
before giving up completely.
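The distinction can be illustrated with the OCF exit codes involved (a hedged sketch: interface_check is a hypothetical helper, not the actual IPaddr2 code path; the code values are from the OCF resource agent API):

```shell
# OCF exit codes relevant here:
OCF_SUCCESS=0
OCF_ERR_INSTALLED=5    # required component missing on this node; other nodes may be tried
OCF_ERR_CONFIGURED=6   # resource definition itself invalid; never started anywhere

# Hypothetical failure path: report a missing interface as a node-local
# problem (OCF_ERR_INSTALLED) rather than as a configuration error.
interface_check() {
    if [ -e "/sys/class/net/$1" ]; then
        return "$OCF_SUCCESS"
    fi
    return "$OCF_ERR_INSTALLED"   # the stock agent effectively treats this as OCF_ERR_CONFIGURED
}
```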

Still, I have yet to see what scenario you are trying to test here.
To me, this still looks like "scenario evil admin".  If so, I'd not even
try, at least not on the pacemaker configuration level.

> CONFIDENTIALITY NOTICE:

Oh please :-/
This is a public mailing list.

> There seems to be some difference because the device is not RUNNING;

> Also the netmask and the ip address are wrong. I have configured the
> device to 192.168.120.10 with netmask 192.168.120.10. How does IpAddr2
> get the wrong configuration? I have no idea.

A netmask of "192.168.120.10" is nonsense.
That is the address, not a mask.

Also, according to some posts back,
you have configured it in pacemaker with
cidr_netmask=32, which is not particularly useful either.

You should use the netmask of whatever subnet is supposedly actually
reachable via that address and interface. Typical masks are e.g.
/24, /20, /16 resp. 255.255.255.0, 255.255.240.0, 255.255.0.0
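Those prefix-to-mask mappings can be checked with a small shell helper (a sketch using plain shell arithmetic, no cluster tooling involved); it also shows why cidr_netmask=32 ends up as the 255.255.255.255 seen on the interface:

```shell
# Convert a CIDR prefix length (0-32) to a dotted-quad netmask.
prefix_to_netmask() {
    local prefix=$1
    local mask=$(( (0xFFFFFFFF << (32 - prefix)) & 0xFFFFFFFF ))
    echo "$(( (mask >> 24) & 255 )).$(( (mask >> 16) & 255 )).$(( (mask >> 8) & 255 )).$(( mask & 255 ))"
}

for p in 24 20 16 32; do
    echo "/$p -> $(prefix_to_netmask "$p")"
done
```

This prints 255.255.255.0, 255.255.240.0 and 255.255.0.0 for /24, /20 and /16, matching the list above, and 255.255.255.255 for /32.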

Apparently the RA is "nice" enough (or maybe buggy enough)
to let that slip, and guess the netmask from the routing tables,
or fall back to whatever builtin defaults there are on the various
layers of tools involved.

Again: the IPaddr2 resource agent is supposed to control the assignment
of an IP address, hence the name.

It is not supposed to create or destroy network interfaces,
or configure bonding, or bridges, or anything like that.

In fact, it is not even supposed to bring up or down the interfaces,
even though for "convenience" it seems to do "ip link set up".

That is not a bug, but limited scope.

If you wanted to test the reaction of the cluster to a vanishing
IP address, the correct test would be an
  "ip addr del 192.168.120.10 dev bond0"

And the expectation is that it will notice, and just re-add the address.
That is the scope of the IPaddr2 resource agent.

Monitoring connectivity, or dealing with removed interface drivers,
or unplugged devices, or whatnot, has to be dealt with elsewhere.

What you did is: down the bond, remove all slave assignments, even
remove the driver, and expect the resource agent to "heal" things that
it does not know about. It can not.

-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT



Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Auer, Jens
Hi,

one thing to add is that everything works as expected when I physically unplug 
the network cables to force a failover. 

Best wishes,
  Jens

--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.a...@cgi.com
Our mandatory disclosures pursuant to § 35a GmbHG / §§ 161, 125a HGB can be 
found at de.cgi.com/pflichtangaben.

CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI 
Group Inc. and its affiliates may be contained in this message. If you are not 
a recipient indicated or intended in this message (or responsible for delivery 
of this message to such person), or you think for any reason that this message 
may have been addressed to you in error, you may not use or copy or deliver 
this message to anyone else. In such case, you should destroy this message and 
are asked to notify the sender by reply e-mail.


From: Auer, Jens [jens.a...@cgi.com]
Sent: Tuesday, 20 September 2016 13:44
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down 
network device

Hi,

I've decided to create two answers for the two problems. The cluster still 
fails to relocate the resource after unloading the modules even with 
resource-agents 3.9.7
MDA1PFP-S01 11:42:50 2533 0 ~ # yum list resource-agents
Loaded plugins: langpacks, product-id, search-disabled-repos, 
subscription-manager
Installed Packages
resource-agents.x86_64    3.9.7-4.el7    @/resource-agents-3.9.7-4.el7.x86_64

Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]: warning: Action 9 (mda-ip_start_0) on 
MDA1PFP-PCS01 failed (target: 0 vs. rc: 6): Error
Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]: warning: Action 9 (mda-ip_start_0) on 
MDA1PFP-PCS01 failed (target: 0 vs. rc: 6): Error
Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]:  notice: Transition 5 (Complete=3, 
Pending=0, Fired=0, Skipped=0, Incomplete=1, 
Source=/var/lib/pacemaker/pengine/pe-input-552.bz2): Complete
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:  notice: On loss of CCM Quorum: 
Ignore
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]: warning: Processing failed op start 
for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:   error: Preventing mda-ip from 
re-starting anywhere: operation start failed 'not configured' (6)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]: warning: Processing failed op start 
for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:   error: Preventing mda-ip from 
re-starting anywhere: operation start failed 'not configured' (6)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:  notice: Stop    mda-ip 
(MDA1PFP-PCS01)
Sep 20 11:42:52 MDA1PFP-S01 pengine[13907]:  notice: Calculated Transition 6: 
/var/lib/pacemaker/pengine/pe-input-553.bz2
Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]:  notice: Initiating action 2: stop 
mda-ip_stop_0 on MDA1PFP-PCS01 (local)
Sep 20 11:42:52 MDA1PFP-S01 IPaddr2(mda-ip)[15336]: INFO: IP status = no, 
IP_CIP=
Sep 20 11:42:52 MDA1PFP-S01 lrmd[13905]:  notice: mda-ip_stop_0:15336:stderr [ 
Device "bond0" does not exist. ]
Sep 20 11:42:52 MDA1PFP-S01 crmd[13908]:  notice: Operation mda-ip_stop_0: ok 
(node=MDA1PFP-PCS01, call=18, rc=0, cib-update=48, confirmed=true)
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Retransmit List: 93
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Retransmit List: 93 96 98
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Retransmit List: 93 98 9a 
9c
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Marking ringid 1 
interface 192.168.120.10 FAULTY
Sep 20 11:42:53 MDA1PFP-S01 corosync[13887]: [TOTEM ] Retransmit List: 98 9c 9f 
a1
Sep 20 11:42:53 MDA1PFP-S01 crmd[13908]:  notice: Transition 6 (Complete=2, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-553.bz2): Complete
Sep 20 11:42:53 MDA1PFP-S01 crmd[13908]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]
Sep 20 11:42:53 MDA1PFP-S01 crmd[13908]:  notice: State transition S_IDLE -> 
S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Sep 20 11:42:53 MDA1PFP-S01 pengine[13907]:  notice: On loss of CCM Quorum: 
Ignore
Sep 20 11:42:53 MDA1PFP-S01 pengine[13907]: warning: Processing failed op start 
for mda-ip on MDA1PFP-PCS01: not configured (6)
Sep 20 11:42:53 MDA1PFP-S01 pengine[13907]:   error: Preventing mda-ip from 
re-starting anywhere: operation start failed 'not configured' (6)
Sep 20 11:42:53 MDA1PFP-S01 pengine[13907]: warning: Forcing mda-ip away from 
MDA1PFP-PCS01 after 100 failures (max=100)
Sep 20 11:42:53 MDA1P

Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Auer, Jens
192.168.120.20: icmp_seq=3 ttl=64 time=0.029 ms

MDA1PFP-S02 11:33:31 1273 0 ~ # ping 192.168.120.20
PING 192.168.120.20 (192.168.120.20) 56(84) bytes of data.
From 192.168.120.11 icmp_seq=10 Destination Host Unreachable
From 192.168.120.11 icmp_seq=11 Destination Host Unreachable
From 192.168.120.11 icmp_seq=12 Destination Host Unreachable

Best wishes,
  Jens




From: Ken Gaillot [kgail...@redhat.com]
Sent: Monday, 19 September 2016 17:31
To: users@clusterlabs.org
Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down 
network device

On 09/19/2016 10:04 AM, Jan Pokorný wrote:
> On 19/09/16 10:18 +, Auer, Jens wrote:
>> Ok, after reading the log files again I found
>>
>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Initiating action 3: stop 
>> mda-ip_stop_0 on MDA1PFP-PCS01 (local)
>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: 
>> MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface 
>> [bond0] No such device.\n ]
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface 
>> [bond0] No such device.
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
>> Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: mda-ip_stop_0:8745:stderr [ 
>> ocf-exit-reason:Unknown interface [bond0] No such device. ]
>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Operation mda-ip_stop_0: ok 
>> (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 3 (Complete=2, 
>> Pending=0, Fired=0, Skipped=0, Incomplete=0, 
>> Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition 
>> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
>> origin=notify_crmd ]
>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> 
>> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
>> origin=abort_transition_graph ]
>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: 
>> Ignore
>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op 
>> monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from 
>> re-starting anywhere: operation monitor failed 'not configured' (6)
>>
>> I think that explains why the resource is not started on the other
>> node, but I am not sure this is a good decision. It seems to be a
>> little harsh to prevent the resource from starting anywhere,
>> especially considering that the other node will be able to start the
>> resource.

The resource agent is supposed to return "not configured" only when the
*pacemaker* configuration of the resource is inherently invalid, so
there's no chance of it starting anywhere.

As Jan suggested, make sure you've applied any resource-agents updates.
If that doesn't fix it, it sounds like a bug in the agent, or something
really is wrong with your pacemaker resource config.

>
> The problem to start with is that based on
>
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface 
>> [bond0] No such device.
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
>
> you may be using too ancient a version of resource-agents:
>
> https://github.com/ClusterLabs/resource-agents/pull/320
>
> so until you update, the troubleshooting would be quite moot.



Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-20 Thread Auer, Jens
nput-555.bz2
Sep 20 11:43:02 MDA1PFP-S01 crmd[13908]:  notice: Transition 8 (Complete=0, 
Pending=0, Fired=0, Skipped=0, Incomplete=0, 
Source=/var/lib/pacemaker/pengine/pe-input-555.bz2): Complete
Sep 20 11:43:02 MDA1PFP-S01 crmd[13908]:  notice: State transition 
S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
origin=notify_crmd ]

Cheers,
  Jens



From: Auer, Jens [jens.a...@cgi.com]
Sent: Monday, 19 September 2016 16:36
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down 
network device

Hi,

>> After the restart ifconfig still shows the device bond0 to be not RUNNING:
>> MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
>> bond0: flags=5123<UP,BROADCAST,MASTER,MULTICAST>  mtu 1500
>> inet 192.168.120.20  netmask 255.255.255.255  broadcast 0.0.0.0
>> ether a6:17:2c:2a:72:fc  txqueuelen 3  (Ethernet)
>> RX packets 2034  bytes 286728 (280.0 KiB)
>> RX errors 0  dropped 29  overruns 0  frame 0
>> TX packets 2284  bytes 355975 (347.6 KiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

There seems to be some difference because the device is not RUNNING:
mdaf-pf-pep-spare 14:17:53 999 0 ~ # ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
inet 192.168.120.10  netmask 255.255.255.0  broadcast 192.168.120.255
inet6 fe80::5eb9:1ff:fe9c:e7fc  prefixlen 64  scopeid 0x20<link>
ether 5c:b9:01:9c:e7:fc  txqueuelen 3  (Ethernet)
RX packets 15455692  bytes 22377220306 (20.8 GiB)
RX errors 0  dropped 2392  overruns 0  frame 0
TX packets 14706747  bytes 21361519159 (19.8 GiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Also the netmask and the ip address are wrong. I have configured the device to 
192.168.120.10 with netmask 192.168.120.10. How does IpAddr2 get the wrong 
configuration? I have no idea.

>Anyway, you should rather be using the "ip" command from the iproute suite
>than various if* tools that come up short in some cases:
>http://inai.de/2008/02/19
>This would also be consistent with what IPaddr2 uses under the hood.

We are using RedHat 7, which uses either NetworkManager or the network 
scripts. We use the latter, and ifup/ifdown should be the correct way to manage 
the network card. I also tried using ip link set dev bond0 up/down, and it brings 
up the device with the correct ip address and network mask.

Best wishes,
  Jens



From: Jan Pokorný [jpoko...@redhat.com]
Sent: Monday, 19 September 2016 14:57
To: Cluster Labs - All topics related to open-source clustering welcomed
Subject: Re: [ClusterLabs] Virtual ip resource restarted on node with down 
network device

On 19/09/16 09:15 +, Auer, Jens wrote:
> After the restart ifconfig still shows the device bond0 to be not RUNNING:
> MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
> bond0: flags=5123<UP,BROADCAST,MASTER,MULTICAST>  mtu 1500
> inet 192.168.120.20  netmask 255.255.255.255  broadcast 0.0.0.0
> ether a6:17:2c:2a:72:fc  txqueuelen 3  (Ethernet)
> RX packets 2034  bytes 286728 (280.0 KiB)
> RX errors 0  dropped 29  over

Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-19 Thread Ken Gaillot
On 09/19/2016 10:04 AM, Jan Pokorný wrote:
> On 19/09/16 10:18 +, Auer, Jens wrote:
>> Ok, after reading the log files again I found 
>>
>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Initiating action 3: stop 
>> mda-ip_stop_0 on MDA1PFP-PCS01 (local)
>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: 
>> MDA1PFP-PCS01-mda-ip_monitor_1000:14 [ ocf-exit-reason:Unknown interface 
>> [bond0] No such device.\n ]
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface 
>> [bond0] No such device.
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
>> Sep 19 10:03:45 MDA1PFP-S01 lrmd[7794]:  notice: mda-ip_stop_0:8745:stderr [ 
>> ocf-exit-reason:Unknown interface [bond0] No such device. ]
>> Sep 19 10:03:45 MDA1PFP-S01 crmd[7797]:  notice: Operation mda-ip_stop_0: ok 
>> (node=MDA1PFP-PCS01, call=16, rc=0, cib-update=49, confirmed=true)
>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: Transition 3 (Complete=2, 
>> Pending=0, Fired=0, Skipped=0, Incomplete=0, 
>> Source=/var/lib/pacemaker/pengine/pe-input-501.bz2): Complete
>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition 
>> S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL 
>> origin=notify_crmd ]
>> Sep 19 10:03:46 MDA1PFP-S01 crmd[7797]:  notice: State transition S_IDLE -> 
>> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
>> origin=abort_transition_graph ]
>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:  notice: On loss of CCM Quorum: 
>> Ignore
>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]: warning: Processing failed op 
>> monitor for mda-ip on MDA1PFP-PCS01: not configured (6)
>> Sep 19 10:03:46 MDA1PFP-S01 pengine[7796]:   error: Preventing mda-ip from 
>> re-starting anywhere: operation monitor failed 'not configured' (6)
>>
>> I think that explains why the resource is not started on the other
>> node, but I am not sure this is a good decision. It seems to be a
>> little harsh to prevent the resource from starting anywhere,
>> especially considering that the other node will be able to start the
>> resource. 

The resource agent is supposed to return "not configured" only when the
*pacemaker* configuration of the resource is inherently invalid, so
there's no chance of it starting anywhere.

As Jan suggested, make sure you've applied any resource-agents updates.
If that doesn't fix it, it sounds like a bug in the agent, or something
really is wrong with your pacemaker resource config.

> 
> The problem to start with is that based on 
> 
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: ERROR: Unknown interface 
>> [bond0] No such device.
>> Sep 19 10:03:45 MDA1PFP-S01 IPaddr2(mda-ip)[8745]: WARNING: [findif] failed
> 
> you may be using too ancient a version of resource-agents:
> 
> https://github.com/ClusterLabs/resource-agents/pull/320
> 
> so until you update, the troubleshooting would be quite moot.



Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-19 Thread Auer, Jens
Hi,

>> After the restart ifconfig still shows the device bond0 to be not RUNNING:
>> MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
>> bond0: flags=5123<UP,BROADCAST,MASTER,MULTICAST>  mtu 1500
>> inet 192.168.120.20  netmask 255.255.255.255  broadcast 0.0.0.0
>> ether a6:17:2c:2a:72:fc  txqueuelen 3  (Ethernet)
>> RX packets 2034  bytes 286728 (280.0 KiB)
>> RX errors 0  dropped 29  overruns 0  frame 0
>> TX packets 2284  bytes 355975 (347.6 KiB)
>> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

There seems to be some difference because the device is not RUNNING:
mdaf-pf-pep-spare 14:17:53 999 0 ~ # ifconfig
bond0: flags=5187<UP,BROADCAST,RUNNING,MASTER,MULTICAST>  mtu 1500
inet 192.168.120.10  netmask 255.255.255.0  broadcast 192.168.120.255
inet6 fe80::5eb9:1ff:fe9c:e7fc  prefixlen 64  scopeid 0x20
ether 5c:b9:01:9c:e7:fc  txqueuelen 3  (Ethernet)
RX packets 15455692  bytes 22377220306 (20.8 GiB)
RX errors 0  dropped 2392  overruns 0  frame 0
TX packets 14706747  bytes 21361519159 (19.8 GiB)
TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

Also the netmask and the IP address are wrong. I have configured the device as 
192.168.120.10 with netmask 255.255.255.0. How does IPaddr2 get the wrong 
configuration? I have no idea.

>Anyway, you should rather be using "ip" command from iproute suite
>than various if* tools that come short in some cases:
>http://inai.de/2008/02/19
>This would also be consistent with IPaddr2 uses under the hood.

We are using RedHat 7, which uses either NetworkManager or the network 
scripts. We use the latter, so ifup/ifdown should be the correct way to manage the 
network card. I also tried ip link set dev bond0 up/down, and it brings up 
the device with the correct IP address and network mask. 
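
For reference, a rough iproute2 equivalent of the ifconfig/ifdown workflow discussed above (device name bond0 taken from the thread; note that link-level flags such as NO-CARRIER show up in the ip output but not in ifconfig):

```shell
# Inspect link state; "state DOWN" / NO-CARRIER appear here
ip link show bond0
# Show IPv4 addresses with their prefix lengths (the netmask equivalent)
ip -4 addr show dev bond0
# Rough ifdown/ifup equivalents at the link level (these do not run the
# distribution's network scripts, so bonding slaves are not reconfigured)
ip link set dev bond0 down
ip link set dev bond0 up
```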

Best wishes,
  Jens

--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.a...@cgi.com
Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie unter 
de.cgi.com/pflichtangaben.

CONFIDENTIALITY NOTICE: Proprietary/Confidential information belonging to CGI 
Group Inc. and its affiliates may be contained in this message. If you are not 
a recipient indicated or intended in this message (or responsible for delivery 
of this message to such person), or you think for any reason that this message 
may have been addressed to you in error, you may not use or copy or deliver 
this message to anyone else. In such case, you should destroy this message and 
are asked to notify the sender by reply e-mail.


Von: Jan Pokorný [jpoko...@redhat.com]
Gesendet: Montag, 19. September 2016 14:57
An: Cluster Labs - All topics related to open-source clustering welcomed
Betreff: Re: [ClusterLabs] Virtual ip resource restarted on node with down 
network device

On 19/09/16 09:15 +, Auer, Jens wrote:
> After the restart ifconfig still shows the device bond0 to be not RUNNING:
> MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
> bond0: flags=5123<UP,BROADCAST,MASTER,MULTICAST>  mtu 1500
> inet 192.168.120.20  netmask 255.255.255.255  broadcast 0.0.0.0
> ether a6:17:2c:2a:72:fc  txqueuelen 3  (Ethernet)
> RX packets 2034  bytes 286728 (280.0 KiB)
> RX errors 0  dropped 29  overruns 0  frame 0
> TX packets 2284  bytes 355975 (347.6 KiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

This seems to suggest bond0 interface is up and address-assigned
(well, the netmask is strange).  So there would be nothing
contradictory to what I said on the address of IPaddr2.

Anyway, you should rather be using "ip" command from iproute suite
than various if* tools that come short in some cases:
http://inai.de/2008/02/19
This would also be consistent with IPaddr2 uses under the hood.

--
Jan (Poki)


Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-19 Thread Lars Ellenberg
On Mon, Sep 19, 2016 at 02:57:57PM +0200, Jan Pokorný wrote:
> On 19/09/16 09:15 +, Auer, Jens wrote:
> > After the restart ifconfig still shows the device bond0 to be not RUNNING:
> > MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
> > bond0: flags=5123<UP,BROADCAST,MASTER,MULTICAST>  mtu 1500
> > inet 192.168.120.20  netmask 255.255.255.255  broadcast 0.0.0.0
> > ether a6:17:2c:2a:72:fc  txqueuelen 3  (Ethernet)
> > RX packets 2034  bytes 286728 (280.0 KiB)
> > RX errors 0  dropped 29  overruns 0  frame 0
> > TX packets 2284  bytes 355975 (347.6 KiB)
> > TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
> 
> This seems to suggest bond0 interface is up and address-assigned
> (well, the netmask is strange).  So there would be nothing
> contradictory to what I said on the address of IPaddr2.
> 
> Anyway, you should rather be using "ip" command from iproute suite
> than various if* tools that come short in some cases:
> http://inai.de/2008/02/19
> This would also be consistent with IPaddr2 uses under the hood.

The resource agent only controls and checks
the presence of a certain IP on a certain NIC
(and some parameters).

What you likely ended up with after the "restart"
is an "empty" bonding device with that IP assigned,
but without any "slave" devices, or at least
with the slave devices still set to link down.

If you really wanted the RA to also know about the slaves,
and be able to properly and fully configure a bonding,
you'd have to enhance that resource agent.

If you want the IP to move to some other node,
if it has connectivity problems, use a "ping" and/or
"ethmonitor" resource in addition to the IP.
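
A minimal sketch of the ping-resource approach described above, using pcs (the resource name mda-ip and the gateway address are assumptions from this thread's setup; exact pcs syntax varies somewhat between versions):

```shell
# Clone a ping resource so every node monitors connectivity to the gateway
pcs resource create ping ocf:pacemaker:ping \
    host_list=192.168.120.1 dampen=5s multiplier=1000 \
    op monitor interval=10s --clone
# Keep the virtual IP only on nodes whose ping attribute is healthy
pcs constraint location mda-ip rule score=-INFINITY \
    pingd lt 1 or not_defined pingd
```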

If you wanted to test-drive cluster response against a
failing network device, your test was wrong.

If you wanted to test-drive cluster response against
a "fat fingered" (or even evil) operator or admin:
give up right there...
You'll never be able to cover it all :-)


-- 
: Lars Ellenberg
: LINBIT | Keeping the Digital World Running
: DRBD -- Heartbeat -- Corosync -- Pacemaker
: R, Integration, Ops, Consulting, Support

DRBD® and LINBIT® are registered trademarks of LINBIT


Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-19 Thread Jan Pokorný
On 19/09/16 09:15 +, Auer, Jens wrote:
> After the restart ifconfig still shows the device bond0 to be not RUNNING:
> MDA1PFP-S01 09:07:54 2127 0 ~ # ifconfig
> bond0: flags=5123<UP,BROADCAST,MASTER,MULTICAST>  mtu 1500
> inet 192.168.120.20  netmask 255.255.255.255  broadcast 0.0.0.0
> ether a6:17:2c:2a:72:fc  txqueuelen 3  (Ethernet)
> RX packets 2034  bytes 286728 (280.0 KiB)
> RX errors 0  dropped 29  overruns 0  frame 0
> TX packets 2284  bytes 355975 (347.6 KiB)
> TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

This seems to suggest bond0 interface is up and address-assigned
(well, the netmask is strange).  So there would be nothing
contradictory to what I said on the address of IPaddr2.

Anyway, you should rather be using "ip" command from iproute suite
than various if* tools that come short in some cases:
http://inai.de/2008/02/19
This would also be consistent with IPaddr2 uses under the hood.

-- 
Jan (Poki)



Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-16 Thread Jan Pokorný
On 16/09/16 11:01 -0500, Ken Gaillot wrote:
> On 09/16/2016 10:43 AM, Auer, Jens wrote:
>> thanks for the help.
>> 
>>> I'm not sure what you mean by "the device the virtual ip is attached
>>> to", but a separate question is why the resource agent reported that
>>> restarting the IP was successful, even though that device was
>>> unavailable. If the monitor failed when the device was made unavailable,
>>> I would expect the restart to fail as well.
>> 
>> I created the virtual ip with parameter nic=bond0, and this is the
>> device I am bringing down and was referring to in my question. I
>> think the current behavior is a little inconsistent. I bring down
>> the device and pacemaker recognizes this and restarts the resource.
>> However, the monitor then should fail again, but it just doesn't
>> detect any problems. 
> 
> That is odd. Pacemaker is just acting on what the resource agent
> reports, so the issue will be in the agent.

I'd note that IPaddr2 agent attempts to bring the network interface
(back) up if not already on start so this appears, perhaps against
one's liking and expectations (if putting it down is considered
a sufficiently big hammer to observe a service failover),
as a magic "self-healing" :-)

Would "rmmod " be a better hammer of choice?
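
Unloading the driver removes the interface entirely, so the agent cannot quietly bring it back up. A sketch of that "hammer" (the driver lookup via sysfs is an assumption, shown for a slave NIC; a bond0 master is a virtual device whose module is simply `bonding`):

```shell
# Find the driver module behind a physical slave NIC and unload it
driver=$(basename "$(readlink /sys/class/net/eth0/device/driver)")
modprobe -r "$driver"
# For the bonding master itself:
# modprobe -r bonding
```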

-- 
Jan (Poki)



Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-16 Thread Ken Gaillot
On 09/16/2016 10:43 AM, Auer, Jens wrote:
> Hi,
> 
> thanks for the help.
> 
>> I'm not sure what you mean by "the device the virtual ip is attached
>> to", but a separate question is why the resource agent reported that
>> restarting the IP was successful, even though that device was
>> unavailable. If the monitor failed when the device was made unavailable,
>> I would expect the restart to fail as well.
> 
> I created the virtual ip with parameter nic=bond0, and this is the device I 
> am bringing down
> and was referring to in my question. I think the current behavior is a little 
> inconsistent. I bring 
> down the device and pacemaker recognizes this and restarts the resource. 
> However, the monitor
> then should fail again, but it just doesn't detect any problems. 

That is odd. Pacemaker is just acting on what the resource agent
reports, so the issue will be in the agent. Agents are usually fairly
simple shell scripts, so you could just look at what it does, and try
running those commands manually and see what the results are.

There are also some pcs commands debug-start, debug-monitor, etc. that
run the agent directly, using the configuration from the cluster.
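
For example (resource name mda-ip and the parameter values are assumptions from this thread; --full prints the agent's trace output, and the last line shows the bare-agent invocation pattern):

```shell
# Run the configured agent operations directly through pcs
pcs resource debug-monitor mda-ip --full
pcs resource debug-start mda-ip --full
# Or invoke the agent by hand with its OCF environment
OCF_ROOT=/usr/lib/ocf OCF_RESKEY_ip=192.168.120.20 OCF_RESKEY_nic=bond0 \
    /usr/lib/ocf/resource.d/heartbeat/IPaddr2 monitor; echo "rc=$?"
```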

And you can look in the system log and pacemaker detail log around the
time of the incident for any interesting messages.

> Cheers,
>   Jens
> 
> --
> Jens Auer | CGI | Software-Engineer
> CGI (Germany) GmbH & Co. KG
> Rheinstraße 95 | 64295 Darmstadt | Germany
> T: +49 6151 36860 154
> jens.a...@cgi.com
> Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie unter 
> de.cgi.com/pflichtangaben.
> 
> 
> ________
> Von: Ken Gaillot [kgail...@redhat.com]
> Gesendet: Freitag, 16. September 2016 17:27
> An: users@clusterlabs.org
> Betreff: Re: [ClusterLabs] Virtual ip resource restarted on node with down 
> network device
> 
> On 09/16/2016 10:08 AM, Auer, Jens wrote:
>> Hi,
>>
>> I have configured an Active/Passive cluster to host a virtual ip
>> address. To test failovers, I shutdown the device the virtual ip is
>> attached to and expected that it moves to the other node. However, the
>> virtual ip is detected as FAILED, but is then restarted on the same
>> node. I was able to solve this by using a ping resource which we want to
>> do anyway, but I am wondering why the resource is restarted on the node
>> and no failure is detected anymore.
> 
> If a *node* fails, pacemaker will recover all its resources elsewhere,
> if possible.
> 
> If a *resource* fails but the node is OK, the response is configurable,
> via the "on-fail" operation option and "migration-threshold" resource
> option.
> 
> By default, on-fail=restart for monitor operations, and
> migration-threshold=INFINITY. This means that if a monitor fails,
> pacemaker will attempt to restart the resource on the same node.
> 
> To get an immediate failover of the resource, set migration-threshold=1
> on the resource.
> 
> I'm not sure what you mean by "the device the virtual ip is attached
> to", but a separate question is why the resource agent reported that
> restarting the IP was successful, even though that device was
> unavailable. If the monitor failed when the device was made unavailable,
> I would expect the restart to fail as well.
> 
>>
>> On my setup, this is very easy to reproduce:
>> 1. Start cluster with virtual ip
>> 2. On the node hosting the virtual ip, bring down the network device
>> with ifdown
>> => The resource is detected as failed
>> => The resource is restarted
>> => No failures are detected from now on
>>
>> Best wishes,
>>   Jens
>>
>> --
>> *Jens Auer *| CGI | Software-Engineer
>> CGI (Germany) GmbH & Co. KG
>> Rheinstraße 95 | 64295 Darmstadt | Germany
>> T: +49 6151 36860 154
>> _jens.auer@cgi.com_ <mailto:jens.a...@cgi.com>
>> Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie
>> unter _de.cgi.com/pflichtangaben_ <http://de.cgi.com/pflichtangaben>.


Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-16 Thread Auer, Jens
Hi,

thanks for the help.

> I'm not sure what you mean by "the device the virtual ip is attached
> to", but a separate question is why the resource agent reported that
> restarting the IP was successful, even though that device was
> unavailable. If the monitor failed when the device was made unavailable,
> I would expect the restart to fail as well.

I created the virtual ip with parameter nic=bond0, and this is the device I am 
bringing down
and was referring to in my question. I think the current behavior is a little 
inconsistent. I bring 
down the device and pacemaker recognizes this and restarts the resource. 
However, the monitor
then should fail again, but it just doesn't detect any problems. 

Cheers,
  Jens

--
Jens Auer | CGI | Software-Engineer
CGI (Germany) GmbH & Co. KG
Rheinstraße 95 | 64295 Darmstadt | Germany
T: +49 6151 36860 154
jens.a...@cgi.com
Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie unter 
de.cgi.com/pflichtangaben.


Von: Ken Gaillot [kgail...@redhat.com]
Gesendet: Freitag, 16. September 2016 17:27
An: users@clusterlabs.org
Betreff: Re: [ClusterLabs] Virtual ip resource restarted on node with down 
network device

On 09/16/2016 10:08 AM, Auer, Jens wrote:
> Hi,
>
> I have configured an Active/Passive cluster to host a virtual ip
> address. To test failovers, I shutdown the device the virtual ip is
> attached to and expected that it moves to the other node. However, the
> virtual ip is detected as FAILED, but is then restarted on the same
> node. I was able to solve this by using a ping resource which we want to
> do anyway, but I am wondering why the resource is restarted on the node
> and no failure is detected anymore.

If a *node* fails, pacemaker will recover all its resources elsewhere,
if possible.

If a *resource* fails but the node is OK, the response is configurable,
via the "on-fail" operation option and "migration-threshold" resource
option.

By default, on-fail=restart for monitor operations, and
migration-threshold=INFINITY. This means that if a monitor fails,
pacemaker will attempt to restart the resource on the same node.

To get an immediate failover of the resource, set migration-threshold=1
on the resource.

I'm not sure what you mean by "the device the virtual ip is attached
to", but a separate question is why the resource agent reported that
restarting the IP was successful, even though that device was
unavailable. If the monitor failed when the device was made unavailable,
I would expect the restart to fail as well.

>
> On my setup, this is very easy to reproduce:
> 1. Start cluster with virtual ip
> 2. On the node hosting the virtual ip, bring down the network device
> with ifdown
> => The resource is detected as failed
> => The resource is restarted
> => No failures are detected from now on
>
> Best wishes,
>   Jens
>
> --
> *Jens Auer *| CGI | Software-Engineer
> CGI (Germany) GmbH & Co. KG
> Rheinstraße 95 | 64295 Darmstadt | Germany
> T: +49 6151 36860 154
> _jens.auer@cgi.com_ <mailto:jens.a...@cgi.com>
> Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie
> unter _de.cgi.com/pflichtangaben_ <http://de.cgi.com/pflichtangaben>.


Re: [ClusterLabs] Virtual ip resource restarted on node with down network device

2016-09-16 Thread Ken Gaillot
On 09/16/2016 10:08 AM, Auer, Jens wrote:
> Hi,
> 
> I have configured an Active/Passive cluster to host a virtual ip
> address. To test failovers, I shutdown the device the virtual ip is
> attached to and expected that it moves to the other node. However, the
> virtual ip is detected as FAILED, but is then restarted on the same
> node. I was able to solve this by using a ping resource which we want to
> do anyway, but I am wondering why the resource is restarted on the node
> and no failure is detected anymore.

If a *node* fails, pacemaker will recover all its resources elsewhere,
if possible.

If a *resource* fails but the node is OK, the response is configurable,
via the "on-fail" operation option and "migration-threshold" resource
option.

By default, on-fail=restart for monitor operations, and
migration-threshold=INFINITY. This means that if a monitor fails,
pacemaker will attempt to restart the resource on the same node.

To get an immediate failover of the resource, set migration-threshold=1
on the resource.
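
With pcs that would look roughly like the following (resource name mda-ip is an assumption):

```shell
# Fail over to another node after the first monitor failure,
# instead of restarting the resource in place
pcs resource meta mda-ip migration-threshold=1
```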

I'm not sure what you mean by "the device the virtual ip is attached
to", but a separate question is why the resource agent reported that
restarting the IP was successful, even though that device was
unavailable. If the monitor failed when the device was made unavailable,
I would expect the restart to fail as well.

> 
> On my setup, this is very easy to reproduce:
> 1. Start cluster with virtual ip
> 2. On the node hosting the virtual ip, bring down the network device
> with ifdown
> => The resource is detected as failed
> => The resource is restarted
> => No failures are detected from now on
> 
> Best wishes,
>   Jens
> 
> --
> *Jens Auer *| CGI | Software-Engineer
> CGI (Germany) GmbH & Co. KG
> Rheinstraße 95 | 64295 Darmstadt | Germany
> T: +49 6151 36860 154
> _jens.auer@cgi.com_ 
> Unsere Pflichtangaben gemäß § 35a GmbHG / §§ 161, 125a HGB finden Sie
> unter _de.cgi.com/pflichtangaben_ .
