Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-05 Thread Ken Gaillot
On 10/05/2016 11:56 AM, Israel Brewster wrote:
>> On Oct 4, 2016, at 4:06 PM, Digimer wrote:
>>
>> On 04/10/16 07:50 PM, Israel Brewster wrote:
>>> On Oct 4, 2016, at 3:38 PM, Digimer wrote:

 On 04/10/16 07:09 PM, Israel Brewster wrote:
> On Oct 4, 2016, at 3:03 PM, Digimer wrote:
>>
>> On 04/10/16 06:50 PM, Israel Brewster wrote:
>>> On Oct 4, 2016, at 2:26 PM, Ken Gaillot wrote:

 On 10/04/2016 11:31 AM, Israel Brewster wrote:
> I sent this a week ago, but never got a response, so I'm sending it
> again in the hopes that it just slipped through the cracks. It seems to
> me that this should just be a simple mis-configuration on my part
> causing the issue, but I suppose it could be a bug as well.
>
> I have two two-node clusters set up using corosync/pacemaker on CentOS
> 6.8. One cluster is simply sharing an IP, while the other one has
> numerous services and IP's set up between the two machines in the
> cluster. Both appear to be working fine. However, I was poking around
> today, and I noticed that on the single IP cluster, corosync, stonithd,
> and fenced were using "significant" amounts of processing power - 25%
> for corosync on the current primary node, with fenced and stonithd often
> showing 1-2% (not horrible, but more than any other process). In looking
> at my logs, I see that they are dumping messages like the following to
> the messages log every second or two:
>
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
> Operation reboot of fai-dbs1 by fai-dbs2 for
> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
> stonith_admin.cman.15835
> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
> fai-dbs2 (reset)

 The above shows that CMAN is asking pacemaker to fence a node. Even
 though fencing is disabled in pacemaker itself, CMAN is configured to
 use pacemaker for fencing (fence_pcmk).
>>>
>>> I never did any specific configuring of CMAN. Perhaps that's the
>>> problem? Did I miss some configuration steps on setup? I just followed
>>> the directions here:
>>> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
>>> which disabled stonith in pacemaker via the
>>> "pcs property set stonith-enabled=false" command. Are there separate
>>> CMAN configs I need to do to get everything copacetic? If so, can you
>>> point me to some sort of guide/tutorial for that?

If you ran "pcs cluster setup", it configured CMAN for you. Normally you
don't need to modify those values, but you can see them in
/etc/cluster/cluster.conf.
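
For reference, on CentOS 6 pcs normally wires CMAN to pacemaker by declaring a
fence_pcmk device for each node in that file. A trimmed sketch of what the
generated cluster.conf typically looks like (node names follow the logs above;
the exact attributes on your systems may differ):

  <cluster config_version="1" name="dbs_cluster">
    <clusternodes>
      <clusternode name="fai-dbs1" nodeid="1">
        <fence>
          <method name="pcmk-method">
            <device name="pcmk-redirect" port="fai-dbs1"/>
          </method>
        </fence>
      </clusternode>
      <clusternode name="fai-dbs2" nodeid="2">
        <fence>
          <method name="pcmk-method">
            <device name="pcmk-redirect" port="fai-dbs2"/>
          </method>
        </fence>
      </clusternode>
    </clusternodes>
    <fencedevices>
      <fencedevice agent="fence_pcmk" name="pcmk-redirect"/>
    </fencedevices>
    <cman expected_votes="1" two_node="1"/>
  </cluster>

Those fence_pcmk entries are why every CMAN fence request ends up in the
stonith-ng logs: CMAN hands the job to pacemaker, which has nowhere to send it
while stonith is disabled there.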

>> Disabling stonith is not possible in cman, and very ill-advised in
>> pacemaker. This is a mistake a lot of "tutorials" make when the author
>> doesn't understand the role of fencing.
>>
>> In your case, pcs set up cman to use the fence_pcmk "passthrough" fence
>> agent, as it should. So when something went wrong, corosync detected it,
>> informed cman which then requested pacemaker to fence the peer. With
>> pacemaker not having stonith configured and enabled, it could do
>> nothing. So pacemaker returned that the fence failed and cman went into
>> an infinite loop trying again and again to fence (as it should have).
>>
>> You must configure stonith (exactly how depends on your hardware), then
>> enable stonith in pacemaker.
>>
>
> Gotcha. There is nothing special about the hardware, it's just two
> physical boxes connected to the network. So I guess I've got a
> choice of either a) live with the logging/load situation (since the
> system does work perfectly as-is other than the excessive logging),
> or b) spend some time researching stonith to figure out what it
> does and how to configure it. Thanks for the pointers.

 The system is not working perfectly. Consider it like this: you're
 flying, and your landing gear is busted. You think everything is fine
 because you're not trying to land yet.

Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-05 Thread Dimitri Maziuk
On 10/05/2016 12:19 PM, Digimer wrote:

> Explain why this is a bad idea, because I don't see anything wrong with it.

My point exactly.
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-05 Thread Digimer
On 05/10/16 01:14 PM, Dimitri Maziuk wrote:
> On 10/05/2016 11:56 AM, Israel Brewster wrote:
> 
>> As you say, though, this is something I'll simply need to get over if I want 
>> real HA
> 
> The sad truth is making simple stupid stuff that Just Works(tm) is not
> cool. Making stuff that will run a cluster of 1001 randomly mixed
> active, somewhat-active, mostly-passive, etc. nodes, power-off anything
> it doesn't like, when that fails: fence it with the Lights-Out
> Management System Du Jour, when that fails: turn the power off at the
> networked PDUs... and bring you warmed-up slippers in the morning, now
> that's cool.

If you have "1001 randomly mixed ..." services, you might want to break
that up into smaller clusters. Also, iLO, DRAC, iRMC, RSA... they're all
basically IPMI plus some vendor features. Not sure why you'd refer to
them as "System Du Jour"...

> And when you ask: if there are only two nodes and one can't talk to the
> other, how does it know that it's the other node and not itself that
> needs to be fenced? The "cool" developers answer: well, we just add a
> delay so they don't try to fence each other at the same time.
> 
> D'oh.
> 

Explain why this is a bad idea, because I don't see anything wrong with it.
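
For context, the standard way to avoid a fence duel in a two-node cluster is a
static delay on one fence device: the device that targets the node you want to
survive waits a few seconds, so if both nodes request fencing at the same
moment, the undelayed device wins the race. A minimal sketch with pcs, assuming
a stonith resource named fence-dbs1 already exists for fai-dbs1 (the name and
the 15-second value are illustrative):

  # make the device that fences fai-dbs1 wait 15s before acting,
  # so fai-dbs1 survives a simultaneous fence race
  pcs stonith update fence-dbs1 delay=15

Outside of a true split, the delay just adds a few seconds before a genuinely
dead node is powered off.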

> I think your problem is CentOS 6. Either switch to 7, or ditch pacemaker
> and go with heartbeat in haresources mode plus mon and a little Perl
> scripting. I'm running both; on the haresources version I get about one
> instance of the scary split brain per two cluster-years, and almost all
> of them are caused by me doing something stupid.

That is an insane recommendation. Heartbeat has been deprecated for many
years. There is no plan to restart development, either. Meanwhile,
CentOS/RHEL 6 is perfectly fine and stable and will be supported until
at least 2020.

https://alteeve.ca/w/History_of_HA_Clustering

"scary split brain per 2 cluster/years"

Split-brains are about the worst thing that can happen in HA. At the
very best, you lose your services. At worst, you corrupt your data. Why
risk that at all when fencing solves the problem perfectly fine?

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Digimer
On 04/10/16 07:50 PM, Israel Brewster wrote:
> On Oct 4, 2016, at 3:38 PM, Digimer  wrote:
>>
>> On 04/10/16 07:09 PM, Israel Brewster wrote:
>>> On Oct 4, 2016, at 3:03 PM, Digimer  wrote:

 On 04/10/16 06:50 PM, Israel Brewster wrote:
> On Oct 4, 2016, at 2:26 PM, Ken Gaillot wrote:
>>
>> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>>> I sent this a week ago, but never got a response, so I'm sending it
>>> again in the hopes that it just slipped through the cracks. It seems to
>>> me that this should just be a simple mis-configuration on my part
>>> causing the issue, but I suppose it could be a bug as well.
>>>
>>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>>> 6.8. One cluster is simply sharing an IP, while the other one has
>>> numerous services and IP's set up between the two machines in the
>>> cluster. Both appear to be working fine. However, I was poking around
>>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>>> and fenced were using "significant" amounts of processing power - 25%
>>> for corosync on the current primary node, with fenced and stonithd often
>>> showing 1-2% (not horrible, but more than any other process). In looking
>>> at my logs, I see that they are dumping messages like the following to
>>> the messages log every second or two:
>>>
>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>>> No match for //@st_delegate in /st-reply
>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>>> Operation reboot of fai-dbs1 by fai-dbs2 for
>>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>>> stonith_admin.cman.15835
>>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>>> fai-dbs2 (reset)
>>
>> The above shows that CMAN is asking pacemaker to fence a node. Even
>> though fencing is disabled in pacemaker itself, CMAN is configured to
>> use pacemaker for fencing (fence_pcmk).
>
> I never did any specific configuring of CMAN. Perhaps that's the
> problem? Did I miss some configuration steps on setup? I just followed
> the directions here:
> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
> which disabled stonith in pacemaker via the
> "pcs property set stonith-enabled=false" command. Are there separate
> CMAN configs I need to do to get everything copacetic? If so, can you
> point me to some sort of guide/tutorial for that?

 Disabling stonith is not possible in cman, and very ill-advised in
 pacemaker. This is a mistake a lot of "tutorials" make when the author
 doesn't understand the role of fencing.

 In your case, pcs set up cman to use the fence_pcmk "passthrough" fence
 agent, as it should. So when something went wrong, corosync detected it,
 informed cman which then requested pacemaker to fence the peer. With
 pacemaker not having stonith configured and enabled, it could do
 nothing. So pacemaker returned that the fence failed and cman went into
 an infinite loop trying again and again to fence (as it should have).

 You must configure stonith (exactly how depends on your hardware), then
 enable stonith in pacemaker.

>>>
>>> Gotcha. There is nothing special about the hardware, it's just two physical 
>>> boxes connected to the network. So I guess I've got a choice of either a) 
>>> live with the logging/load situation (since the system does work perfectly 
>>> as-is other than the excessive logging), or b) spend some time researching 
>>> stonith to figure out what it does and how to configure it. Thanks for the 
>>> pointers.
>>
>> The system is not working perfectly. Consider it like this: you're
>> flying, and your landing gear is busted. You think everything is fine
>> because you're not trying to land yet.
> 
> Ok, good analogy :-)
> 
>>
>> Fencing is needed to force a node that has entered an unknown state
>> into a known state (usually 'off'). It does this by reaching out over
>> some independent mechanism, like IPMI or a switched PDU, and forcing the
>> target to shut down.
> 
> Yeah, I don't want that. If one of the nodes enters an unknown state, I want 
> the system to notify me so I can decide the proper course of action - I don't 
> want it to simply shut down the other machine or something.

You do, actually. If a node isn't readily disposable, you need to
rethink your HA strategy. The service you're protecting is what matters,
not the node.

Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Israel Brewster
On Oct 4, 2016, at 3:38 PM, Digimer  wrote:
> 
> On 04/10/16 07:09 PM, Israel Brewster wrote:
>> On Oct 4, 2016, at 3:03 PM, Digimer  wrote:
>>> 
>>> On 04/10/16 06:50 PM, Israel Brewster wrote:
 On Oct 4, 2016, at 2:26 PM, Ken Gaillot wrote:
> 
> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>> I sent this a week ago, but never got a response, so I'm sending it
>> again in the hopes that it just slipped through the cracks. It seems to
>> me that this should just be a simple mis-configuration on my part
>> causing the issue, but I suppose it could be a bug as well.
>> 
>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>> 6.8. One cluster is simply sharing an IP, while the other one has
>> numerous services and IP's set up between the two machines in the
>> cluster. Both appear to be working fine. However, I was poking around
>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>> and fenced were using "significant" amounts of processing power - 25%
>> for corosync on the current primary node, with fenced and stonithd often
>> showing 1-2% (not horrible, but more than any other process). In looking
>> at my logs, I see that they are dumping messages like the following to
>> the messages log every second or two:
>> 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>> Operation reboot of fai-dbs1 by fai-dbs2 for
>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>> stonith_admin.cman.15835
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>> fai-dbs2 (reset)
> 
> The above shows that CMAN is asking pacemaker to fence a node. Even
> though fencing is disabled in pacemaker itself, CMAN is configured to
> use pacemaker for fencing (fence_pcmk).
 
 I never did any specific configuring of CMAN. Perhaps that's the
 problem? Did I miss some configuration steps on setup? I just followed
 the directions here:
 http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
 which disabled stonith in pacemaker via the
 "pcs property set stonith-enabled=false" command. Are there separate
 CMAN configs I need to do to get everything copacetic? If so, can you
 point me to some sort of guide/tutorial for that?
>>> 
>>> Disabling stonith is not possible in cman, and very ill-advised in
>>> pacemaker. This is a mistake a lot of "tutorials" make when the author
>>> doesn't understand the role of fencing.
>>> 
>>> In your case, pcs set up cman to use the fence_pcmk "passthrough" fence
>>> agent, as it should. So when something went wrong, corosync detected it,
>>> informed cman which then requested pacemaker to fence the peer. With
>>> pacemaker not having stonith configured and enabled, it could do
>>> nothing. So pacemaker returned that the fence failed and cman went into
>>> an infinite loop trying again and again to fence (as it should have).
>>> 
>>> You must configure stonith (exactly how depends on your hardware), then
>>> enable stonith in pacemaker.
>>> 
>> 
>> Gotcha. There is nothing special about the hardware, it's just two physical 
>> boxes connected to the network. So I guess I've got a choice of either a) 
>> live with the logging/load situation (since the system does work perfectly 
>> as-is other than the excessive logging), or b) spend some time researching 
>> stonith to figure out what it does and how to configure it. Thanks for the 
>> pointers.
> 
> The system is not working perfectly. Consider it like this: you're
> flying, and your landing gear is busted. You think everything is fine
> because you're not trying to land yet.

Ok, good analogy :-)

> 
> Fencing is needed to force a node that has entered an unknown state
> into a known state (usually 'off'). It does this by reaching out over
> some independent mechanism, like IPMI or a switched PDU, and forcing the
> target to shut down.

Yeah, I don't want that. If one of the nodes enters an unknown state, I want 
the system to notify me so I can decide the proper course of action - I don't 
want it to simply shut down the other machine or something.

> This is also why I said that your hardware matters.
> Do your nodes have IPMI? (or iRMC, iLO, DRAC, RSA, etc)?

I *might* have IPMI. I know my newer servers do. I'll have to check on that.
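
A quick way to check from the OS, assuming the stock CentOS 6 IPMI tooling is
installed (package names and the channel number below are the usual defaults,
adjust as needed):

  yum install -y OpenIPMI ipmitool   # BMC kernel drivers plus the CLI
  service ipmi start                 # load the IPMI drivers
  ipmitool lan print 1               # prints the BMC's LAN settings, if a BMC exists
  ipmitool chassis status            # sanity check that the BMC responds

If "ipmitool lan print" returns an IP address, the box has a BMC that a
fence_ipmilan stonith resource can talk to; if the drivers refuse to load, the
hardware probably has no IPMI and something like a switched PDU would be
needed instead.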

> 
> If you don't need to coordinate actions between the nodes, you don't
> need HA software, just run things everywhere all the time. If, however,
> you do need to coordinate actions, then you need fencing.

Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Digimer
On 04/10/16 07:09 PM, Israel Brewster wrote:
> On Oct 4, 2016, at 3:03 PM, Digimer  wrote:
>>
>> On 04/10/16 06:50 PM, Israel Brewster wrote:
>>> On Oct 4, 2016, at 2:26 PM, Ken Gaillot wrote:

 On 10/04/2016 11:31 AM, Israel Brewster wrote:
> I sent this a week ago, but never got a response, so I'm sending it
> again in the hopes that it just slipped through the cracks. It seems to
> me that this should just be a simple mis-configuration on my part
> causing the issue, but I suppose it could be a bug as well.
>
> I have two two-node clusters set up using corosync/pacemaker on CentOS
> 6.8. One cluster is simply sharing an IP, while the other one has
> numerous services and IP's set up between the two machines in the
> cluster. Both appear to be working fine. However, I was poking around
> today, and I noticed that on the single IP cluster, corosync, stonithd,
> and fenced were using "significant" amounts of processing power - 25%
> for corosync on the current primary node, with fenced and stonithd often
> showing 1-2% (not horrible, but more than any other process). In looking
> at my logs, I see that they are dumping messages like the following to
> the messages log every second or two:
>
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
> Operation reboot of fai-dbs1 by fai-dbs2 for
> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
> stonith_admin.cman.15835
> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
> fai-dbs2 (reset)

 The above shows that CMAN is asking pacemaker to fence a node. Even
 though fencing is disabled in pacemaker itself, CMAN is configured to
 use pacemaker for fencing (fence_pcmk).
>>>
>>> I never did any specific configuring of CMAN. Perhaps that's the
>>> problem? Did I miss some configuration steps on setup? I just followed
>>> the directions here:
>>> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
>>> which disabled stonith in pacemaker via the
>>> "pcs property set stonith-enabled=false" command. Are there separate
>>> CMAN configs I need to do to get everything copacetic? If so, can you
>>> point me to some sort of guide/tutorial for that?
>>
>> Disabling stonith is not possible in cman, and very ill-advised in
>> pacemaker. This is a mistake a lot of "tutorials" make when the author
>> doesn't understand the role of fencing.
>>
>> In your case, pcs set up cman to use the fence_pcmk "passthrough" fence
>> agent, as it should. So when something went wrong, corosync detected it,
>> informed cman which then requested pacemaker to fence the peer. With
>> pacemaker not having stonith configured and enabled, it could do
>> nothing. So pacemaker returned that the fence failed and cman went into
>> an infinite loop trying again and again to fence (as it should have).
>>
>> You must configure stonith (exactly how depends on your hardware), then
>> enable stonith in pacemaker.
>>
> 
> Gotcha. There is nothing special about the hardware, it's just two physical 
> boxes connected to the network. So I guess I've got a choice of either a) 
> live with the logging/load situation (since the system does work perfectly 
> as-is other than the excessive logging), or b) spend some time researching 
> stonith to figure out what it does and how to configure it. Thanks for the 
> pointers.

The system is not working perfectly. Consider it like this: you're
flying, and your landing gear is busted. You think everything is fine
because you're not trying to land yet.

Fencing is needed to force a node that has entered an unknown state
into a known state (usually 'off'). It does this by reaching out over
some independent mechanism, like IPMI or a switched PDU, and forcing the
target to shut down. This is also why I said that your hardware matters.
Do your nodes have IPMI? (or iRMC, iLO, DRAC, RSA, etc)?

If you don't need to coordinate actions between the nodes, you don't
need HA software, just run things everywhere all the time. If, however,
you do need to coordinate actions, then you need fencing.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?


Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Digimer
On 04/10/16 06:50 PM, Israel Brewster wrote:
> On Oct 4, 2016, at 2:26 PM, Ken Gaillot wrote:
>>
>> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>>> I sent this a week ago, but never got a response, so I'm sending it
>>> again in the hopes that it just slipped through the cracks. It seems to
>>> me that this should just be a simple mis-configuration on my part
>>> causing the issue, but I suppose it could be a bug as well.
>>>
>>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>>> 6.8. One cluster is simply sharing an IP, while the other one has
>>> numerous services and IP's set up between the two machines in the
>>> cluster. Both appear to be working fine. However, I was poking around
>>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>>> and fenced were using "significant" amounts of processing power - 25%
>>> for corosync on the current primary node, with fenced and stonithd often
>>> showing 1-2% (not horrible, but more than any other process). In looking
>>> at my logs, I see that they are dumping messages like the following to
>>> the messages log every second or two:
>>>
>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>>> No match for //@st_delegate in /st-reply
>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>>> Operation reboot of fai-dbs1 by fai-dbs2 for
>>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>>> stonith_admin.cman.15835
>>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>>> fai-dbs2 (reset)
>>
>> The above shows that CMAN is asking pacemaker to fence a node. Even
>> though fencing is disabled in pacemaker itself, CMAN is configured to
>> use pacemaker for fencing (fence_pcmk).
> 
> I never did any specific configuring of CMAN. Perhaps that's the
> problem? Did I miss some configuration steps on setup? I just followed
> the directions here:
> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
> which disabled stonith in pacemaker via the
> "pcs property set stonith-enabled=false" command. Are there separate
> CMAN configs I need to do to get everything copacetic? If so, can you
> point me to some sort of guide/tutorial for that?

Disabling stonith is not possible in cman, and very ill-advised in
pacemaker. This is a mistake a lot of "tutorials" make when the author
doesn't understand the role of fencing.

In your case, pcs set up cman to use the fence_pcmk "passthrough" fence
agent, as it should. So when something went wrong, corosync detected it,
informed cman which then requested pacemaker to fence the peer. With
pacemaker not having stonith configured and enabled, it could do
nothing. So pacemaker returned that the fence failed and cman went into
an infinite loop trying again and again to fence (as it should have).

You must configure stonith (exactly how depends on your hardware), then
enable stonith in pacemaker.
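
As a rough sketch of what that can look like with IPMI-capable hardware on
CentOS 6 (resource names, addresses and credentials below are placeholders,
and fence_ipmilan options vary somewhat between versions):

  # one stonith resource per node, each pointing at that node's BMC
  pcs stonith create fence-dbs1 fence_ipmilan pcmk_host_list="fai-dbs1" \
      ipaddr="192.0.2.11" login="admin" passwd="secret" lanplus="1" \
      op monitor interval=60s
  pcs stonith create fence-dbs2 fence_ipmilan pcmk_host_list="fai-dbs2" \
      ipaddr="192.0.2.12" login="admin" passwd="secret" lanplus="1" \
      op monitor interval=60s

  # only after both devices are verified working:
  pcs property set stonith-enabled=true

It is worth testing each device during a maintenance window, for example with
stonith_admin --reboot <node> (the same call CMAN is making in the logs above),
before relying on it.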

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Israel Brewster
On Oct 4, 2016, at 2:26 PM, Ken Gaillot  wrote:
> 
> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>> I sent this a week ago, but never got a response, so I'm sending it
>> again in the hopes that it just slipped through the cracks. It seems to
>> me that this should just be a simple mis-configuration on my part
>> causing the issue, but I suppose it could be a bug as well.
>> 
>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>> 6.8. One cluster is simply sharing an IP, while the other one has
>> numerous services and IP's set up between the two machines in the
>> cluster. Both appear to be working fine. However, I was poking around
>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>> and fenced were using "significant" amounts of processing power - 25%
>> for corosync on the current primary node, with fenced and stonithd often
>> showing 1-2% (not horrible, but more than any other process). In looking
>> at my logs, I see that they are dumping messages like the following to
>> the messages log every second or two:
>> 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>> Operation reboot of fai-dbs1 by fai-dbs2 for
>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>> stonith_admin.cman.15835
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>> fai-dbs2 (reset)
> 
> The above shows that CMAN is asking pacemaker to fence a node. Even
> though fencing is disabled in pacemaker itself, CMAN is configured to
> use pacemaker for fencing (fence_pcmk).

I never did any specific configuring of CMAN. Perhaps that's the problem? Did I
miss some configuration steps on setup? I just followed the directions here:
http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
which disabled stonith in pacemaker via the "pcs property set
stonith-enabled=false" command. Are there separate CMAN configs I need to do to
get everything copacetic? If so, can you point me to some sort of
guide/tutorial for that?

> 
>> Sep 27 08:51:50 fai-dbs1 stonith_admin[15394]:   notice: crm_log_args:
>> Invoked: stonith_admin --reboot fai-dbs2 --tolerance 5s --tag cman 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: handle_request:
>> Client stonith_admin.cman.15394.2a97d89d wants to fence (reboot)
>> 'fai-dbs2' with device '(any)'
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
>> initiate_remote_stonith_op: Initiating remote operation reboot for
>> fai-dbs2: bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae (0)
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
>> stonith_choose_peer: Couldn't find anyone to fence fai-dbs2 with 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:error: remote_op_done:
>> Operation reboot of fai-dbs2 by fai-dbs1 for
>> stonith_admin.cman.15394@fai-dbs1.bc3f5d73: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs2 was not terminated (reboot) by fai-dbs1 for fai-dbs1: No
>> such device (ref=bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae) by client
>> stonith_admin.cman.15394
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Call to fence fai-dbs2
>> (reset) failed with rc=237
>> 
>> After seeing this on the one cluster, I checked the logs on the other
>> and sure enough I'm seeing the same thing there. As I mentioned, both
>> nodes in both clusters *appear* to be operating correctly. For example,
>> the output of "pcs status" on the small cluster is this:
>> 
>> [root@fai-dbs1 ~]# pcs status
>> Cluster name: dbs_cluster
>> Last updated: Tue Sep 27 08:59:44 2016
>> Last change: Thu Mar  3 06:11:00 2016
>> Stack: cman
>> Current DC: fai-dbs1 - partition with quorum
>> Version: 1.1.11-97629de
>> 2 Nodes configured
>> 1 Resources configured
>> 
>> 
>> Online: [ fai-dbs1 fai-dbs2 ]
>> 
>> Full list of resources:
>> 
>> virtual_ip (ocf::heartbeat:IPaddr2): Started fai-dbs1
>> 
>> And on the larger cluster, it has services running across both nodes of
>> the cluster, and I've been able to move stuff back and forth without
>> issue. Both nodes have the stonith-enabled property set to false, and
>> no-quorum-policy set to ignore (since they are only two nodes in the
>> cluster).
>> 
>> What could be causing the log messages? Is the CPU usage normal, or
>> might there be something I can do about that as well? Thanks.
> 
> It's not normal; most likely, the failed fencing is being retried endlessly.

Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Ken Gaillot
On 10/04/2016 11:31 AM, Israel Brewster wrote:
> I sent this a week ago, but never got a response, so I'm sending it
> again in the hopes that it just slipped through the cracks. It seems to
> me that this should just be a simple mis-configuration on my part
> causing the issue, but I suppose it could be a bug as well.
> 
> I have two two-node clusters set up using corosync/pacemaker on CentOS
> 6.8. One cluster is simply sharing an IP, while the other one has
> numerous services and IP's set up between the two machines in the
> cluster. Both appear to be working fine. However, I was poking around
> today, and I noticed that on the single IP cluster, corosync, stonithd,
> and fenced were using "significant" amounts of processing power - 25%
> for corosync on the current primary node, with fenced and stonithd often
> showing 1-2% (not horrible, but more than any other process). In looking
> at my logs, I see that they are dumping messages like the following to
> the messages log every second or two:
> 
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
> Operation reboot of fai-dbs1 by fai-dbs2 for
> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
> stonith_admin.cman.15835
> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
> fai-dbs2 (reset)

The above shows that CMAN is asking pacemaker to fence a node. Even
though fencing is disabled in pacemaker itself, CMAN is configured to
use pacemaker for fencing (fence_pcmk).

> Sep 27 08:51:50 fai-dbs1 stonith_admin[15394]:   notice: crm_log_args:
> Invoked: stonith_admin --reboot fai-dbs2 --tolerance 5s --tag cman 
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: handle_request:
> Client stonith_admin.cman.15394.2a97d89d wants to fence (reboot)
> 'fai-dbs2' with device '(any)'
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> fai-dbs2: bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae (0)
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
> stonith_choose_peer: Couldn't find anyone to fence fai-dbs2 with 
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:error: remote_op_done:
> Operation reboot of fai-dbs2 by fai-dbs1 for
> stonith_admin.cman.15394@fai-dbs1.bc3f5d73: No such device
> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
> Peer fai-dbs2 was not terminated (reboot) by fai-dbs1 for fai-dbs1: No
> such device (ref=bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae) by client
> stonith_admin.cman.15394
> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Call to fence fai-dbs2
> (reset) failed with rc=237
> 
> After seeing this on the one cluster, I checked the logs on the other
> and sure enough I'm seeing the same thing there. As I mentioned, both
> nodes in both clusters *appear* to be operating correctly. For example,
> the output of "pcs status" on the small cluster is this:
> 
> [root@fai-dbs1 ~]# pcs status
> Cluster name: dbs_cluster
> Last updated: Tue Sep 27 08:59:44 2016
> Last change: Thu Mar  3 06:11:00 2016
> Stack: cman
> Current DC: fai-dbs1 - partition with quorum
> Version: 1.1.11-97629de
> 2 Nodes configured
> 1 Resources configured
> 
> 
> Online: [ fai-dbs1 fai-dbs2 ]
> 
> Full list of resources:
> 
>  virtual_ip (ocf::heartbeat:IPaddr2): Started fai-dbs1
> 
> And on the larger cluster, it has services running across both nodes of
> the cluster, and I've been able to move stuff back and forth without
> issue. Both nodes have the stonith-enabled property set to false, and
> no-quorum-policy set to ignore (since they are only two nodes in the
> cluster).
> 
> What could be causing the log messages? Is the CPU usage normal, or
> might there be something I can do about that as well? Thanks.

It's not normal; most likely, the failed fencing is being retried endlessly.

You'll want to figure out why CMAN is asking for fencing. You may have
some sort of communication problem between the nodes (that might be a
factor in corosync's CPU usage, too).
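
A few standard commands can help narrow that down on a CMAN/corosync stack
(run them on both nodes; these are the stock CentOS 6 tools):

  cman_tool status       # quorum state, expected votes, cluster name
  cman_tool nodes        # does each node still see the other as a member?
  corosync-cfgtool -s    # ring status; each ring should report "no faults"
  grep -i fence /var/log/messages | tail -50   # when the fence requests started

If the rings show faults or membership keeps flapping, sort out the network
side (multicast vs. unicast, firewall rules) before worrying about the fencing
loop itself.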

Once that's straightened out, it's a good idea to actually configure and
enable fencing :)


> 
> ---
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> ---
