Re: [ClusterLabs] IPaddr2, interval between unsolicited ARP packets

2016-10-04 Thread Shinjiro Hamaguchi
Matsushima-san

Thank you very much for your reply.
And sorry for late reply.


>Do you get same result by executing the command manually with different
>parameters like this?
I tried the following command, but got the same result (a 1-second interval):

 [command used to send unsolicited arp]
/usr/libexec/heartbeat/send_arp -i 1500 -r 8 eth0 192.168.12.215 auto
not_used not_used

 [result of tcpdump]
04:31:50.475928 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28
04:31:51.476053 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28
04:31:52.476146 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28
04:31:53.476246 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28
04:31:54.476287 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28
04:31:55.476406 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28
04:31:56.476448 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28
04:31:57.476572 ARP, Request who-has 192.168.12.215 (Broadcast) tell
192.168.12.215, length 28

>Please also make sure the PID file has been created properly.
When I ran send_arp manually, I did not use the "-p" option.

Even when I failed over the IPaddr2 resource (rather than running send_arp
manually), no PID file appeared in /var/run/resource-agents/.
I used the following command to watch for the PID file:

watch -n0.1 "ls -la /var/run/resource-agents/"
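
For reference, the same ARP behaviour can also be tuned on the IPaddr2
resource itself instead of invoking send_arp by hand. A minimal sketch,
assuming a resource named vip on a /24 network (arp_interval, arp_count and
arp_bg are parameters documented in the IPaddr2 agent metadata):

# create (or "pcs resource update" an existing) VIP with explicit ARP settings
pcs resource create vip ocf:heartbeat:IPaddr2 \
    ip=192.168.12.215 cidr_netmask=24 \
    arp_interval=200 arp_count=5 arp_bg=true \
    op monitor interval=30s

# show the parameters the agent will actually use
pcs resource show vip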


Thank you in advance.


On Wed, Oct 5, 2016 at 12:20 PM, Digimer  wrote:

>
>
>
>  Forwarded Message 
> Subject: Re: [ClusterLabs] IPaddr2, interval between unsolicited ARP
> packets
> Date: Tue, 4 Oct 2016 11:18:37 +0900
> From: Takehiro Matsushima 
> Reply-To: Cluster Labs - All topics related to open-source clustering
> welcomed 
> To: Cluster Labs - All topics related to open-source clustering welcomed
> 
>
> Hello Hamaguchi-san,
>
> Do you get same result by executing the command manually with
> different parameters like this?
> #/usr/libexec/heartbeat/send_arp -i 1500 -r 8 -p
> /var/run/resource-agents/send_arp-192.168.12.215 eth0 192.168.12.215
> auto not_used not_used
>
> Please also make sure the PID file has been created properly.
>
> Thank you,
>
> Takehiro MATSUSHIMA
>
> 2016-10-03 14:45 GMT+09:00 Shinjiro Hamaguchi :
> > Hello everyone!!
> >
> >
> > I'm using IPaddr2 for VIP.
> >
> > In the IPaddr2 document, it says the interval between unsolicited ARP
> > packets defaults to 200 msec and can be changed with the "-i" option, but
> > when I check with tcpdump, it looks like ARP is sent at a fixed 1000 msec
> > interval.
> >
> > Does someone have any idea ?
> >
> > Thank you in advance.
> >
> >
> > [environment]
> > kvm, centOS 6.8
> > pacemaker-1.1.14-8.el6_8.1.x86_64
> > cman-3.0.12.1-78.el6.x86_64
> > resource-agents-3.9.5-34.el6_8.2.x86_64
> >
> >
> > [command used to send unsolicited arp]
> > NOTE: i got this command from /var/log/cluster/corosync.log
> > #/usr/libexec/heartbeat/send_arp -i 200 -r 5 -p
> > /var/run/resource-agents/send_arp-192.168.12.215 eth0 192.168.12.215
> auto
> > not_used not_used
> >
> > [result of tcpdump]
> >
> > #tcpdump arp
> >
> > tcpdump: verbose output suppressed, use -v or -vv for full protocol
> decode
> >
> > listening on eth0, link-type EN10MB (Ethernet), capture size 65535 bytes
> >
> > 05:28:17.267296 ARP, Request who-has 192.168.12.215 (Broadcast) tell
> > 192.168.12.215, length 28
> >
> > 05:28:18.267519 ARP, Request who-has 192.168.12.215 (Broadcast) tell
> > 192.168.12.215, length 28
> >
> > 05:28:19.267638 ARP, Request who-has 192.168.12.215 (Broadcast) tell
> > 192.168.12.215, length 28
> >
> > 05:28:20.267715 ARP, Request who-has 192.168.12.215 (Broadcast) tell
> > 192.168.12.215, length 28
> >
> > 05:28:21.267801 ARP, Request who-has 192.168.12.215 (Broadcast) tell
> > 192.168.12.215, length 28
> >
> >
> >


Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Digimer
On 04/10/16 07:50 PM, Israel Brewster wrote:
> On Oct 4, 2016, at 3:38 PM, Digimer  wrote:
>>
>> On 04/10/16 07:09 PM, Israel Brewster wrote:
>>> On Oct 4, 2016, at 3:03 PM, Digimer  wrote:

 On 04/10/16 06:50 PM, Israel Brewster wrote:
> On Oct 4, 2016, at 2:26 PM, Ken Gaillot  > wrote:
>>
>> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>>> I sent this a week ago, but never got a response, so I'm sending it
>>> again in the hopes that it just slipped through the cracks. It seems to
>>> me that this should just be a simple mis-configuration on my part
>>> causing the issue, but I suppose it could be a bug as well.
>>>
>>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>>> 6.8. One cluster is simply sharing an IP, while the other one has
>>> numerous services and IP's set up between the two machines in the
>>> cluster. Both appear to be working fine. However, I was poking around
>>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>>> and fenced were using "significant" amounts of processing power - 25%
>>> for corosync on the current primary node, with fenced and stonithd often
>>> showing 1-2% (not horrible, but more than any other process). In looking
>>> at my logs, I see that they are dumping messages like the following to
>>> the messages log every second or two:
>>>
>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>>> No match for //@st_delegate in /st-reply
>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>>> Operation reboot of fai-dbs1 by fai-dbs2 for
>>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>>> stonith_admin.cman.15835
>>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>>> fai-dbs2 (reset)
>>
>> The above shows that CMAN is asking pacemaker to fence a node. Even
>> though fencing is disabled in pacemaker itself, CMAN is configured to
>> use pacemaker for fencing (fence_pcmk).
>
> I never did any specific configuring of CMAN, Perhaps that's the
> problem? I missed some configuration steps on setup? I just followed the
> directions
> here: 
> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
> which disabled stonith in pacemaker via the
> "pcs property set stonith-enabled=false" command. Is there separate CMAN
> configs I need to do to get everything copacetic? If so, can you point
> me to some sort of guide/tutorial for that?

 Disabling stonith is not possible in cman, and very ill advised in
 pacemaker. This is a mistake a lot of "tutorials" make when the author
 doesn't understand the role of fencing.

 In your case, pcs setup cman to use the fence_pcmk "passthrough" fence
 agent, as it should. So when something went wrong, corosync detected it,
 informed cman which then requested pacemaker to fence the peer. With
 pacemaker not having stonith configured and enabled, it could do
 nothing. So pacemaker returned that the fence failed and cman went into
 an infinite loop trying again and again to fence (as it should have).

 You must configure stonith (exactly how depends on your hardware), then
 enable stonith in pacemaker.

>>>
>>> Gotcha. There is nothing special about the hardware, it's just two physical 
>>> boxes connected to the network. So I guess I've got a choice of either a) 
>>> live with the logging/load situation (since the system does work perfectly 
>>> as-is other than the excessive logging), or b) spend some time researching 
>>> stonith to figure out what it does and how to configure it. Thanks for the 
>>> pointers.
>>
>> The system is not working perfectly. Consider it like this; You're
>> flying, and your landing gears are busted. You think everything is fine
>> because you're not trying to land yet.
> 
> Ok, good analogy :-)
> 
>>
>> Fencing is needed to force a node that has entered into an unknown state
>> into a known state (usually 'off'). It does this by reaching out over
>> some independent mechanism, like IPMI or a switched PDU, and forcing the
>> target to shut down.
> 
> Yeah, I don't want that. If one of the nodes enters an unknown state, I want 
> the system to notify me so I can decide the proper course of action - I don't 
> want it to simply shut down the other machine or something.

You do, actually. If a node isn't readily disposable, you need to
rethink your HA strategy. The service you're protecting is what matters,
not the 

Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Israel Brewster
On Oct 4, 2016, at 3:38 PM, Digimer  wrote:
> 
> On 04/10/16 07:09 PM, Israel Brewster wrote:
>> On Oct 4, 2016, at 3:03 PM, Digimer  wrote:
>>> 
>>> On 04/10/16 06:50 PM, Israel Brewster wrote:
 On Oct 4, 2016, at 2:26 PM, Ken Gaillot > wrote:
> 
> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>> I sent this a week ago, but never got a response, so I'm sending it
>> again in the hopes that it just slipped through the cracks. It seems to
>> me that this should just be a simple mis-configuration on my part
>> causing the issue, but I suppose it could be a bug as well.
>> 
>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>> 6.8. One cluster is simply sharing an IP, while the other one has
>> numerous services and IP's set up between the two machines in the
>> cluster. Both appear to be working fine. However, I was poking around
>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>> and fenced were using "significant" amounts of processing power - 25%
>> for corosync on the current primary node, with fenced and stonithd often
>> showing 1-2% (not horrible, but more than any other process). In looking
>> at my logs, I see that they are dumping messages like the following to
>> the messages log every second or two:
>> 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>> Operation reboot of fai-dbs1 by fai-dbs2 for
>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>> stonith_admin.cman.15835
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>> fai-dbs2 (reset)
> 
> The above shows that CMAN is asking pacemaker to fence a node. Even
> though fencing is disabled in pacemaker itself, CMAN is configured to
> use pacemaker for fencing (fence_pcmk).
 
 I never did any specific configuring of CMAN, Perhaps that's the
 problem? I missed some configuration steps on setup? I just followed the
 directions
 here: 
 http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
 which disabled stonith in pacemaker via the
 "pcs property set stonith-enabled=false" command. Is there separate CMAN
 configs I need to do to get everything copacetic? If so, can you point
 me to some sort of guide/tutorial for that?
>>> 
>>> Disabling stonith is not possible in cman, and very ill advised in
>>> pacemaker. This is a mistake a lot of "tutorials" make when the author
>>> doesn't understand the role of fencing.
>>> 
>>> In your case, pcs setup cman to use the fence_pcmk "passthrough" fence
>>> agent, as it should. So when something went wrong, corosync detected it,
>>> informed cman which then requested pacemaker to fence the peer. With
>>> pacemaker not having stonith configured and enabled, it could do
>>> nothing. So pacemaker returned that the fence failed and cman went into
>>> an infinite loop trying again and again to fence (as it should have).
>>> 
>>> You must configure stonith (exactly how depends on your hardware), then
>>> enable stonith in pacemaker.
>>> 
>> 
>> Gotcha. There is nothing special about the hardware, it's just two physical 
>> boxes connected to the network. So I guess I've got a choice of either a) 
>> live with the logging/load situation (since the system does work perfectly 
>> as-is other than the excessive logging), or b) spend some time researching 
>> stonith to figure out what it does and how to configure it. Thanks for the 
>> pointers.
> 
> The system is not working perfectly. Consider it like this; You're
> flying, and your landing gears are busted. You think everything is fine
> because you're not trying to land yet.

Ok, good analogy :-)

> 
> Fencing is needed to force a node that has entered into an unknown state
> into a known state (usually 'off'). It does this by reaching out over
> some independent mechanism, like IPMI or a switched PDU, and forcing the
> target to shut down.

Yeah, I don't want that. If one of the nodes enters an unknown state, I want 
the system to notify me so I can decide the proper course of action - I don't 
want it to simply shut down the other machine or something.

> This is also why I said that your hardware matters.
> Do your nodes have IPMI? (or iRMC, iLO, DRAC, RSA, etc)?

I *might* have IPMI. I know my newer servers do. I'll have to check on that.

> 
> If you don't need to coordinate actions between the nodes, you don't
> need HA software, just run things everywhere all the time. If, however,
> you do need to coordinate actions, then you need fencing.

Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Digimer
On 04/10/16 07:09 PM, Israel Brewster wrote:
> On Oct 4, 2016, at 3:03 PM, Digimer  wrote:
>>
>> On 04/10/16 06:50 PM, Israel Brewster wrote:
>>> On Oct 4, 2016, at 2:26 PM, Ken Gaillot >> > wrote:

 On 10/04/2016 11:31 AM, Israel Brewster wrote:
> I sent this a week ago, but never got a response, so I'm sending it
> again in the hopes that it just slipped through the cracks. It seems to
> me that this should just be a simple mis-configuration on my part
> causing the issue, but I suppose it could be a bug as well.
>
> I have two two-node clusters set up using corosync/pacemaker on CentOS
> 6.8. One cluster is simply sharing an IP, while the other one has
> numerous services and IP's set up between the two machines in the
> cluster. Both appear to be working fine. However, I was poking around
> today, and I noticed that on the single IP cluster, corosync, stonithd,
> and fenced were using "significant" amounts of processing power - 25%
> for corosync on the current primary node, with fenced and stonithd often
> showing 1-2% (not horrible, but more than any other process). In looking
> at my logs, I see that they are dumping messages like the following to
> the messages log every second or two:
>
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
> Operation reboot of fai-dbs1 by fai-dbs2 for
> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
> stonith_admin.cman.15835
> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
> fai-dbs2 (reset)

 The above shows that CMAN is asking pacemaker to fence a node. Even
 though fencing is disabled in pacemaker itself, CMAN is configured to
 use pacemaker for fencing (fence_pcmk).
>>>
>>> I never did any specific configuring of CMAN, Perhaps that's the
>>> problem? I missed some configuration steps on setup? I just followed the
>>> directions
>>> here: 
>>> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
>>> which disabled stonith in pacemaker via the
>>> "pcs property set stonith-enabled=false" command. Is there separate CMAN
>>> configs I need to do to get everything copacetic? If so, can you point
>>> me to some sort of guide/tutorial for that?
>>
>> Disabling stonith is not possible in cman, and very ill advised in
>> pacemaker. This is a mistake a lot of "tutorials" make when the author
>> doesn't understand the role of fencing.
>>
>> In your case, pcs setup cman to use the fence_pcmk "passthrough" fence
>> agent, as it should. So when something went wrong, corosync detected it,
>> informed cman which then requested pacemaker to fence the peer. With
>> pacemaker not having stonith configured and enabled, it could do
>> nothing. So pacemaker returned that the fence failed and cman went into
>> an infinite loop trying again and again to fence (as it should have).
>>
>> You must configure stonith (exactly how depends on your hardware), then
>> enable stonith in pacemaker.
>>
> 
> Gotcha. There is nothing special about the hardware, it's just two physical 
> boxes connected to the network. So I guess I've got a choice of either a) 
> live with the logging/load situation (since the system does work perfectly 
> as-is other than the excessive logging), or b) spend some time researching 
> stonith to figure out what it does and how to configure it. Thanks for the 
> pointers.

The system is not working perfectly. Consider it like this; You're
flying, and your landing gears are busted. You think everything is fine
because you're not trying to land yet.

Fencing is needed to force a node that has entered into an unknown state
into a known state (usually 'off'). It does this by reaching out over
some independent mechanism, like IPMI or a switched PDU, and forcing the
target to shut down. This is also why I said that your hardware matters.
Do your nodes have IPMI? (or iRMC, iLO, DRAC, RSA, etc)?
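
If they do, the BMC can be exercised by hand before wiring it into the
cluster; a sketch with fence_ipmilan against a hypothetical BMC address and
credentials:

fence_ipmilan -a 192.0.2.11 -l admin -p secret -o status   # query power state only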

If you don't need to coordinate actions between the nodes, you don't
need HA software, just run things everywhere all the time. If, however,
you do need to coordinate actions, then you need fencing.

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?


Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Digimer
On 04/10/16 06:50 PM, Israel Brewster wrote:
> On Oct 4, 2016, at 2:26 PM, Ken Gaillot  > wrote:
>>
>> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>>> I sent this a week ago, but never got a response, so I'm sending it
>>> again in the hopes that it just slipped through the cracks. It seems to
>>> me that this should just be a simple mis-configuration on my part
>>> causing the issue, but I suppose it could be a bug as well.
>>>
>>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>>> 6.8. One cluster is simply sharing an IP, while the other one has
>>> numerous services and IP's set up between the two machines in the
>>> cluster. Both appear to be working fine. However, I was poking around
>>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>>> and fenced were using "significant" amounts of processing power - 25%
>>> for corosync on the current primary node, with fenced and stonithd often
>>> showing 1-2% (not horrible, but more than any other process). In looking
>>> at my logs, I see that they are dumping messages like the following to
>>> the messages log every second or two:
>>>
>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>>> No match for //@st_delegate in /st-reply
>>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>>> Operation reboot of fai-dbs1 by fai-dbs2 for
>>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>>> stonith_admin.cman.15835
>>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>>> fai-dbs2 (reset)
>>
>> The above shows that CMAN is asking pacemaker to fence a node. Even
>> though fencing is disabled in pacemaker itself, CMAN is configured to
>> use pacemaker for fencing (fence_pcmk).
> 
> I never did any specific configuring of CMAN, Perhaps that's the
> problem? I missed some configuration steps on setup? I just followed the
> directions
> here: 
> http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
> which disabled stonith in pacemaker via the
> "pcs property set stonith-enabled=false" command. Is there separate CMAN
> configs I need to do to get everything copacetic? If so, can you point
> me to some sort of guide/tutorial for that?

Disabling stonith is not possible in cman, and very ill advised in
pacemaker. This is a mistake a lot of "tutorials" make when the author
doesn't understand the role of fencing.

In your case, pcs set up cman to use the fence_pcmk "passthrough" fence
agent, as it should. So when something went wrong, corosync detected it,
informed cman which then requested pacemaker to fence the peer. With
pacemaker not having stonith configured and enabled, it could do
nothing. So pacemaker returned that the fence failed and cman went into
an infinite loop trying again and again to fence (as it should have).

You must configure stonith (exactly how depends on your hardware), then
enable stonith in pacemaker.
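
If the hardware does have IPMI (or iLO/DRAC/etc.), a minimal sketch of what
that might look like with pcs -- the BMC addresses and credentials below are
placeholders, not values from this thread:

pcs stonith create fence-dbs1 fence_ipmilan pcmk_host_list="fai-dbs1" \
    ipaddr="192.0.2.11" login="admin" passwd="secret" lanplus="1" \
    op monitor interval=60s
pcs stonith create fence-dbs2 fence_ipmilan pcmk_host_list="fai-dbs2" \
    ipaddr="192.0.2.12" login="admin" passwd="secret" lanplus="1" \
    op monitor interval=60s
pcs property set stonith-enabled=true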

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without
access to education?



Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Israel Brewster
On Oct 4, 2016, at 2:26 PM, Ken Gaillot  wrote:
> 
> On 10/04/2016 11:31 AM, Israel Brewster wrote:
>> I sent this a week ago, but never got a response, so I'm sending it
>> again in the hopes that it just slipped through the cracks. It seems to
>> me that this should just be a simple mis-configuration on my part
>> causing the issue, but I suppose it could be a bug as well.
>> 
>> I have two two-node clusters set up using corosync/pacemaker on CentOS
>> 6.8. One cluster is simply sharing an IP, while the other one has
>> numerous services and IP's set up between the two machines in the
>> cluster. Both appear to be working fine. However, I was poking around
>> today, and I noticed that on the single IP cluster, corosync, stonithd,
>> and fenced were using "significant" amounts of processing power - 25%
>> for corosync on the current primary node, with fenced and stonithd often
>> showing 1-2% (not horrible, but more than any other process). In looking
>> at my logs, I see that they are dumping messages like the following to
>> the messages log every second or two:
>> 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
>> Operation reboot of fai-dbs1 by fai-dbs2 for
>> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
>> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
>> stonith_admin.cman.15835
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
>> fai-dbs2 (reset)
> 
> The above shows that CMAN is asking pacemaker to fence a node. Even
> though fencing is disabled in pacemaker itself, CMAN is configured to
> use pacemaker for fencing (fence_pcmk).

I never did any specific configuration of CMAN; perhaps that's the problem?
Did I miss some configuration steps during setup? I just followed the
directions here:
http://jensd.be/156/linux/building-a-high-available-failover-cluster-with-pacemaker-corosync-pcs,
which disabled stonith in pacemaker via the "pcs property set
stonith-enabled=false" command. Are there separate CMAN configs I need to do
to get everything copacetic? If so, can you point me to some sort of
guide/tutorial for that?

> 
>> Sep 27 08:51:50 fai-dbs1 stonith_admin[15394]:   notice: crm_log_args:
>> Invoked: stonith_admin --reboot fai-dbs2 --tolerance 5s --tag cman 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: handle_request:
>> Client stonith_admin.cman.15394.2a97d89d wants to fence (reboot)
>> 'fai-dbs2' with device '(any)'
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
>> initiate_remote_stonith_op: Initiating remote operation reboot for
>> fai-dbs2: bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae (0)
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
>> stonith_choose_peer: Couldn't find anyone to fence fai-dbs2 with 
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
>> No match for //@st_delegate in /st-reply
>> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:error: remote_op_done:
>> Operation reboot of fai-dbs2 by fai-dbs1 for
>> stonith_admin.cman.15394@fai-dbs1.bc3f5d73: No such device
>> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
>> Peer fai-dbs2 was not terminated (reboot) by fai-dbs1 for fai-dbs1: No
>> such device (ref=bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae) by client
>> stonith_admin.cman.15394
>> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Call to fence fai-dbs2
>> (reset) failed with rc=237
>> 
>> After seeing this on the one cluster, I checked the logs on the other
>> and sure enough I'm seeing the same thing there. As I mentioned, both
>> nodes in both clusters *appear* to be operating correctly. For example,
>> the output of "pcs status" on the small cluster is this:
>> 
>> [root@fai-dbs1 ~]# pcs status
>> Cluster name: dbs_cluster
>> Last updated: Tue Sep 27 08:59:44 2016
>> Last change: Thu Mar  3 06:11:00 2016
>> Stack: cman
>> Current DC: fai-dbs1 - partition with quorum
>> Version: 1.1.11-97629de
>> 2 Nodes configured
>> 1 Resources configured
>> 
>> 
>> Online: [ fai-dbs1 fai-dbs2 ]
>> 
>> Full list of resources:
>> 
>> virtual_ip(ocf::heartbeat:IPaddr2):Started fai-dbs1
>> 
>> And on the larger cluster, it has services running across both nodes of
>> the cluster, and I've been able to move stuff back and forth without
>> issue. Both nodes have the stonith-enabled property set to false, and
>> no-quorum-policy set to ignore (since they are only two nodes in the
>> cluster).
>> 
>> What could be causing the log messages? Is the CPU usage normal, or
>> might there be something I can do about that as well? Thanks.
> 
> It's not normal; most likely, the failed fencing is being retried endlessly.

Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-10-04 Thread Andrew Beekhof
On Wed, Oct 5, 2016 at 7:03 AM, Ken Gaillot  wrote:

> On 10/02/2016 10:02 PM, Andrew Beekhof wrote:
> >> Take a
> >> look at all of nagios' options for deciding when a failure becomes
> "real".
> >
> > I used to take a very hard line on this: if you don't want the cluster
> > to do anything about an error, don't tell us about it.
> > However I'm slowly changing my position... the reality is that many
> > people do want a heads up in advance and we have been forcing that
> > policy (when does an error become real) into the agents where one size
> > must fit all.
> >
> > So I'm now generally in favour of having the PE handle this "somehow".
>
> Nagios is a useful comparison:
>
> check_interval - like pacemaker's monitor interval
>
> retry_interval - if a check returns failure, switch to this interval
> (i.e. check more frequently when trying to decide whether it's a "hard"
> failure)
>
> max_check_attempts - if a check fails this many times in a row, it's a
> hard failure. Before this is reached, it's considered a soft failure.
> Nagios will call event handlers (comparable to pacemaker's alert agents)
> for both soft and hard failures (distinguishing the two). A service is
> also considered to have a "hard failure" if its host goes down.
>
> high_flap_threshold/low_flap_threshold - a service is considered to be
> flapping when its percent of state changes (ok <-> not ok) in the last
> 21 checks (= max. 20 state changes) reaches high_flap_threshold, and
> stable again once the percentage drops to low_flap_threshold. To put it
> another way, a service that passes every monitor is 0% flapping, and a
> service that fails every other monitor is 100% flapping. With these,
> even if a service never reaches max_check_attempts failures in a row, an
> alert can be sent if it's repeatedly failing and recovering.
>

makes sense.

since we're overhauling this functionality anyway, do you think we need to
add an equivalent of retry_interval too?


>
> >> If you clear failures after a success, you can't detect/recover a
> >> resource that is flapping.
> >
> > Ah, but you can if the thing you're clearing only applies to other
> > failures of the same action.
> > A completed start doesn't clear a previously failed monitor.
>
> Nope -- a monitor can alternately succeed and fail repeatedly, and that
> indicates a problem, but wouldn't trip an "N-failures-in-a-row" system.
>
> >> It only makes sense to escalate from ignore -> restart -> hard, so maybe
> >> something like:
> >>
> >>   op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
> >>
> > I would favour something more concrete than 'soft' and 'hard' here.
> > Do they have a sufficiently obvious meaning outside of us developers?
> >
> > Perhaps (with or without a "failures-" prefix) :
> >
> >ignore-count
> >recover-count
> >escalation-policy
>
> I think the "soft" vs "hard" terminology is somewhat familiar to
> sysadmins -- there's at least nagios, email (SPF failures and bounces),
> and ECC RAM. But throwing "ignore" into the mix does confuse things.
>
> How about ... max-fail-ignore=3 max-fail-restart=2 fail-escalation=ban
>
>
I could live with that :-)


Re: [ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Ken Gaillot
On 10/04/2016 11:31 AM, Israel Brewster wrote:
> I sent this a week ago, but never got a response, so I'm sending it
> again in the hopes that it just slipped through the cracks. It seems to
> me that this should just be a simple mis-configuration on my part
> causing the issue, but I suppose it could be a bug as well.
> 
> I have two two-node clusters set up using corosync/pacemaker on CentOS
> 6.8. One cluster is simply sharing an IP, while the other one has
> numerous services and IP's set up between the two machines in the
> cluster. Both appear to be working fine. However, I was poking around
> today, and I noticed that on the single IP cluster, corosync, stonithd,
> and fenced were using "significant" amounts of processing power - 25%
> for corosync on the current primary node, with fenced and stonithd often
> showing 1-2% (not horrible, but more than any other process). In looking
> at my logs, I see that they are dumping messages like the following to
> the messages log every second or two:
> 
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done:
> Operation reboot of fai-dbs1 by fai-dbs2 for
> stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
> Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No
> such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client
> stonith_admin.cman.15835
> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence
> fai-dbs2 (reset)

The above shows that CMAN is asking pacemaker to fence a node. Even
though fencing is disabled in pacemaker itself, CMAN is configured to
use pacemaker for fencing (fence_pcmk).

> Sep 27 08:51:50 fai-dbs1 stonith_admin[15394]:   notice: crm_log_args:
> Invoked: stonith_admin --reboot fai-dbs2 --tolerance 5s --tag cman 
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: handle_request:
> Client stonith_admin.cman.15394.2a97d89d wants to fence (reboot)
> 'fai-dbs2' with device '(any)'
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
> initiate_remote_stonith_op: Initiating remote operation reboot for
> fai-dbs2: bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae (0)
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice:
> stonith_choose_peer: Couldn't find anyone to fence fai-dbs2 with 
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object:
> No match for //@st_delegate in /st-reply
> Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:error: remote_op_done:
> Operation reboot of fai-dbs2 by fai-dbs1 for
> stonith_admin.cman.15394@fai-dbs1.bc3f5d73: No such device
> Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify:
> Peer fai-dbs2 was not terminated (reboot) by fai-dbs1 for fai-dbs1: No
> such device (ref=bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae) by client
> stonith_admin.cman.15394
> Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Call to fence fai-dbs2
> (reset) failed with rc=237
> 
> After seeing this on the one cluster, I checked the logs on the other
> and sure enough I'm seeing the same thing there. As I mentioned, both
> nodes in both clusters *appear* to be operating correctly. For example,
> the output of "pcs status" on the small cluster is this:
> 
> [root@fai-dbs1 ~]# pcs status
> Cluster name: dbs_cluster
> Last updated: Tue Sep 27 08:59:44 2016
> Last change: Thu Mar  3 06:11:00 2016
> Stack: cman
> Current DC: fai-dbs1 - partition with quorum
> Version: 1.1.11-97629de
> 2 Nodes configured
> 1 Resources configured
> 
> 
> Online: [ fai-dbs1 fai-dbs2 ]
> 
> Full list of resources:
> 
>  virtual_ip(ocf::heartbeat:IPaddr2):Started fai-dbs1
> 
> And on the larger cluster, it has services running across both nodes of
> the cluster, and I've been able to move stuff back and forth without
> issue. Both nodes have the stonith-enabled property set to false, and
> no-quorum-policy set to ignore (since they are only two nodes in the
> cluster).
> 
> What could be causing the log messages? Is the CPU usage normal, or
> might there be something I can do about that as well? Thanks.

It's not normal; most likely, the failed fencing is being retried endlessly.

You'll want to figure out why CMAN is asking for fencing. You may have
some sort of communication problem between the nodes (that might be a
factor in corosync's CPU usage, too).
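
A few of the usual first checks for membership problems on a CentOS 6
cman/pacemaker stack -- a sketch, to be run on both nodes:

cman_tool status                 # quorum and membership as cman sees it
cman_tool nodes                  # per-node join state
fence_tool ls                    # state of the fence domain
grep -iE 'totem|fenc' /var/log/cluster/corosync.log | tail -50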

Once that's straightened out, it's a good idea to actually configure and
enable fencing :)


> 
> ---
> Israel Brewster
> Systems Analyst II
> Ravn Alaska
> 5245 Airport Industrial Rd
> Fairbanks, AK 99709
> (907) 450-7293
> ---


[ClusterLabs] Corosync ring shown faulty between healthy nodes & networks (rrp_mode: passive)

2016-10-04 Thread Martin Schlegel
Hello all,

I am trying to understand the results of the following two Corosync heartbeat
ring failure scenarios I have been testing, and I hope somebody can explain
whether this behaviour makes sense.


Consider the following cluster:

* 3x Nodes: A, B and C
* 2x NICs for each Node
* Corosync 2.3.5 configured with "rrp_mode: passive" and 
  udpu transport with ring id 0 and 1 on each node.
* On each node "corosync-cfgtool -s" shows:
[...] ring 0 active with no faults
[...] ring 1 active with no faults
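
For reference, a minimal corosync.conf sketch matching that description (the
networks and addresses below are placeholders, not values taken from the post):

totem {
    version: 2
    transport: udpu
    rrp_mode: passive

    interface {
        ringnumber: 0
        bindnetaddr: 10.0.0.0     # network of the first NIC (placeholder)
        mcastport: 5405
    }
    interface {
        ringnumber: 1
        bindnetaddr: 10.1.0.0     # network of the second NIC (placeholder)
        mcastport: 5405
    }
}

nodelist {
    node {
        ring0_addr: 10.0.0.1
        ring1_addr: 10.1.0.1
        nodeid: 1
    }
    # ... equivalent entries for nodes B and C ...
}

Once a blocked NIC is unblocked again, "corosync-cfgtool -r" re-enables a ring
that has been marked FAULTY.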


Consider the following scenarios:

1. On node A only, block all communication on the first NIC (configured with
   ring id 0).
2. On node A only, block all communication on all NICs (configured with
   ring ids 0 and 1).


The result of the above scenarios is as follows:

1. Nodes A, B and C (!) display the following ring status:
[...] Marking ringid 0 interface  FAULTY
[...] ring 1 active with no faults
2. Node A is shown as OFFLINE - B and C display the following ring status:
[...] ring 0 active with no faults
[...] ring 1 active with no faults


Questions:
1. Is this the expected outcome?
2. In experiment 1, B and C can still communicate with each other over both
   NICs, so why are B and C not displaying a "no faults" status for ring ids
   0 and 1, just as they do in experiment 2 when node A is completely
   unreachable?


Regards,
Martin Schlegel



[ClusterLabs] changing constraints and checking quorum at the same time

2016-10-04 Thread Christopher Harvey
I was wondering if it is possible to ask pacemaker to add a resource
constraint and make sure that the majority of the cluster sees this
constraint modification or fail if quorum is not achieved.

This is from within the context of a program issuing pacemaker commands,
not an operator, so race conditions are my main concern.

If I cannot "set-and-check" a constraint modification using pacemaker
alone, is there some kind of sequence of bash commands that I could run
that would guarantee adequate propagation of my constraint?

Thanks,
Chris
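
For context, a sketch of the individual pieces such a script would typically
combine (resource and node names are placeholders, and checking quorum
separately from the CIB update does not by itself close the race described
above):

corosync-quorumtool -s           # quorum as corosync sees it
crm_node -q                      # prints 1 if this node is in the quorate partition
pcs constraint location my_rsc prefers node1=100
cibadmin -Q -o constraints       # run on the other nodes to confirm propagation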



Re: [ClusterLabs] RFC: allowing soft recovery attempts before ignore/block/etc.

2016-10-04 Thread Ken Gaillot
On 10/02/2016 10:02 PM, Andrew Beekhof wrote:
>> Take a
>> look at all of nagios' options for deciding when a failure becomes "real".
> 
> I used to take a very hard line on this: if you don't want the cluster
> to do anything about an error, don't tell us about it.
> However I'm slowly changing my position... the reality is that many
> people do want a heads up in advance and we have been forcing that
> policy (when does an error become real) into the agents where one size
> must fit all.
> 
> So I'm now generally in favour of having the PE handle this "somehow".

Nagios is a useful comparison:

check_interval - like pacemaker's monitor interval

retry_interval - if a check returns failure, switch to this interval
(i.e. check more frequently when trying to decide whether it's a "hard"
failure)

max_check_attempts - if a check fails this many times in a row, it's a
hard failure. Before this is reached, it's considered a soft failure.
Nagios will call event handlers (comparable to pacemaker's alert agents)
for both soft and hard failures (distinguishing the two). A service is
also considered to have a "hard failure" if its host goes down.

high_flap_threshold/low_flap_threshold - a service is considered to be
flapping when its percent of state changes (ok <-> not ok) in the last
21 checks (= max. 20 state changes) reaches high_flap_threshold, and
stable again once the percentage drops to low_flap_threshold. To put it
another way, a service that passes every monitor is 0% flapping, and a
service that fails every other monitor is 100% flapping. With these,
even if a service never reaches max_check_attempts failures in a row, an
alert can be sent if it's repeatedly failing and recovering.

>> If you clear failures after a success, you can't detect/recover a
>> resource that is flapping.
> 
> Ah, but you can if the thing you're clearing only applies to other
> failures of the same action.
> A completed start doesn't clear a previously failed monitor.

Nope -- a monitor can alternately succeed and fail repeatedly, and that
indicates a problem, but wouldn't trip an "N-failures-in-a-row" system.

>> It only makes sense to escalate from ignore -> restart -> hard, so maybe
>> something like:
>>
>>   op monitor ignore-fail=3 soft-fail=2 on-hard-fail=ban
>>
> I would favour something more concrete than 'soft' and 'hard' here.
> Do they have a sufficiently obvious meaning outside of us developers?
> 
> Perhaps (with or without a "failures-" prefix) :
> 
>ignore-count
>recover-count
>escalation-policy

I think the "soft" vs "hard" terminology is somewhat familiar to
sysadmins -- there's at least nagios, email (SPF failures and bounces),
and ECC RAM. But throwing "ignore" into the mix does confuse things.

How about ... max-fail-ignore=3 max-fail-restart=2 fail-escalation=ban
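
For comparison, a sketch of the failure-handling knobs that already exist in
Pacemaker 1.1 and that the proposal above would extend (assuming a resource
named my_rsc):

pcs resource update my_rsc meta migration-threshold=3 failure-timeout=300s
pcs resource update my_rsc op monitor interval=10s on-fail=restart
# migration-threshold: ban the resource from the node after 3 failures
# failure-timeout: forget the fail count after 300s without new failures
# on-fail: action on a failed monitor (restart, block, stop, fence, ...)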




[ClusterLabs] stonithd/fenced filling up logs

2016-10-04 Thread Israel Brewster
I sent this a week ago, but never got a response, so I'm sending it again in the
hopes that it just slipped through the cracks. It seems to me that this should
just be a simple mis-configuration on my part causing the issue, but I suppose
it could be a bug as well.

I have two two-node clusters set up using corosync/pacemaker on CentOS 6.8. One
cluster is simply sharing an IP, while the other one has numerous services and
IPs set up between the two machines in the cluster. Both appear to be working
fine. However, I was poking around today, and I noticed that on the single-IP
cluster, corosync, stonithd, and fenced were using "significant" amounts of
processing power - 25% for corosync on the current primary node, with fenced
and stonithd often showing 1-2% (not horrible, but more than any other
process). In looking at my logs, I see that they are dumping messages like the
following to the messages log every second or two:

Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object: No match for //@st_delegate in /st-reply
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: remote_op_done: Operation reboot of fai-dbs1 by fai-dbs2 for stonith_admin.cman.15835@fai-dbs2.c5161517: No such device
Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify: Peer fai-dbs1 was not terminated (reboot) by fai-dbs2 for fai-dbs2: No such device (ref=c5161517-c0cc-42e5-ac11-1d55f7749b05) by client stonith_admin.cman.15835
Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Requesting Pacemaker fence fai-dbs2 (reset)
Sep 27 08:51:50 fai-dbs1 stonith_admin[15394]:   notice: crm_log_args: Invoked: stonith_admin --reboot fai-dbs2 --tolerance 5s --tag cman
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: handle_request: Client stonith_admin.cman.15394.2a97d89d wants to fence (reboot) 'fai-dbs2' with device '(any)'
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: initiate_remote_stonith_op: Initiating remote operation reboot for fai-dbs2: bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae (0)
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:   notice: stonith_choose_peer: Couldn't find anyone to fence fai-dbs2 with
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:  warning: get_xpath_object: No match for //@st_delegate in /st-reply
Sep 27 08:51:50 fai-dbs1 stonith-ng[4851]:    error: remote_op_done: Operation reboot of fai-dbs2 by fai-dbs1 for stonith_admin.cman.15394@fai-dbs1.bc3f5d73: No such device
Sep 27 08:51:50 fai-dbs1 crmd[4855]:   notice: tengine_stonith_notify: Peer fai-dbs2 was not terminated (reboot) by fai-dbs1 for fai-dbs1: No such device (ref=bc3f5d73-57bd-4aff-a94c-f9978aa5c3ae) by client stonith_admin.cman.15394
Sep 27 08:51:50 fai-dbs1 fence_pcmk[15393]: Call to fence fai-dbs2 (reset) failed with rc=237

After seeing this on the one cluster, I checked the logs on the other and sure
enough I'm seeing the same thing there. As I mentioned, both nodes in both
clusters *appear* to be operating correctly. For example, the output of "pcs
status" on the small cluster is this:

[root@fai-dbs1 ~]# pcs status
Cluster name: dbs_cluster
Last updated: Tue Sep 27 08:59:44 2016
Last change: Thu Mar  3 06:11:00 2016
Stack: cman
Current DC: fai-dbs1 - partition with quorum
Version: 1.1.11-97629de
2 Nodes configured
1 Resources configured


Online: [ fai-dbs1 fai-dbs2 ]

Full list of resources:

 virtual_ip (ocf::heartbeat:IPaddr2): Started fai-dbs1

And on the larger cluster, it has services running across both nodes of the
cluster, and I've been able to move stuff back and forth without issue. Both
nodes have the stonith-enabled property set to false, and no-quorum-policy set
to ignore (since there are only two nodes in the cluster).

What could be causing the log messages? Is the CPU usage normal, or might there
be something I can do about that as well? Thanks.
---
Israel Brewster
Systems Analyst II
Ravn Alaska
5245 Airport Industrial Rd
Fairbanks, AK 99709
(907) 450-7293
---




[ClusterLabs] Antw: Sudden Change of 300GB in FileSystem after Pacemaker Restart?

2016-10-04 Thread Ulrich Windl
>>> Eric Robinson wrote on 04.10.2016 at 09:24 in
message


> The filesystem on my corosync+pacemaker cluster is 1TiB in size and was 95% 
> full, with only 54GB available. Drbd was UpToDate/UpToDate. 
> 
> I restarted Pacemaker, and after that my filesystem now shows 49% full with 
> 300GB+ free space.
> 
> I checked and there does not seem to be any data missing. All MySQL 
> databases are up to date.
> 
> Can anyone think of a reason that the filesystem numbers would change so 
> dramatically when all I did was restart Pacemaker?

An open file had been "removed" (unlinked), so the space was only actually
released when the process holding it open died?
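
That is straightforward to check the next time it happens; a sketch of how to
spot space held by deleted-but-still-open files before restarting anything
(the mount point is a placeholder):

lsof -a +L1 /mnt/data    # open files on that filesystem whose link count is 0 (deleted)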

> 
> --
> Eric Robinson
> 
> 




