Re: [ClusterLabs] Antw: Behavior after stop action failure with the failure-timeout set and STONITH disabled

2017-05-08 Thread Ken Gaillot
On 05/05/2017 07:49 AM, Jan Wrona wrote:
> On 5.5.2017 08:15, Ulrich Windl wrote:
>> Jan Wrona wrote on 04.05.2017 at 16:41 in message:
>>> I hope I'll be able to explain the problem clearly and correctly.
>>>
>>> My setup (simplified): I have two cloned resources, a filesystem mount
>>> and a process which writes to that filesystem. The filesystem is Gluster
>>> so it's OK to clone it. I also have a mandatory ordering constraint
>>> "start gluster-mount-clone then start writer-process-clone". I don't
>>> have a STONITH device, so I've disabled STONITH by setting
>>> stonith-enabled=false.
>>>
>>> The problem: Sometimes the Gluster freezes for a while, which causes the
>>> gluster-mount resource's monitor with OCF_CHECK_LEVEL=20 to time out
>>> (it is unable to write the status file). When this happens, the cluster

Have you tried increasing the monitor timeouts?
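For reference, a minimal sketch of raising that monitor's timeout with
pcs (the interval/timeout values here are hypothetical; crmsh has an
equivalent):

   # give the OCF_CHECK_LEVEL=20 monitor more headroom for short freezes
   pcs resource update gluster-mount op monitor interval=30s timeout=120s \
       OCF_CHECK_LEVEL=20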

>> Actually I would do two things:
>>
>> 1) Find out why Gluster freezes, and what to do to avoid that
> 
> It freezes when one of the underlying MD RAIDs starts its regular check.
> I've decreased its speed limit (from the default 200 MB/s to 50 MB/s, I
> cannot go any lower), but it helped only a little; the mount still tends
> to freeze for a few seconds during the check.
> 
>>
>> 2) Implement stonith
> 
> Currently I can't. But AFAIK Pacemaker should work properly even with
> STONITH disabled, and the state I've run into doesn't seem right to me at
> all. I was asking for clarification of what the cluster is trying to do
> in such a situation. I don't understand the "Ignoring expired calculated
> failure" log messages, and I don't understand why crm_mon was showing
> that the writer-process is started even though it was not.

Pacemaker can work without stonith, but there are certain failure
situations that can't be recovered any other way, so whether that's
working "properly" is a matter of opinion. :-) In this particular case,
stonith doesn't make the situation much better -- you want to prevent
the need for stonith to begin with (hopefully increasing the monitor
timeouts is sufficient). But stonith is still good to have for other
situations.

The cluster shows the service as started because it determines the state
from the service's operation history:

   successful start at time A = started
   successful start at time A + failed stop at time B = started (failed)
   after failure expires, back to: successful start at time A = started

If the service is not actually running at that point, the next recurring
monitor should detect that.
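To see the operation history the cluster is reasoning from, you can query
the status section of the CIB; a sketch (the XPath form assumes a
reasonably recent cibadmin):

   # dump the recorded operation results (operation, rc-code, last-rc-change)
   # that Pacemaker uses to infer the resource's current state
   cibadmin --query --xpath "//lrm_resource[@id='writer-process']"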

>> Regards,
>> Ulrich
>>
>>
>>> tries to recover by restarting the writer-process resource. But the
>>> writer-process is writing to the frozen filesystem, which makes it
>>> uninterruptible; not even SIGKILL works. Then the stop operation times
>>> out, and with STONITH disabled, on-fail defaults to block (don't perform
>>> any further operations on the resource):
>>> warning: Forcing writer-process-clone away from node1.example.org after
>>> 100 failures (max=100)
>>> After that, the cluster continues with the recovery process by
>>> restarting the gluster-mount resource on that node, and it usually
>>> succeeds. As a consequence of that remount, the uninterruptible system
>>> call in the writer process fails, signals are finally delivered, and the
>>> writer-process is terminated. But the cluster doesn't know about that!
>>>
>>> I thought I could solve this by setting the failure-timeout meta
>>> attribute on the writer-process resource, but it only made things worse.
>>> The documentation states: "Stop failures are slightly different and
>>> crucial. ... If a resource fails to stop and STONITH is not enabled,
>>> then the cluster has no way to continue and will not try to start the
>>> resource elsewhere, but will try to stop it again after the failure
>>> timeout.",

The documentation is silently making the assumption that the condition
that led to the initial stop is still true. In this case, if the gluster
failure has long since been cleaned up, there is no reason to try to
stop the writer-process.
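For reference, failure-timeout is a resource meta-attribute; a minimal
sketch of setting it with pcs (the value here is hypothetical):

   # let recorded failures expire after 10 minutes
   pcs resource meta writer-process failure-timeout=10min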

>>> but I'm seeing something different. When the policy engine runs after
>>> the next cluster-recheck-interval, the following lines are written to
>>> the syslog:
>>> crmd[11852]: notice: State transition S_IDLE -> S_POLICY_ENGINE
>>> pengine[11851]:  notice: Clearing expired failcount for writer-process:1
>>> on node1.example.org
>>> pengine[11851]:  notice: Clearing expired failcount for writer-process:1
>>> on node1.example.org
>>> pengine[11851]:  notice: Ignoring expired calculated failure
>>> writer-process_stop_0 (rc=1,
>>> magic=2:1;64:557:0:2169780b-ca1f-483e-ad42-118b7c7c1a7d) on
>>> node1.example.org
>>> pengine[11851]:  notice: Clearing expired failcount for writer-process:1
>>> on node1.example.org
>>> pengine[11851]:  notice: Ignoring expired calculated failure
>>> writer-process_stop_0 (rc=1,
>>> magic=2:1;64:557:0:2169780b-ca1f-483e-ad42-118b7c7c1a7d) on
node1.example.org

Re: [ClusterLabs] Antw: Behavior after stop action failure with the failure-timeout set and STONITH disabled

2017-05-05 Thread Dimitri Maziuk
On 05/05/2017 07:49 AM, Jan Wrona wrote:

> But AFAIK Pacemaker should work properly even with
> STONITH disabled, and the state I've run into doesn't seem right to me at
> all.

The keyword here is *should*. In my case the kernel (according to
lsof/fuser) keeps an open file descriptor on the DRBD device, and the
power button is the only way to "unfreeze" it.

Hack the RA to write the status file somewhere else perhaps?
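To illustrate the failure mode (this is not the RA's actual code), a
minimal sketch of the kind of depth-20 write check that hangs on a frozen
mount; the mountpoint is hypothetical:

   #!/bin/sh
   # try a small synchronous write to the mount, give up after 10 seconds
   MOUNTPOINT=/mnt/gluster
   if timeout 10 dd if=/dev/zero of="$MOUNTPOINT/.probe.$$" bs=4k count=1 \
         conv=fsync 2>/dev/null; then
       rm -f "$MOUNTPOINT/.probe.$$"
       echo "write OK"
   else
       # cleanup may itself hang while the mount is frozen, so skip it here
       echo "mount appears frozen (write did not complete in 10s)"
   fi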

-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [ClusterLabs] Antw: Behavior after stop action failure with the failure-timeout set and STONITH disabled

2017-05-05 Thread Jan Wrona

On 5.5.2017 08:15, Ulrich Windl wrote:

Jan Wrona wrote on 04.05.2017 at 16:41 in message:

I hope I'll be able to explain the problem clearly and correctly.

My setup (simplified): I have two cloned resources, a filesystem mount
and a process which writes to that filesystem. The filesystem is Gluster
so it's OK to clone it. I also have a mandatory ordering constraint
"start gluster-mount-clone then start writer-process-clone". I don't
have a STONITH device, so I've disabled STONITH by setting
stonith-enabled=false.
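A minimal sketch of this configuration in pcs syntax (resource agents and
most options omitted, so treat it as illustration only):

   # both resources are clones: every node mounts Gluster and runs a writer
   pcs resource clone gluster-mount
   pcs resource clone writer-process
   # mandatory ordering: mount first, then the writer
   pcs constraint order start gluster-mount-clone then start writer-process-clone
   # no fencing device available
   pcs property set stonith-enabled=false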

The problem: Sometimes the Gluster freezes for a while, which causes the
gluster-mount resource's monitor with OCF_CHECK_LEVEL=20 to time out
(it is unable to write the status file). When this happens, the cluster

Actually I would do two things:

1) Find out why Gluster freezes, and what to do to avoid that


It freezes when one of the underlying MD RAIDs starts its regular check.
I've decreased its speed limit (from the default 200 MB/s to 50 MB/s, I
cannot go any lower), but it helped only a little; the mount still tends
to freeze for a few seconds during the check.
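For reference, this throttle is the kernel's MD resync/check limit; a
sketch of lowering it (values are in KiB/s, so 51200 is roughly 50 MB/s):

   # cap the background check/resync rate for all MD arrays
   sysctl -w dev.raid.speed_limit_max=51200
   # equivalently: echo 51200 > /proc/sys/dev/raid/speed_limit_max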




2) Implement stonith


Currently I can't. But AFAIK Pacemaker should work properly even with
STONITH disabled, and the state I've run into doesn't seem right to me at
all. I was asking for clarification of what the cluster is trying to do
in such a situation. I don't understand the "Ignoring expired calculated
failure" log messages, and I don't understand why crm_mon was showing
that the writer-process is started even though it was not.




Regards,
Ulrich



tries to recover by restarting the writer-process resource. But the
writer-process is writing to the frozen filesystem, which makes it
uninterruptible; not even SIGKILL works. Then the stop operation times
out, and with STONITH disabled, on-fail defaults to block (don't perform
any further operations on the resource):
warning: Forcing writer-process-clone away from node1.example.org after
100 failures (max=100)
After that, the cluster continues with the recovery process by
restarting the gluster-mount resource on that node, and it usually
succeeds. As a consequence of that remount, the uninterruptible system
call in the writer process fails, signals are finally delivered, and the
writer-process is terminated. But the cluster doesn't know about that!
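For completeness, the stop-failure policy can also be made explicit on
the operation; a sketch in pcs syntax (block is already the default for
a failed stop when stonith-enabled=false):

   # after a failed stop, perform no further operations on the resource
   pcs resource update writer-process op stop interval=0 timeout=60s on-fail=block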

I thought I could solve this by setting the failure-timeout meta
attribute on the writer-process resource, but it only made things worse.
The documentation states: "Stop failures are slightly different and
crucial. ... If a resource fails to stop and STONITH is not enabled, then
the cluster has no way to continue and will not try to start the resource
elsewhere, but will try to stop it again after the failure timeout.",
but I'm seeing something different. When the policy engine runs after
the next cluster-recheck-interval, the following lines are written to
the syslog:
crmd[11852]: notice: State transition S_IDLE -> S_POLICY_ENGINE
pengine[11851]:  notice: Clearing expired failcount for writer-process:1
on node1.example.org
pengine[11851]:  notice: Clearing expired failcount for writer-process:1
on node1.example.org
pengine[11851]:  notice: Ignoring expired calculated failure
writer-process_stop_0 (rc=1,
magic=2:1;64:557:0:2169780b-ca1f-483e-ad42-118b7c7c1a7d) on
node1.example.org
pengine[11851]:  notice: Clearing expired failcount for writer-process:1
on node1.example.org
pengine[11851]:  notice: Ignoring expired calculated failure
writer-process_stop_0 (rc=1,
magic=2:1;64:557:0:2169780b-ca1f-483e-ad42-118b7c7c1a7d) on
node1.example.org
pengine[11851]: warning: Processing failed op monitor for
gluster-mount:1 on node1.example.org: unknown error (1)
pengine[11851]:  notice: Calculated transition 564, saving inputs in
/var/lib/pacemaker/pengine/pe-input-362.bz2
crmd[11852]:  notice: Transition 564 (Complete=2, Pending=0, Fired=0,
Skipped=0, Incomplete=0,
Source=/var/lib/pacemaker/pengine/pe-input-362.bz2): Complete
crmd[11852]:  notice: State transition S_TRANSITION_ENGINE -> S_IDLE
crmd[11852]:  notice: State transition S_IDLE -> S_POLICY_ENGINE
crmd[11852]: warning: No reason to expect node 3 to be down
crmd[11852]: warning: No reason to expect node 1 to be down
crmd[11852]: warning: No reason to expect node 1 to be down
crmd[11852]: warning: No reason to expect node 3 to be down
pengine[11851]: warning: Processing failed op stop for writer-process:1
on node1.example.org: unknown error (1)
pengine[11851]: warning: Processing failed op monitor for
gluster-mount:1 on node1.example.org: unknown error (1)
pengine[11851]: warning: Forcing writer-process-clone away from
node1.example.org after 100 failures (max=100)
pengine[11851]: warning: Forcing writer-process-clone away from
node1.example.org after 100 failures (max=100)
pengine[11851]: warning: Forcing writer-process-clone away from
node1.example.org after 100 failures (max=100)
pengine[11851]:  notice: Calculated tran