Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-31 Thread Ken Gaillot
On 05/24/2017 08:04 AM, Attila Megyeri wrote:
> Hi Klaus,
> 
> Thank you for your response.
> I tried many things, but no luck.
> 
> We have many pacemaker clusters with 99% identical configurations and
> package versions, and only this one causes issues. (BTW we use unicast for
> corosync, but this is the same for our other clusters as well.)
> I checked all connection settings between the nodes (to confirm there are no
> firewall issues) and increased the number of cores on each node, but still,
> as long as a monitor operation is pending for a resource, no other operation
> is executed.
>
> For example, if resource A is being monitored with a timeout of 90 seconds,
> I cannot do a cleanup or start/stop on any other resource until this check
> times out.

Do you have any constraints configured? If B depends on A, you probably
want at least an ordering constraint. Then the cluster would stop B
before stopping A, and not try to start it until A is up again.
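
As a rough sketch only: assuming the resources are literally named A and B
and that you manage the configuration with the crm shell (you mention "crm
resource cleanup"), an ordering constraint could be added with a small
helper like the one below. The helper name and the constraint ID are made
up for illustration.

```shell
# add_order FIRST THEN   (hypothetical helper, not part of Pacemaker or crmsh)
# Adds a mandatory ordering constraint: FIRST is started before THEN,
# and THEN is stopped before FIRST.
add_order() {
    # "order-FIRST-then-THEN" is an arbitrary constraint ID chosen here.
    crm configure order "order-$1-then-$2" inf: "$1" "$2"
}

# Example (run on a cluster node): add_order A B
```

If B also has to run on the same node as A, a colocation constraint would
be needed in addition to the ordering.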

Throttling based on load wasn't added until Pacemaker 1.1.11, so the
only limit on parallel execution in 1.1.10 was batch-limit, which
defaulted to 30 at the time.
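
To see whether batch-limit has been set explicitly on this cluster,
something like the following could be used. This is only a sketch: the
helper name is hypothetical, and it assumes crm_attribute is run on a node
that can reach the CIB.

```shell
# show_batch_limit   (hypothetical helper)
# Queries the batch-limit cluster option. The query fails when the option
# has never been set explicitly (the built-in default then applies) or when
# the cluster cannot be reached, so the fallback message covers both cases.
show_batch_limit() {
    crm_attribute -t crm_config -n batch-limit -G \
        || echo "could not read batch-limit (not set explicitly, or no cluster connection)"
}

# Example (run on a cluster node): show_batch_limit
```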

I'd investigate by figuring out which node was DC at the time and
checking its pacemaker log (preferably with PCMK_debug=crmd turned on).
You can see each run of the policy engine and what decisions were made,
ending with a message like "saving inputs in
/var/lib/pacemaker/pengine/pe-input-4940.bz2". You can run crm_simulate
on that file to get more information about the decision-making process.
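
One possible way to pull those file names out of the log is sketched below.
The helper name is hypothetical, and since the exact wording of the log
message differs between versions, it matches on the pe-input/pe-warn/pe-error
file name pattern rather than on the message text.

```shell
# list_pe_inputs LOGFILE   (hypothetical helper)
# Prints every policy engine input file mentioned in LOGFILE once,
# in order of first appearance.
list_pe_inputs() {
    # grep -o prints only the matching path; awk drops repeated paths.
    grep -oE '/[^ "]*pe-(input|warn|error)-[0-9]+\.bz2' "$1" | awk '!seen[$0]++'
}

# Example: list_pe_inputs /var/log/pacemaker.log | tail -n 1
```

The log path in the example is an assumption; where the DC writes its
pacemaker log depends on how logging is set up on your distribution.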

"crm_simulate -Sx $FILE -D transition.dot" will create a dot graph of
the transition showing dependencies. You can convert the graph to an svg
with "dot transition.dot -Tsvg > transition.svg" and then look at that
file in any SVG viewer (including most browsers).
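
Put together, the two commands above could be wrapped roughly like this.
The function name and the output prefix are placeholders, and it assumes
graphviz (which provides dot) is installed on the machine where you run it.

```shell
# render_transition PE_INPUT [OUT_PREFIX]   (hypothetical helper)
# Replays the saved policy engine input with crm_simulate, saves the
# transition graph as OUT_PREFIX.dot and converts it to OUT_PREFIX.svg.
render_transition() {
    pe_input="$1"
    out_prefix="${2:-transition}"
    # Stop here if the simulation fails, so no empty SVG is produced.
    crm_simulate -Sx "$pe_input" -D "$out_prefix.dot" || return 1
    dot "$out_prefix.dot" -Tsvg > "$out_prefix.svg"
}

# Example: render_transition /var/lib/pacemaker/pengine/pe-input-4940.bz2
```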

> Two more interesting things:
> - The cluster recheck interval is set to 2 minutes, and even though the
> resources are running properly, the fail counters are not reduced and crm_mon
> keeps listing the resources in the failed actions section, forever, or until
> I manually do a resource cleanup.
> - If I execute a crm resource cleanup RES_name from another node, sometimes
> it simply does not clean up the failed state. If I execute it from the node
> where the resource IS actually running, the resource is removed from the
> failed actions.
> 
> 
> What do you recommend? How could I start troubleshooting these issues? As I
> said, this setup works fine in several other systems, but here I am
> really, really stuck.
> 
> 
> thanks!
> 
> Attila
> 
> 
> 
> 
> 
>> -----Original Message-----
>> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
>> Sent: Wednesday, May 10, 2017 2:04 PM
>> To: users@clusterlabs.org
>> Subject: Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
>>
>> On 05/09/2017 10:34 PM, Attila Megyeri wrote:
>>>
>>> Actually I found some more details:
>>>
>>>
>>>
>>> there are two resources: A and B
>>>
>>>
>>>
>>> resource B depends on resource A (when the RA monitors B, it will fail
>>> if A is not running properly)
>>>
>>>
>>>
>>> If I stop resource A, the next monitor operation of "B" will fail.
>>> Interestingly, this check happens immediately after A is stopped.
>>>
>>>
>>>
>>> B is configured to restart if monitor fails. Start timeout is rather
>>> long, 180 seconds. So pacemaker tries to restart B, and waits.
>>>
>>>
>>>
>>> If I want to start "A", nothing happens until the start operation of
>>> "B" fails - typically several minutes.
>>>
>>>
>>>
>>>
>>>
>>> Is this the right behavior?
>>>
>>> It appears that pacemaker is blocked while resource B is being
>>> started, and I cannot really start its dependency...
>>>
>>> Shouldn't it be possible to start a resource while another resource is
>>> also starting?
>>>
>>
>> As long as resources don't depend on each other, parallel starting should
>> work/happen.
>>
>> The number of parallel actions executed is derived from the number of
>> cores, and when load is detected some kind of throttling kicks in (in fact
>> a reduction of the operations executed in parallel, with the aim of reducing
>> the load induced by pacemaker). When throttling kicks in you should get log
>> messages (there is in fact a parallel discussion going on ...).
>> No idea if throttling might be a reason here but maybe worth considering
>> at least.
>>
>> Another

Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-24 Thread Attila Megyeri
Hi Klaus,

Thank you for your response.
I tried many things, but no luck.

We have many pacemaker clusters with 99% identical configurations and package
versions, and only this one causes issues. (BTW we use unicast for corosync,
but this is the same for our other clusters as well.)
I checked all connection settings between the nodes (to confirm there are no
firewall issues) and increased the number of cores on each node, but still,
as long as a monitor operation is pending for a resource, no other operation
is executed.

For example, if resource A is being monitored with a timeout of 90 seconds, I
cannot do a cleanup or start/stop on any other resource until this check
times out.

Two more interesting things:
- The cluster recheck interval is set to 2 minutes, and even though the
resources are running properly, the fail counters are not reduced and crm_mon
keeps listing the resources in the failed actions section, forever, or until I
manually do a resource cleanup.
- If I execute a crm resource cleanup RES_name from another node, sometimes it
simply does not clean up the failed state. If I execute it from the node
where the resource IS actually running, the resource is removed from the
failed actions.


What do you recommend? How could I start troubleshooting these issues? As I
said, this setup works fine in several other systems, but here I am
really, really stuck.


thanks!

Attila





> -----Original Message-----
> From: Klaus Wenninger [mailto:kwenn...@redhat.com]
> Sent: Wednesday, May 10, 2017 2:04 PM
> To: users@clusterlabs.org
> Subject: Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
> 
> On 05/09/2017 10:34 PM, Attila Megyeri wrote:
> >
> > Actually I found some more details:
> >
> >
> >
> > there are two resources: A and B
> >
> >
> >
> > resource B depends on resource A (when the RA monitors B, it will fail
> > if A is not running properly)
> >
> >
> >
> > If I stop resource A, the next monitor operation of "B" will fail.
> > Interestingly, this check happens immediately after A is stopped.
> >
> >
> >
> > B is configured to restart if monitor fails. Start timeout is rather
> > long, 180 seconds. So pacemaker tries to restart B, and waits.
> >
> >
> >
> > If I want to start "A", nothing happens until the start operation of
> > "B" fails - typically several minutes.
> >
> >
> >
> >
> >
> > Is this the right behavior?
> >
> > It appears that pacemaker is blocked while resource B is being
> > started, and I cannot really start its dependency...
> >
> > Shouldn't it be possible to start a resource while another resource is
> > also starting?
> >
> 
> As long as resources don't depend on each other, parallel starting should
> work/happen.
>
> The number of parallel actions executed is derived from the number of
> cores, and when load is detected some kind of throttling kicks in (in fact
> a reduction of the operations executed in parallel, with the aim of reducing
> the load induced by pacemaker). When throttling kicks in you should get log
> messages (there is in fact a parallel discussion going on ...).
> No idea if throttling might be a reason here but maybe worth considering
> at least.
> 
> Another reason I've observed for certain things happening with quite some
> delay is that some situations only get resolved when the
> cluster-recheck-interval triggers a pengine run, in addition to the runs
> triggered by changes.
> You might easily verify this by changing the cluster-recheck-interval.
> 
> Regards,
> Klaus
> 
> >
> >
> >
> >
> > Thanks,
> >
> > Attila
> >
> >
> >
> >
> >
> > *From:*Attila Megyeri [mailto:amegy...@minerva-soft.com]
> > *Sent:* Tuesday, May 9, 2017 9:53 PM
> > *To:* users@clusterlabs.org; kgail...@redhat.com
> > *Subject:* [ClusterLabs] Pacemaker occasionally takes minutes to respond
> >
> >
> >
> > Hi Ken, all,
> >
> >
> >
> >
> >
> > We ran into an issue very similar to the one described in
> > https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug]
> > Pacemaker occasionally takes minutes to respond
> >
> >
> >
> > But in our case we are not using fencing/stonith at all.
> >
> >
> >
> > Many times when I want to start/stop/cleanup a resource, it takes tens
> > of seconds (or even minutes) till the command gets executed. The logs
> > show nothing in that period, the redundant rings show no fault.
> >
> >
> >
> > Could this be the same issue?
> >
> >

Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-10 Thread Klaus Wenninger
On 05/09/2017 10:34 PM, Attila Megyeri wrote:
>
> Actually I found some more details:
>
>  
>
> there are two resources: A and B
>
>  
>
> resource B depends on resource A (when the RA monitors B, it will fail
> if A is not running properly)
>
>  
>
> If I stop resource A, the next monitor operation of „B” will fail.
> Interestingly, this check happens immediately after A is stopped.
>
>  
>
> B is configured to restart if monitor fails. Start timeout is rather
> long, 180 seconds. So pacemaker tries to restart B, and waits.
>
>  
>
> If I want to start „A”, nothing happens until the start operation of
> „B” fails – typically several minutes.
>
>  
>
>  
>
> Is this the right behavior?
>
> It appears that pacemaker is blocked while resource B is being
> started, and I cannot really start its dependency…
>
> Shouldn’t it be possible to start a resource while another resource is
> also starting?
>

As long as resources don't depend on each other, parallel starting should
work/happen.

The number of parallel actions executed is derived from the number of
cores, and when load is detected some kind of throttling kicks in (in fact
a reduction of the operations executed in parallel, with the aim of reducing
the load induced by pacemaker). When throttling kicks in you should get log
messages (there is in fact a parallel discussion going on ...).
No idea if throttling might be a reason here but maybe worth considering
at least.

Another reason I've observed for certain things happening with quite some
delay is that some situations only get resolved when the
cluster-recheck-interval triggers a pengine run, in addition to the runs
triggered by changes.
You might easily verify this by changing the cluster-recheck-interval.
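
A minimal sketch of how the interval could be changed for such a test,
assuming crm_attribute is available on the node. The helper name and the
60s value in the example are only illustrations.

```shell
# set_recheck_interval VALUE   (hypothetical helper)
# Updates the cluster-recheck-interval cluster option, e.g. to "60s".
set_recheck_interval() {
    crm_attribute -t crm_config -n cluster-recheck-interval -v "$1"
}

# Example (run on a cluster node): set_recheck_interval 60s
```

If the delays shrink along with the interval, that points at the recheck
timer; remember to set the previous value again afterwards.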

Regards,
Klaus

>  
>
>  
>
> Thanks,
>
> Attila
>
>  
>
>  
>
> *From:*Attila Megyeri [mailto:amegy...@minerva-soft.com]
> *Sent:* Tuesday, May 9, 2017 9:53 PM
> *To:* users@clusterlabs.org; kgail...@redhat.com
> *Subject:* [ClusterLabs] Pacemaker occasionally takes minutes to respond
>
>  
>
> Hi Ken, all,
>
>  
>
>  
>
> We ran into an issue very similar to the one described in
> https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug]
> Pacemaker occasionally takes minutes to respond
>
>
>
> But in our case we are not using fencing/stonith at all.
>
>  
>
> Many times when I want to start/stop/cleanup a resource, it takes tens
> of seconds (or even minutes) till the command gets executed. The logs
> show nothing in that period, the redundant rings show no fault.
>
>  
>
> Could this be the same issue?
>
>  
>
> Any hints on how to troubleshoot this?
>
> It is pacemaker 1.1.10, corosync 2.3.3
>
>  
>
>  
>
> Cheers,
>
> Attila
>
>  
>
>  
>
>  
>
>
>
> ___
> Users mailing list: Users@clusterlabs.org
> http://lists.clusterlabs.org/mailman/listinfo/users
>
> Project Home: http://www.clusterlabs.org
> Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
> Bugs: http://bugs.clusterlabs.org


-- 
Klaus Wenninger

Senior Software Engineer, EMEA ENG Openstack Infrastructure

Red Hat

kwenn...@redhat.com   




Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond

2017-05-09 Thread Attila Megyeri
Actually I found some more details:

there are two resources: A and B

resource B depends on resource A (when the RA monitors B, it will fail if A is
not running properly)

If I stop resource A, the next monitor operation of "B" will fail. 
Interestingly, this check happens immediately after A is stopped.

B is configured to restart if monitor fails. Start timeout is rather long, 180 
seconds. So pacemaker tries to restart B, and waits.

If I want to start "A", nothing happens until the start operation of "B" fails 
- typically several minutes.


Is this the right behavior?
It appears that pacemaker is blocked while resource B is being started, and I
cannot really start its dependency...
Shouldn't it be possible to start a resource while another resource is also 
starting?


Thanks,
Attila


From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Tuesday, May 9, 2017 9:53 PM
To: users@clusterlabs.org; kgail...@redhat.com
Subject: [ClusterLabs] Pacemaker occasionally takes minutes to respond

Hi Ken, all,


We ran into an issue very similar to the one described in
https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug]
Pacemaker occasionally takes minutes to respond

But in our case we are not using fencing/stonith at all.

Many times when I want to start/stop/cleanup a resource, it takes tens of 
seconds (or even minutes) till the command gets executed. The logs show nothing 
in that period, the redundant rings show no fault.

Could this be the same issue?

Any hints on how to troubleshoot this?
It is pacemaker 1.1.10, corosync 2.3.3


Cheers,
Attila


