Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
On 05/24/2017 08:04 AM, Attila Megyeri wrote:
> Hi Klaus,
>
> Thank you for your response.
> I tried many things, but no luck.
>
> We have many pacemaker clusters with 99% identical configurations, package
> versions, and only this one causes issues. (BTW we use unicast for corosync,
> but this is the same for our other clusters as well.)
> I checked all connection settings between the nodes (to confirm there are no
> firewall issues), increased the number of cores on each node, but still - as
> long as a monitor operation is pending for a resource, no other operation is
> executed.
>
> e.g. resource A is being monitored, and timeout is 90 seconds, until this
> check times out I cannot do a cleanup or start/stop on any other resource.

Do you have any constraints configured? If B depends on A, you probably want at least an ordering constraint. Then the cluster would stop B before stopping A, and not try to start it until A is up again.

Throttling based on load wasn't added until Pacemaker 1.1.11, so the only limit on parallel execution in 1.1.10 was batch-limit, which defaulted to 30 at the time.

I'd investigate by figuring out which node was DC at the time and checking its pacemaker log (preferably with PCMK_debug=crmd turned on). You can see each run of the policy engine and what decisions were made, ending with a message like "saving inputs in /var/lib/pacemaker/pengine/pe-input-4940.bz2". You can run crm_simulate on that file to get more information about the decision-making process.

"crm_simulate -Sx $FILE -D transition.dot" will create a dot graph of the transition showing dependencies. You can convert the graph to an svg with "dot transition.dot -Tsvg > transition.svg" and then look at that file in any SVG viewer (including most browsers).
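The investigation steps in the reply above could be scripted roughly as follows. This is a minimal sketch, not a definitive tool: the helper names list_pe_inputs and render_transition are made up for illustration, and the grep pattern assumes the log wording matches the "saving inputs in ..." message quoted in the reply. The crm_simulate and dot invocations are the ones given there.

```shell
# Sketch only. Helper names are hypothetical; the log wording is assumed to
# match the "saving inputs in ..." message quoted in the reply.

# Print every policy-engine input file mentioned in a log file.
# $1: path to the pacemaker log of the node that was DC at the time
list_pe_inputs() {
    grep -o 'saving inputs in [^ "]*\.bz2' "$1" | sed 's/^saving inputs in //'
}

# Replay one saved transition and render its dependency graph as SVG.
# Needs crm_simulate (Pacemaker) and dot (Graphviz) on the PATH.
# $1: pe-input file, $2: basename for the .dot and .svg output files
render_transition() {
    crm_simulate -Sx "$1" -D "$2.dot" && dot "$2.dot" -Tsvg > "$2.svg"
}
```

For example, running list_pe_inputs on the DC's log (its location differs between distributions, so the path is left to the reader) prints one file per policy engine run. Passing one of them to render_transition together with the basename "transition" writes transition.dot and transition.svg to the current directory.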
Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
Hi Klaus,

Thank you for your response.
I tried many things, but no luck.

We have many pacemaker clusters with 99% identical configurations and package versions, and only this one causes issues. (BTW we use unicast for corosync, but this is the same for our other clusters as well.)

I checked all connection settings between the nodes (to confirm there are no firewall issues) and increased the number of cores on each node, but still, as long as a monitor operation is pending for a resource, no other operation is executed.

E.g. resource A is being monitored with a timeout of 90 seconds; until this check times out I cannot do a cleanup or start/stop on any other resource.

Two more interesting things:
- cluster recheck is set to 2 minutes, and even though the resources are running properly, the fail counters are not reduced and crm_mon keeps listing the resources in the failed actions section, forever, or until I manually do a resource cleanup.
- If I execute a crm resource cleanup RES_name from another node, sometimes it simply does not clean up the failed state. If I execute this from the node where the resource IS actually running, the resource is removed from the failed actions.

What do you recommend, how could I start troubleshooting these issues? As I said, this setup works fine in several other systems, but here I am really, really stuck.

thanks!
Attila
Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
On 05/09/2017 10:34 PM, Attila Megyeri wrote:
> Actually I found some more details:
>
> there are two resources: A and B
>
> resource B depends on resource A (when the RA monitors B, it will fail
> if A is not running properly)
>
> If I stop resource A, the next monitor operation of "B" will fail.
> Interestingly, this check happens immediately after A is stopped.
>
> B is configured to restart if monitor fails. Start timeout is rather
> long, 180 seconds. So pacemaker tries to restart B, and waits.
>
> If I want to start "A", nothing happens until the start operation of
> "B" fails - typically several minutes.
>
> Is this the right behavior?
>
> It appears that pacemaker is blocked until resource B is being
> started, and I cannot really start its dependency...
>
> Shouldn't it be possible to start a resource while another resource is
> also starting?

As long as resources don't depend on each other, parallel starting should work/happen.

The number of parallel actions executed is derived from the number of cores, and when load is detected some kind of throttling kicks in (in fact a reduction of the operations executed in parallel, with the aim of reducing the load induced by pacemaker). When throttling kicks in you should get log messages (there is in fact a parallel discussion going on ...). No idea if throttling might be a reason here, but maybe worth considering at least.

Another reason I've observed for certain things happening with quite some delay is that some situations are only resolved when the cluster-recheck-interval triggers a pengine run in addition to those triggered by changes. You might easily verify this by changing the cluster-recheck-interval.

Regards,
Klaus

--
Klaus Wenninger
Senior Software Engineer, EMEA ENG Openstack Infrastructure
Red Hat
kwenn...@redhat.com

___
Users mailing list: Users@clusterlabs.org
http://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
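The suggestion above to change the cluster-recheck-interval could be tried with crm_attribute, which ships with Pacemaker. The following is a minimal sketch under assumptions: the two helper names are hypothetical, and any value passed in (such as 1min) is only an example, not a recommendation.

```shell
# Sketch only. Helper names are hypothetical; crm_attribute must be on the PATH.

# Show the currently configured cluster-recheck-interval.
get_recheck_interval() {
    crm_attribute --type crm_config --name cluster-recheck-interval --query
}

# Set a new cluster-recheck-interval.
# $1: new interval, for example 1min
set_recheck_interval() {
    crm_attribute --type crm_config --name cluster-recheck-interval --update "$1"
}
```

One way to use it: note the current value with get_recheck_interval, shorten it with set_recheck_interval, and repeat the slow start/stop/cleanup. If the delay shrinks to roughly the new interval, that would suggest the action is only being picked up by the periodic recheck rather than by the change itself. Restore the original value afterwards.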
Re: [ClusterLabs] Pacemaker occasionally takes minutes to respond
Actually I found some more details:

there are two resources: A and B

resource B depends on resource A (when the RA monitors B, it will fail if A is not running properly)

If I stop resource A, the next monitor operation of "B" will fail. Interestingly, this check happens immediately after A is stopped.

B is configured to restart if monitor fails. Start timeout is rather long, 180 seconds. So pacemaker tries to restart B, and waits.

If I want to start "A", nothing happens until the start operation of "B" fails - typically several minutes.

Is this the right behavior?

It appears that pacemaker is blocked until resource B is being started, and I cannot really start its dependency...

Shouldn't it be possible to start a resource while another resource is also starting?

Thanks,
Attila

From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
Sent: Tuesday, May 9, 2017 9:53 PM
To: users@clusterlabs.org; kgail...@redhat.com
Subject: [ClusterLabs] Pacemaker occasionally takes minutes to respond

Hi Ken, all,

We ran into an issue very similar to the one described in https://bugzilla.redhat.com/show_bug.cgi?id=1430112 / [Intel 7.4 Bug] Pacemaker occasionally takes minutes to respond

But in our case we are not using fencing/stonith at all.

Many times when I want to start/stop/cleanup a resource, it takes tens of seconds (or even minutes) till the command gets executed. The logs show nothing in that period, the redundant rings show no fault.

Could this be the same issue?

Any hints on how to troubleshoot this?

It is pacemaker 1.1.10, corosync 2.3.3

Cheers,
Attila
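Pacemaker only knows that B depends on A if that dependency is configured as a constraint; a monitor in B's resource agent that fails when A is down does not tell the cluster about it. A minimal sketch of adding an ordering constraint with the crm shell follows. The helper name and the generated constraint id are hypothetical, and the "Mandatory:" spelling is an assumption about the installed crmsh version: older releases may expect a score such as "inf:" instead, which "crm configure help order" should confirm.

```shell
# Sketch only. The helper name and the constraint id are hypothetical, and the
# "Mandatory:" keyword is an assumption about the installed crmsh version.

# Tell the cluster that resource $2 may only be started once $1 is up.
# $1: resource that must start first, $2: resource that depends on it
add_order_constraint() {
    crm configure order "order_${1}_then_${2}" Mandatory: "$1" "$2"
}
```

Calling add_order_constraint with A and B as arguments would create a constraint named order_A_then_B. With it in place the cluster stops B before stopping A and does not try to start B again until A is running, instead of repeatedly attempting a 180 second start of B that cannot succeed.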