Re: [ClusterLabs] [corosync][Problem] Very long "pause detect ... " was detected.

2016-06-12 Thread Jan Friesse

Hideo,


Hi All,

One of our users built a cluster with corosync and Pacemaker in the
following environment.
The cluster runs between guests.

* Host/Guest : RHEL6.6 - kernel : 2.6.32-504.el6.x86_64
* libqb 0.17.1
* corosync 2.3.4
* Pacemaker 1.1.12

The cluster worked well.
When the user stopped the active guest, the following log messages were
output repeatedly on the standby guest.


What exactly do you mean by "active guest" and "standby guests"?



May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5515870 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5515920 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5515971 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516021 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516071 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516121 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516171 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516221 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516271 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516322 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516372 ms, flushing membership messages.
(snip)
May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5526172 ms, flushing membership messages.
May xx xx:26:03 standby-guest corosync[6311]:  [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5526222 ms, flushing membership messages.
(snip)



This is weird, not because of the enormous pause length, but because
corosync has a "scheduler pause" detector that normally warns before
the "Process pause detected ..." error is logged.



As a result, the standby guest failed to form an independent cluster.

The log reads as if a timer had stopped for 91 minutes, which is an
abnormally long time.

Did you see a similar problem?


Never



We suspect this may be a problem in libqb, in the kernel, or somewhere else.


What virtualization technology are you using? KVM?


* I suspect that setting the timer failed in reset_pause_timeout().


You can try to put asserts into this function, but there are really not 
many reasons why it should fail (either malloc returns NULL, or there is 
some nasty memory corruption).


Regards,
  Honza



Best Regards,
Hideo Yamauchi.


___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org






Re: [ClusterLabs] Pacemaker 1.1.15 - Release Candidate 4

2016-06-12 Thread Klaus Wenninger
Sorry, I hadn't seen that you had already replied (not sure why; maybe
I had just assumed so from the timestamps ;-) ).
But I guess what I've written doesn't really contradict your answer ...
although it might make us look a little uncoordinated. Anyway, sorry
again...

Klaus

On 06/12/2016 04:50 PM, Ken Gaillot wrote:
> On 06/12/2016 07:28 AM, Ferenc Wágner wrote:
>> Ken Gaillot  writes:
>>
>>> With this release candidate, we now provide three sample alert scripts
>>> to use with the new alerts feature, installed in the
>>> /usr/share/pacemaker/alerts directory.
>> Hi,
>>
>> Is there a real reason to name these scripts *.sample?  Sure, they are
>> samples, but they are also usable as-is, aren't they?
> Almost as-is -- copy them somewhere, rename them without ".sample", and
> mark them executable.
>
> After some discussion, we decided that this feature is not mature enough
> yet to provide the scripts for direct use. After we get some experience
> with how users actually use the feature and the sample scripts, we can
> gain more confidence in recommending them generally. Until then, we
> recommend that people examine the script source and edit it to suit
> their needs before using it.
>
> That said, I think the SNMP script in particular is quite useful.
>
> The log-to-file script is more a proof-of-concept that people can use as
> a template. The SMTP script may be useful, but probably paired with some
> custom software handling the recipient address, to avoid flooding a real
> person's mailbox when a cluster is active.
>
>>> The ./configure script has a new "--with-configdir" option.
>> This greatly simplifies packaging, thanks much!
>>
>> Speaking about packaging: are the alert scripts run by remote Pacemaker
>> nodes?  I couldn't find described which nodes run the alert scripts.
>> From the mailing list discussions I recall they are run by each node,
>> but this would be useful to spell out in the documentation, I think.
> Good point. Alert scripts are run only on cluster nodes, but they
> include remote node events. I'll make sure the documentation mentions that.
>
>> Similarly for the alert guarantees: I recall there's no such thing, but
>> one could also think they are part of transactions, thus having recovery
>> behavior similar to the resource operations.  Hmm... wouldn't such a
>> design actually make sense?
> We didn't want to make any cluster operation depend on alert script
> success. The only thing we can guarantee is that the cluster will try to
> call the alert script for each event. But if the system is going
> haywire, for example, we may be unable to spawn a new process due to
> some resource exhaustion, and of course the script itself may have problems.
>
> Also, we wanted to minimize the script interface, and keep it
> backward-compatible with crm_mon external scripts. We didn't want to add
> an OCF-style layer of meta-data, actions and return codes, instead
> keeping it as simple as possible for anyone writing one.
>
> Since it's a brand new feature, we definitely want feedback on all
> aspects once it's in actual use. If alert script failures turn out to
> be a big issue, I could see maybe reporting them in cluster status (and
> allowing that to be cleaned up).
>




Re: [ClusterLabs] Pacemaker 1.1.15 - Release Candidate 4

2016-06-12 Thread Klaus Wenninger
On 06/12/2016 02:28 PM, Ferenc Wágner wrote:
> Ken Gaillot  writes:
>
>> With this release candidate, we now provide three sample alert scripts
>> to use with the new alerts feature, installed in the
>> /usr/share/pacemaker/alerts directory.
> Hi,
>
> Is there a real reason to name these scripts *.sample?  Sure, they are
> samples, but they are also usable as-is, aren't they?
The focus is to show how the alert feature can be used, not to provide
smooth, generic usability.
There may be use cases where the scripts are useful as-is, but users
are still encouraged to have a look at the code and modify it as
needed.
As the interface stabilizes, it may turn out to be useful to split the
scripts out into their own project, like the fence- & resource-agents.
>> The ./configure script has a new "--with-configdir" option.
> This greatly simplifies packaging, thanks much!
>
> Speaking about packaging: are the alert scripts run by remote Pacemaker
> nodes?  I couldn't find described which nodes run the alert scripts.
> From the mailing list discussions I recall they are run by each node,
> but this would be useful to spell out in the documentation, I think.
The alert scripts are called by crmd, so no alert scripts run on
remote nodes...
I had planned to add a few lines about the environment and conditions
under which alert scripts are run, so that somebody writing one can get
a feeling for what to expect and what to be careful with.
>
> Similarly for the alert guarantees: I recall there's no such thing, but
> one could also think they are part of transactions, thus having recovery
> behavior similar to the resource operations.  Hmm... wouldn't such a
> design actually make sense?
One of the goals of the design was to make it as simple as possible
and to depend on as little as possible (e.g. spawning all alerts in
separate processes, in parallel, as they occur; not 100%
fire-and-forget, but quite close ...)




Re: [ClusterLabs] Pacemaker 1.1.15 - Release Candidate 4

2016-06-12 Thread Ken Gaillot
On 06/12/2016 07:28 AM, Ferenc Wágner wrote:
> Ken Gaillot  writes:
> 
>> With this release candidate, we now provide three sample alert scripts
>> to use with the new alerts feature, installed in the
>> /usr/share/pacemaker/alerts directory.
> 
> Hi,
> 
> Is there a real reason to name these scripts *.sample?  Sure, they are
> samples, but they are also usable as-is, aren't they?

Almost as-is -- copy them somewhere, rename them without ".sample", and
mark them executable.
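
Concretely, the copy/rename/mark-executable step can be done with a
single install(1) command; note the source filename and destination
directory below are illustrative assumptions, not documented paths:

```shell
# install(1) copies, renames (dropping ".sample"), and sets the
# executable mode in one step. Both paths are examples; adjust them.
install -m 0755 /usr/share/pacemaker/alerts/alert_snmp.sh.sample \
    /usr/local/bin/alert_snmp.sh
```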

After some discussion, we decided that this feature is not mature enough
yet to provide the scripts for direct use. After we get some experience
with how users actually use the feature and the sample scripts, we can
gain more confidence in recommending them generally. Until then, we
recommend that people examine the script source and edit it to suit
their needs before using it.

That said, I think the SNMP script in particular is quite useful.

The log-to-file script is more a proof-of-concept that people can use as
a template. The SMTP script may be useful, but probably paired with some
custom software handling the recipient address, to avoid flooding a real
person's mailbox when a cluster is active.

>> The ./configure script has a new "--with-configdir" option.
> 
> This greatly simplifies packaging, thanks much!
> 
> Speaking about packaging: are the alert scripts run by remote Pacemaker
> nodes?  I couldn't find described which nodes run the alert scripts.
> From the mailing list discussions I recall they are run by each node,
> but this would be useful to spell out in the documentation, I think.

Good point. Alert scripts are run only on cluster nodes, but they
include remote node events. I'll make sure the documentation mentions that.

> Similarly for the alert guarantees: I recall there's no such thing, but
> one could also think they are part of transactions, thus having recovery
> behavior similar to the resource operations.  Hmm... wouldn't such a
> design actually make sense?

We didn't want to make any cluster operation depend on alert script
success. The only thing we can guarantee is that the cluster will try to
call the alert script for each event. But if the system is going
haywire, for example, we may be unable to spawn a new process due to
some resource exhaustion, and of course the script itself may have problems.

Also, we wanted to minimize the script interface, and keep it
backward-compatible with crm_mon external scripts. We didn't want to add
an OCF-style layer of meta-data, actions and return codes, instead
keeping it as simple as possible for anyone writing one.

Since it's a brand new feature, we definitely want feedback on all
aspects once it's in actual use. If alert script failures turn out to
be a big issue, I could see maybe reporting them in cluster status (and
allowing that to be cleaned up).



Re: [ClusterLabs] Pacemaker 1.1.15 - Release Candidate 4

2016-06-12 Thread Ferenc Wágner
Ken Gaillot  writes:

> With this release candidate, we now provide three sample alert scripts
> to use with the new alerts feature, installed in the
> /usr/share/pacemaker/alerts directory.

Hi,

Is there a real reason to name these scripts *.sample?  Sure, they are
samples, but they are also usable as-is, aren't they?

> The ./configure script has a new "--with-configdir" option.

This greatly simplifies packaging, thanks much!

Speaking about packaging: are the alert scripts run by remote Pacemaker
nodes?  I couldn't find described which nodes run the alert scripts.
From the mailing list discussions I recall they are run by each node,
but this would be useful to spell out in the documentation, I think.

Similarly for the alert guarantees: I recall there's no such thing, but
one could also think they are part of transactions, thus having recovery
behavior similar to the resource operations.  Hmm... wouldn't such a
design actually make sense?
-- 
Thanks,
Feri



[ClusterLabs] [corosync][Problem] Very long "pause detect ... " was detected.

2016-06-12 Thread renayama19661014
Hi All,

One of our users built a cluster with corosync and Pacemaker in the
following environment.
The cluster runs between guests.

* Host/Guest : RHEL6.6 - kernel : 2.6.32-504.el6.x86_64
* libqb 0.17.1
* corosync 2.3.4
* Pacemaker 1.1.12

The cluster worked well.
When the user stopped the active guest, the following log messages were
output repeatedly on the standby guest.

May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5515870 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5515920 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5515971 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516021 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516071 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516121 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516171 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516221 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516271 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516322 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5516372 ms, flushing membership messages.
(snip)
May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5526172 ms, flushing membership messages.
May xx xx:26:03 standby-guest corosync[6311]:  [MAIN  ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
May xx xx:26:03 standby-guest corosync[6311]:  [TOTEM ] Process pause detected for 5526222 ms, flushing membership messages.
(snip)

As a result, the standby guest failed to form an independent cluster.

The log reads as if a timer had stopped for 91 minutes, which is an
abnormally long time.

Did you see a similar problem?

We suspect this may be a problem in libqb, in the kernel, or somewhere else.
* I suspect that setting the timer failed in reset_pause_timeout().

Best Regards,
Hideo Yamauchi.

