Re: [ClusterLabs] [corosync][Problem] Very long "pause detect ... " was detected.
Hideo,

> Hi All,
>
> Our user built a cluster with corosync and Pacemaker in the following
> environment. The cluster was formed between guests.
>
> * Host/Guest : RHEL6.6
>   - kernel : 2.6.32-504.el6.x86_64
> * libqb 0.17.1
> * corosync 2.3.4
> * Pacemaker 1.1.12
>
> The cluster worked well. When a user stopped the active guest, the
> following log was output repeatedly on the standby guest.

What exactly do you mean by "active guest" and "standby guests"?

> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5515870 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5515920 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5515971 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516021 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516071 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516121 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516171 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516221 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516271 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516322 ms, flushing membership messages.
> May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516372 ms, flushing membership messages.
> (snip)
> May xx xx:26:03 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5526172 ms, flushing membership messages.
> May xx xx:26:03 standby-guest corosync[6311]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
> May xx xx:26:03 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5526222 ms, flushing membership messages.
> (snip)

This is weird. Not because of the enormous pause length, but because corosync has a "scheduler pause" detector which warns before the "Process pause detected ..." error is logged.

> As a result, the standby guest failed to form an independent cluster.
> The log reads as if a timer stopped for 91 minutes, which is an
> abnormally long time. Did you see a similar problem?

Never.

> Possibly it is a libqb, kernel, or some other kind of problem.

What virtualization technology are you using? KVM?

> * I suspect that setting the timer failed in reset_pause_timeout().

You can try to put asserts into this function, but there are really not many reasons why it should fail (either malloc returns NULL, or some nasty memory corruption).

Regards,
  Honza

> Best Regards,
> Hideo Yamauchi.

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
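For readers unfamiliar with the mechanism being discussed: the detector behind these messages periodically re-arms a short timer and, when the timer fires, compares how much time really elapsed against the expected interval. The sketch below is illustrative only (it is not corosync's actual code; the function name and thresholds are made up) and just shows why an unscheduled process, or a timer that was never re-armed, would produce such enormous gaps:

```shell
# Illustrative sketch only -- NOT corosync's implementation.
# detect_pause EXPECTED_MS T0 T1 T2 ...
#   EXPECTED_MS : interval the periodic timer is supposed to fire at
#   T0 T1 ...   : observed callback timestamps in ms
# Reports any gap well beyond the expected interval, like the
# "Process pause detected" messages in the log above.
detect_pause() {
    expected_ms=$1; shift
    prev=$1; shift
    for now in "$@"; do
        gap=$(( now - prev ))
        # only gaps far beyond the expected firing interval are pauses
        if [ "$gap" -gt $(( expected_ms * 2 )) ]; then
            echo "Process pause detected for ${gap} ms"
        fi
        prev=$now
    done
}

# A 50 ms timer whose callback was once delayed by ~92 minutes:
detect_pause 50 0 50 100 5515970
```

If reset_pause_timeout() ever failed to re-arm the timer, the next measurement could come only much later, which would look exactly like such a gap; that is consistent with Honza's suggestion to put asserts into that function.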
Re: [ClusterLabs] Pacemaker 1.1.15 - Release Candidate 4
Sorry, I hadn't seen that you had replied (don't know why - maybe I had just derived that from the time ;-) ). But I guess what I've written doesn't really contradict your answer ... although it might make us look a little uncoordinated - anyway, sorry again...

Klaus

On 06/12/2016 04:50 PM, Ken Gaillot wrote:
> On 06/12/2016 07:28 AM, Ferenc Wágner wrote:
>> Ken Gaillot writes:
>>
>>> With this release candidate, we now provide three sample alert scripts
>>> to use with the new alerts feature, installed in the
>>> /usr/share/pacemaker/alerts directory.
>>
>> Hi,
>>
>> Is there a real reason to name these scripts *.sample? Sure, they are
>> samples, but they are also usable as-is, aren't they?
>
> Almost as-is -- copy them somewhere, rename them without ".sample", and
> mark them executable.
>
> After some discussion, we decided that this feature is not mature enough
> yet to provide the scripts for direct use. After we get some experience
> with how users actually use the feature and the sample scripts, we can
> gain more confidence in recommending them generally. Until then, we
> recommend that people examine the script source and edit it to suit
> their needs before using it.
>
> That said, I think the SNMP script in particular is quite useful.
>
> The log-to-file script is more a proof-of-concept that people can use as
> a template. The SMTP script may be useful, but probably paired with some
> custom software handling the recipient address, to avoid flooding a real
> person's mailbox when a cluster is active.
>
>>> The ./configure script has a new "--with-configdir" option.
>>
>> This greatly simplifies packaging, thanks much!
>>
>> Speaking about packaging: are the alert scripts run by remote Pacemaker
>> nodes? I couldn't find described which nodes run the alert scripts.
>> From the mailing list discussions I recall they are run by each node,
>> but this would be useful to spell out in the documentation, I think.
>
> Good point. Alert scripts are run only on cluster nodes, but they
> include remote node events. I'll make sure the documentation mentions that.
>
>> Similarly for the alert guarantees: I recall there's no such thing, but
>> one could also think they are parts of transactions, thus having recovery
>> behavior similar to the resource operations. Hmm... wouldn't such a
>> design actually make sense?
>
> We didn't want to make any cluster operation depend on alert script
> success. The only thing we can guarantee is that the cluster will try to
> call the alert script for each event. But if the system is going
> haywire, for example, we may be unable to spawn a new process due to
> some resource exhaustion, and of course the script itself may have problems.
>
> Also, we wanted to minimize the script interface, and keep it
> backward-compatible with crm_mon external scripts. We didn't want to add
> an OCF-style layer of meta-data, actions and return codes, instead
> keeping it as simple as possible for anyone writing one.
>
> Since it's a brand new feature, we definitely want feedback on all
> aspects once it's in actual use. If alert script failures turn out to
> be a big issue, I could see maybe reporting them in cluster status (and
> allowing that to be cleaned up).
Re: [ClusterLabs] Pacemaker 1.1.15 - Release Candidate 4
On 06/12/2016 02:28 PM, Ferenc Wágner wrote:
> Ken Gaillot writes:
>
>> With this release candidate, we now provide three sample alert scripts
>> to use with the new alerts feature, installed in the
>> /usr/share/pacemaker/alerts directory.
>
> Hi,
>
> Is there a real reason to name these scripts *.sample? Sure, they are
> samples, but they are also usable as-is, aren't they?

The focus is to show how the alert feature can be used, not smooth and generic usability. There might be use cases where they are useful as-is, but users are still encouraged to have a look at the code and modify it as needed. As the interface stabilizes, it might turn out useful to spin them out like the fence- and resource-agents.

>> The ./configure script has a new "--with-configdir" option.
>
> This greatly simplifies packaging, thanks much!
>
> Speaking about packaging: are the alert scripts run by remote Pacemaker
> nodes? I couldn't find described which nodes run the alert scripts.
> From the mailing list discussions I recall they are run by each node,
> but this would be useful to spell out in the documentation, I think.

The alert scripts are called by crmd, so no alert scripts run on remote nodes... I had planned to add a few lines about the environment and conditions alert scripts are run under, so that somebody writing one can get a feeling for what to expect and what to be careful with.

> Similarly for the alert guarantees: I recall there's no such thing, but
> one could also think they are parts of transactions, thus having recovery
> behavior similar to the resource operations. Hmm... wouldn't such a
> design actually make sense?

One of the goals of the design was to make it as simple as possible and to depend on as little as possible (e.g. spawn all alerts in separate processes in parallel as they occur - not 100% fire-and-forget, but quite close ...).
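As a sketch of the environment such a script runs in: crmd exports event details to the script as CRM_alert_* environment variables before spawning it. The variables used below are examples; treat the exact set as version-dependent and consult the documentation for the authoritative list. A minimal handler might look like:

```shell
#!/bin/sh
# Minimal alert-script sketch. crmd sets CRM_alert_* variables in the
# script's environment; the three read here are illustrative examples.
handle_alert() {
    kind=${CRM_alert_kind:-unknown}    # e.g. node / fencing / resource event
    node=${CRM_alert_node:-unknown}    # node the event concerns
    desc=${CRM_alert_desc:-}           # human-readable description, if any
    echo "alert: kind=$kind node=$node desc=$desc"
}
handle_alert
```

Since the scripts are spawned in parallel and close to fire-and-forget, a handler like this should return quickly and must not assume any ordering between events.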
Re: [ClusterLabs] Pacemaker 1.1.15 - Release Candidate 4
On 06/12/2016 07:28 AM, Ferenc Wágner wrote:
> Ken Gaillot writes:
>
>> With this release candidate, we now provide three sample alert scripts
>> to use with the new alerts feature, installed in the
>> /usr/share/pacemaker/alerts directory.
>
> Hi,
>
> Is there a real reason to name these scripts *.sample? Sure, they are
> samples, but they are also usable as-is, aren't they?

Almost as-is -- copy them somewhere, rename them without ".sample", and mark them executable.

After some discussion, we decided that this feature is not mature enough yet to provide the scripts for direct use. After we get some experience with how users actually use the feature and the sample scripts, we can gain more confidence in recommending them generally. Until then, we recommend that people examine the script source and edit it to suit their needs before using it.

That said, I think the SNMP script in particular is quite useful.

The log-to-file script is more a proof-of-concept that people can use as a template. The SMTP script may be useful, but probably paired with some custom software handling the recipient address, to avoid flooding a real person's mailbox when a cluster is active.

>> The ./configure script has a new "--with-configdir" option.
>
> This greatly simplifies packaging, thanks much!
>
> Speaking about packaging: are the alert scripts run by remote Pacemaker
> nodes? I couldn't find described which nodes run the alert scripts.
> From the mailing list discussions I recall they are run by each node,
> but this would be useful to spell out in the documentation, I think.

Good point. Alert scripts are run only on cluster nodes, but they include remote node events. I'll make sure the documentation mentions that.

> Similarly for the alert guarantees: I recall there's no such thing, but
> one could also think they are parts of transactions, thus having recovery
> behavior similar to the resource operations. Hmm... wouldn't such a
> design actually make sense?

We didn't want to make any cluster operation depend on alert script success. The only thing we can guarantee is that the cluster will try to call the alert script for each event. But if the system is going haywire, for example, we may be unable to spawn a new process due to some resource exhaustion, and of course the script itself may have problems.

Also, we wanted to minimize the script interface and keep it backward-compatible with crm_mon external scripts. We didn't want to add an OCF-style layer of meta-data, actions and return codes, instead keeping it as simple as possible for anyone writing one.

Since it's a brand new feature, we definitely want feedback on all aspects once it's in actual use. If alert script failures turn out to be a big issue, I could see maybe reporting them in cluster status (and allowing that to be cleaned up).
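The "almost as-is" steps above can be sketched as shell commands. Temporary directories stand in for the real locations so the sketch is runnable anywhere; the sample filename and destination are illustrative:

```shell
#!/bin/sh
# Sketch of the install steps described above. Temp dirs stand in for
# /usr/share/pacemaker/alerts and wherever you keep local scripts.
src=$(mktemp -d)
dst=$(mktemp -d)

# pretend this is one of the shipped samples
printf '#!/bin/sh\necho "alert handled"\n' > "$src/alert_smtp.sh.sample"

# 1. copy it somewhere, 2. rename it without ".sample", 3. mark it executable
cp "$src/alert_smtp.sh.sample" "$dst/alert_smtp.sh"
chmod +x "$dst/alert_smtp.sh"

"$dst/alert_smtp.sh"
```

The script then still has to be registered in the cluster configuration's alerts section; the exact tooling syntax for that varies by distribution and version.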
Re: [ClusterLabs] Pacemaker 1.1.15 - Release Candidate 4
Ken Gaillot writes:

> With this release candidate, we now provide three sample alert scripts
> to use with the new alerts feature, installed in the
> /usr/share/pacemaker/alerts directory.

Hi,

Is there a real reason to name these scripts *.sample? Sure, they are samples, but they are also usable as-is, aren't they?

> The ./configure script has a new "--with-configdir" option.

This greatly simplifies packaging, thanks much!

Speaking about packaging: are the alert scripts run by remote Pacemaker nodes? I couldn't find described which nodes run the alert scripts. From the mailing list discussions I recall they are run by each node, but this would be useful to spell out in the documentation, I think.

Similarly for the alert guarantees: I recall there's no such thing, but one could also think they are parts of transactions, thus having recovery behavior similar to the resource operations. Hmm... wouldn't such a design actually make sense?
--
Thanks,
Feri
[ClusterLabs] [corosync][Problem] Very long "pause detect ... " was detected.
Hi All,

Our user built a cluster with corosync and Pacemaker in the following environment. The cluster was formed between guests.

* Host/Guest : RHEL6.6
  - kernel : 2.6.32-504.el6.x86_64
* libqb 0.17.1
* corosync 2.3.4
* Pacemaker 1.1.12

The cluster worked well. When a user stopped the active guest, the following log was output repeatedly on the standby guest.

May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5515870 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5515920 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5515971 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516021 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516071 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516121 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516171 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516221 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516271 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516322 ms, flushing membership messages.
May xx xx:25:53 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5516372 ms, flushing membership messages.
(snip)
May xx xx:26:03 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5526172 ms, flushing membership messages.
May xx xx:26:03 standby-guest corosync[6311]: [MAIN ] Totem is unable to form a cluster because of an operating system or network fault. The most common cause of this message is that the local firewall is configured improperly.
May xx xx:26:03 standby-guest corosync[6311]: [TOTEM ] Process pause detected for 5526222 ms, flushing membership messages.
(snip)

As a result, the standby guest failed to form an independent cluster. The log reads as if a timer stopped for 91 minutes, which is an abnormally long time. Did you see a similar problem?

Possibly it is a libqb, kernel, or some other kind of problem.

* I suspect that setting the timer failed in reset_pause_timeout().

Best Regards,
Hideo Yamauchi.