Re: [Pacemaker] Pacemaker/corosync freeze

2014-09-04 Thread Sreenivasa
Hi Attila,

Did you try compiling libqb 0.17.0? Wondering if that solved your issue?
I also have the same issue. Please let me know if you have already solved it.

Thanks
Sreenivasa 
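
(For reference, a from-source build of libqb at the time looked roughly like
this; the download URL and configure options are assumptions, not steps
confirmed in this thread:)

    # minimal sketch, assuming a Debian/Ubuntu host with gcc, make and
    # the autotools already installed
    wget https://github.com/ClusterLabs/libqb/archive/v0.17.0.tar.gz
    tar xzf v0.17.0.tar.gz
    cd libqb-0.17.0
    ./autogen.sh                # generates ./configure
    ./configure --prefix=/usr
    make
    sudo make install
    sudo ldconfig               # refresh the linker cache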


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-18 Thread Attila Megyeri
Hello,

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 18, 2014 2:43 AM
 To: Attila Megyeri
 Cc: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 13 Mar 2014, at 11:44 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
  Hello,
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Thursday, March 13, 2014 10:03 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
 
  Also can you please try to set debug: on in corosync.conf and
  paste full corosync.log then?
 
  I set debug to on, and did a few restarts but could not reproduce
  the issue
  yet - will post the logs as soon as I manage to reproduce.
 
 
  Perfect.
 
  Another option you can try to set is netmtu (1200 is usually safe).
 
  Finally I was able to reproduce the issue.
  I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
  (not when the node was up again).
 
  The corosync log with debug on is available at:
  http://pastebin.com/kTpDqqtm
 
 
  To be honest, I had to wait much longer for this reproduction than
  before, even though there was no change in the corosync configuration -
  just potentially some system updates. But anyway, the issue is
  unfortunately still there.
  Previously, when this issue occurred, CPU was at 100% on all nodes -
  this time only on ctmgr, which was the DC...
 
  I hope you can find some useful details in the log.
 
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run crm_verify
 -L
  to identify issues.
 
  I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.
 
  Perhaps Andrew could comment on that. Any idea?
 
 Did you run the command?  What did it say?

Yes, all was fine. This is why I found it strange.
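
(For reference, the check referred to here validates the live CIB; a minimal
invocation, with -V added for verbose detail:)

    # validate the running cluster configuration (live CIB)
    crm_verify -L -V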





Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-18 Thread Andrew Beekhof

On 18 Mar 2014, at 6:03 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

 Hello,
 
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 18, 2014 2:43 AM
 To: Attila Megyeri
 Cc: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 13 Mar 2014, at 11:44 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 Hello,
 
 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Thursday, March 13, 2014 10:03 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 ...
 
 Attila,
 what seems to be interesting is
 
 Configuration ERRORs found during PE processing.  Please run crm_verify
 -L
 to identify issues.
 
  I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.
 
 Perhaps Andrew could comment on that. Any idea?
 
 Did you run the command?  What did it say?
 
 Yes, all was fine. This is why I found it strange.

If you still have /var/lib/pacemaker/pengine/pe-error-7.bz2 from ctdb2, then I 
should be able to figure out what it was complaining about.
(You can also run: crm_verify --xml-file 
/var/lib/pacemaker/pengine/pe-error-7.bz2 -V )




Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-18 Thread Attila Megyeri
Hi Andrew,


 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 18, 2014 11:40 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 18 Mar 2014, at 6:03 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
  Hello,
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 18, 2014 2:43 AM
  To: Attila Megyeri
  Cc: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 13 Mar 2014, at 11:44 pm, Attila Megyeri amegyeri@minerva-
 soft.com
  wrote:
 
  Hello,
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Thursday, March 13, 2014 10:03 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run
 crm_verify
  -L
  to identify issues.
 
   I'm unsure how much of a problem this is, but I'm really not a
  pacemaker expert.
 
  Perhaps Andrew could comment on that. Any idea?
 
  Did you run the command?  What did it say?
 
  Yes, all was fine. This is why I found it strange.
 
 If you still have /var/lib/pacemaker/pengine/pe-error-7.bz2 from ctdb2, then
 I should be able to figure out what it was complaining about.
 (You can also run: crm_verify --xml-file /var/lib/pacemaker/pengine/pe-
 error-7.bz2 -V )

The file is still there; the crm_verify check succeeds (exit code 0) with no
output. The file is full of confidential data, but if you think you can find
something useful in it I can send it in a direct mail.

thanks!







Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-17 Thread Attila Megyeri
Hi David, Jan,

For the time being, corosync 2.3.3 with libqb 0.17.0 looks stable, with both
built from source.
Thank you very much for the guidance!

Attila
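
(A rough sketch of the corosync half of that build, assuming libqb 0.17.0 is
already installed under /usr as discussed earlier; the URL and configure
options are assumptions:)

    wget https://github.com/corosync/corosync/archive/v2.3.3.tar.gz
    tar xzf v2.3.3.tar.gz
    cd corosync-2.3.3
    ./autogen.sh
    ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
    make
    sudo make install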

 -Original Message-
 From: David Vossel [mailto:dvos...@redhat.com]
 Sent: Thursday, March 13, 2014 9:22 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 
 
 
 - Original Message -
  From: Jan Friesse jfrie...@redhat.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Thursday, March 13, 2014 4:03:28 AM
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
  
   Also can you please try to set debug: on in corosync.conf and
   paste full corosync.log then?
  
   I set debug to on, and did a few restarts but could not reproduce
   the issue
   yet - will post the logs as soon as I manage to reproduce.
  
  
   Perfect.
  
   Another option you can try to set is netmtu (1200 is usually safe).
  
   Finally I was able to reproduce the issue.
    I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
    (not when the node was up again).
  
   The corosync log with debug on is available at:
   http://pastebin.com/kTpDqqtm
  
  
    To be honest, I had to wait much longer for this reproduction than
    before, even though there was no change in the corosync
    configuration - just potentially some system updates. But anyway,
    the issue is unfortunately still there.
    Previously, when this issue occurred, CPU was at 100% on all nodes -
    this time only on ctmgr, which was the DC...
  
   I hope you can find some useful details in the log.
  
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run
  crm_verify -L to identify issues.
 
   I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.
  
   Anyway, I have a theory about what may be happening, and it looks
   related to IPC (and probably not to the network). But to make sure we
   are not trying to fix an already-fixed bug, can you please build:
  - New libqb (0.17.0). There are plenty of fixes in IPC
  - Corosync 2.3.3 (already plenty IPC fixes)
 
 yes, there was a libqb/corosync interoperation problem that showed these
 same symptoms last year. Updating to the latest corosync and libqb will likely
 resolve this.
 
  - And maybe also newer pacemaker
 
  I know you were not very happy using hand-compiled sources, but please
  give them at least a try.
 
  Thanks,
Honza
 
   Thanks,
   Attila
  
  
  
  
   Regards,
 Honza
  
  
   There are also a few things that might or might not be related:
  
   1) Whenever I want to edit the configuration with crm configure
   edit,
 
  ...
 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-17 Thread Andrew Beekhof

On 13 Mar 2014, at 11:44 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

 Hello,
 
 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Thursday, March 13, 2014 10:03 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 ...
 
 
 Also can you please try to set debug: on in corosync.conf and paste
 full corosync.log then?
 
 I set debug to on, and did a few restarts but could not reproduce
 the issue
 yet - will post the logs as soon as I manage to reproduce.
 
 
 Perfect.
 
 Another option you can try to set is netmtu (1200 is usually safe).
 
 Finally I was able to reproduce the issue.
 I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately (not
 when the node was up again).
 
 The corosync log with debug on is available at:
 http://pastebin.com/kTpDqqtm
 
 
 To be honest, I had to wait much longer for this reproduction than before,
 even though there was no change in the corosync configuration - just
 potentially some system updates. But anyway, the issue is unfortunately still
 there.
 Previously, when this issue occurred, CPU was at 100% on all nodes - this time
 only on ctmgr, which was the DC...
 
 I hope you can find some useful details in the log.
 
 
 Attila,
 what seems to be interesting is
 
 Configuration ERRORs found during PE processing.  Please run crm_verify -L
 to identify issues.
 
  I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.
 
 Perhaps Andrew could comment on that. Any idea?

Did you run the command?  What did it say?





Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-14 Thread Attila Megyeri
Hello David,


 -Original Message-
 From: David Vossel [mailto:dvos...@redhat.com]
 Sent: Thursday, March 13, 2014 9:22 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 
 
 
 - Original Message -
  From: Jan Friesse jfrie...@redhat.com
  To: The Pacemaker cluster resource manager
  pacemaker@oss.clusterlabs.org
  Sent: Thursday, March 13, 2014 4:03:28 AM
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
  
   Also can you please try to set debug: on in corosync.conf and
   paste full corosync.log then?
  
   I set debug to on, and did a few restarts but could not reproduce
   the issue
   yet - will post the logs as soon as I manage to reproduce.
  
  
   Perfect.
  
   Another option you can try to set is netmtu (1200 is usually safe).
  
   Finally I was able to reproduce the issue.
    I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
    (not when the node was up again).
  
   The corosync log with debug on is available at:
   http://pastebin.com/kTpDqqtm
  
  
    To be honest, I had to wait much longer for this reproduction than
    before, even though there was no change in the corosync
    configuration - just potentially some system updates. But anyway,
    the issue is unfortunately still there.
    Previously, when this issue occurred, CPU was at 100% on all nodes -
    this time only on ctmgr, which was the DC...
  
   I hope you can find some useful details in the log.
  
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run
  crm_verify -L to identify issues.
 
   I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.
  
   Anyway, I have a theory about what may be happening, and it looks
   related to IPC (and probably not to the network). But to make sure we
   are not trying to fix an already-fixed bug, can you please build:
  - New libqb (0.17.0). There are plenty of fixes in IPC
  - Corosync 2.3.3 (already plenty IPC fixes)
 
 yes, there was a libqb/corosync interoperation problem that showed these
 same symptoms last year. Updating to the latest corosync and libqb will likely
 resolve this.

I have upgraded all nodes to these versions and we are testing. So far no issues.
Thank you very much for your help.

Regards,
Attila
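
(A quick way to confirm what each node is actually running after such an
upgrade; a sketch, the grep pattern being an assumption:)

    corosync -v                          # print the corosync version
    ldd $(which corosync) | grep libqb   # show the libqb it links against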





 
  - And maybe also newer pacemaker
 
  I know you were not very happy using hand-compiled sources, but please
  give them at least a try.
 
  Thanks,
Honza
 
   Thanks,
   Attila
  
  
  
  
   Regards,
 Honza
  
  
   There are also a few things that might or might not be related:
  
   1) Whenever I want to edit the configuration with crm configure
   edit,
 
  ...
 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-14 Thread Jan Friesse
Attila Megyeri napsal(a):
 Hi Honza,
 
 What I also found in the log related to the freeze at 12:22:26:
 
 
 Corosync main process was not scheduled for  ... Can it be the general
 cause of the issue?
 
 
 
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:58597-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:47943-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:47943-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:59647-[10.9.1.3]:161
 
 
 Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was 
 not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token 
 timeout increase.
 
 
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the 
 OPERATIONAL state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming 
 new configuration.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 
 2(The token was lost in the OPERATIONAL state.).
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token 
 because I am the rep.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high 
 seq received 6a8c
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for 
 ring 7dc
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:
 
 
 
 
 Regards,
 Attila
 
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 1:45 PM
 To: The Pacemaker cluster resource manager; Andrew Beekhof
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Hello,

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Thursday, March 13, 2014 10:03 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 ...


 Also can you please try to set debug: on in corosync.conf and
 paste full corosync.log then?

 I set debug to on, and did a few restarts but could not
 reproduce the issue
 yet - will post the logs as soon as I manage to reproduce.


 Perfect.

 Another option you can try to set is netmtu (1200 is usually safe).

 Finally I was able to reproduce the issue.
 I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
 (not when the node was up again).

 The corosync log with debug on is available at:
 http://pastebin.com/kTpDqqtm


 To be honest, I had to wait much longer for this reproduction than
 before, even though there was no change in the corosync configuration -
 just potentially some system updates. But anyway, the issue is
 unfortunately still there.
 Previously, when this issue occurred, CPU was at 100% on all nodes -
 this time only on ctmgr, which was the DC...

 I hope you can find some useful details in the log.


 Attila,
 what seems to be interesting is

 Configuration ERRORs found during PE processing.  Please run
 crm_verify -
 L
 to identify issues.

 I'm unsure how much of a problem this is, but I'm really not a
 pacemaker expert.

 Perhaps Andrew could comment on that. Any idea?



 Anyway, I have a theory about what may be happening, and it looks
 related to IPC (and probably not to the network). But to make sure we
 are not trying to fix an already-fixed bug, can you please build:
 - New libqb (0.17.0). There are plenty of fixes in IPC
 - Corosync 2.3.3 (already plenty IPC fixes)
 - And maybe also newer pacemaker


 I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
 from Ubuntu package.
 I am currently building libqb 0.17.0, will update you on the results.

 In the meantime we had another freeze, which did not seem to be
 related to any restarts, but brought all corosync processes to 100%.
 Please check out the corosync.log, perhaps it is a different cause:
 http://pastebin.com/WMwzv0Rr


 In the meantime I will install the new libqb and send logs if we have
 further issues.

 Thank you very much for your help!

 Regards,
 Attila


 One more question:

 If I install libqb 0.17.0 from

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-14 Thread Jan Friesse
Attila Megyeri napsal(a):
 Hi Honza,
 
 What I also found in the log related to the freeze at 12:22:26:
 
 
 Corosync main process was not scheduled for  ... Can it be the general
 cause of the issue?
 

I don't think it will cause the issue you are hitting, BUT keep in mind that
if corosync is not scheduled for a long time, the node will probably be
fenced by another node. So increasing the timeout is vital.

Honza
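
(For reference, the timeout Honza refers to is the token timeout in the totem
section of corosync.conf. With a 6.3 s scheduling pause against a 4000 ms
threshold, something on this order would be needed; the 10000 ms figure is
only an illustration, not a value recommended in this thread:)

    totem {
        # token timeout in milliseconds; must comfortably exceed the
        # longest scheduling pause a node may experience
        token: 10000
    }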

 
 
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:58597-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:47943-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:47943-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:59647-[10.9.1.3]:161
 
 
 Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was 
 not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token 
 timeout increase.
 
 
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the 
 OPERATIONAL state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming 
 new configuration.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 
 2(The token was lost in the OPERATIONAL state.).
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token 
 because I am the rep.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high 
 seq received 6a8c
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for 
 ring 7dc
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:
 
 
 
 
 Regards,
 Attila
 
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 1:45 PM
 To: The Pacemaker cluster resource manager; Andrew Beekhof
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Hello,

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Thursday, March 13, 2014 10:03 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 ...


 Also can you please try to set debug: on in corosync.conf and
 paste full corosync.log then?

 I set debug to on, and did a few restarts but could not
 reproduce the issue
 yet - will post the logs as soon as I manage to reproduce.


 Perfect.

 Another option you can try to set is netmtu (1200 is usually safe).

 Finally I was able to reproduce the issue.
 I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
 (not when the node was up again).

 The corosync log with debug on is available at:
 http://pastebin.com/kTpDqqtm


 To be honest, I had to wait much longer for this reproduction than
 before, even though there was no change in the corosync configuration -
 just potentially some system updates. But anyway, the issue is
 unfortunately still there.
 Previously, when this issue occurred, CPU was at 100% on all nodes -
 this time only on ctmgr, which was the DC...

 I hope you can find some useful details in the log.


 Attila,
 what seems to be interesting is

 Configuration ERRORs found during PE processing.  Please run
 crm_verify -
 L
 to identify issues.

 I'm unsure how much of a problem this is, but I'm really not a
 pacemaker expert.

 Perhaps Andrew could comment on that. Any idea?



 Anyway, I have a theory about what may be happening, and it looks
 related to IPC (and probably not to the network). But to make sure we
 are not trying to fix an already-fixed bug, can you please build:
 - New libqb (0.17.0). There are plenty of fixes in IPC
 - Corosync 2.3.3 (already plenty IPC fixes)
 - And maybe also newer pacemaker


 I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
 from Ubuntu package.
 I am currently building libqb 0.17.0, will update you on the results.

 In the meantime we had another freeze, which did not seem to be
 related to any restarts, but brought all corosync processes to 100%.
 Please check out the corosync.log, perhaps it is a different cause:
 http://pastebin.com/WMwzv0Rr

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Jan Friesse
...


 Also can you please try to set debug: on in corosync.conf and paste
 full corosync.log then?

 I set debug to on, and did a few restarts but could not reproduce the issue
 yet - will post the logs as soon as I manage to reproduce.


 Perfect.

 Another option you can try to set is netmtu (1200 is usually safe).
 
 Finally I was able to reproduce the issue.
 I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately (not
 when the node was up again).
 
 The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm
 
 
 To be honest, I had to wait much longer for this reproduction than before,
 even though there was no change in the corosync configuration - just
 potentially some system updates. But anyway, the issue is unfortunately still there.
 Previously, when this issue occurred, CPU was at 100% on all nodes - this time
 only on ctmgr, which was the DC...
 
 I hope you can find some useful details in the log.
 

Attila,
what seems to be interesting is

Configuration ERRORs found during PE processing.  Please run crm_verify
-L to identify issues.

I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.

Anyway, I have a theory about what may be happening, and it looks related to
IPC (and probably not to the network). But to make sure we are not trying to
fix an already-fixed bug, can you please build:
- New libqb (0.17.0). There are plenty of fixes in IPC
- Corosync 2.3.3 (already plenty IPC fixes)
- And maybe also newer pacemaker

I know you were not very happy using hand-compiled sources, but please
give them at least a try.

Thanks,
  Honza

 Thanks,
 Attila
 
 
 

 Regards,
   Honza


 There are also a few things that might or might not be related:

 1) Whenever I want to edit the configuration with crm configure edit,

...
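
(Both suggestions quoted above - debug logging and netmtu - are plain
corosync.conf settings; a sketch, with the values as discussed in the thread:)

    logging {
        debug: on          # full debug logging, as requested above
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
    }

    totem {
        netmtu: 1200       # "1200 is usually safe", per Honza
    }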



Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Attila Megyeri
Hello,

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Thursday, March 13, 2014 10:03 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 ...
 
 
  Also can you please try to set debug: on in corosync.conf and paste
  full corosync.log then?
 
  I set debug to on, and did a few restarts but could not reproduce
  the issue
  yet - will post the logs as soon as I manage to reproduce.
 
 
  Perfect.
 
  Another option you can try to set is netmtu (1200 is usually safe).
 
  Finally I was able to reproduce the issue.
  I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately (not
 when the node was up again).
 
  The corosync log with debug on is available at:
  http://pastebin.com/kTpDqqtm
 
 
  To be honest, I had to wait much longer for this reproduction than before,
 even though there was no change in the corosync configuration - just
 potentially some system updates. But anyway, the issue is unfortunately still
 there.
  Previously, when this issue occurred, CPU was at 100% on all nodes - this time
 only on ctmgr, which was the DC...
 
  I hope you can find some useful details in the log.
 
 
 Attila,
 what seems to be interesting is
 
 Configuration ERRORs found during PE processing.  Please run crm_verify -L
 to identify issues.
 
 I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.

Perhaps Andrew could comment on that. Any idea?


 
 Anyway, I have a theory about what may be happening, and it looks related to
 IPC (and probably not to the network). But to make sure we are not trying to
 fix an already-fixed bug, can you please build:
 - New libqb (0.17.0). There are plenty of fixes in IPC
 - Corosync 2.3.3 (already plenty IPC fixes)
 - And maybe also newer pacemaker
 

I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from Ubuntu 
package.
I am currently building libqb 0.17.0, will update you on the results.

In the meantime we had another freeze, which did not seem to be related to any 
restarts, but brought all corosync processes to 100%.
Please check out the corosync.log, perhaps it is a different cause: 
http://pastebin.com/WMwzv0Rr 


In the meantime I will install the new libqb and send logs if we have further 
issues.

Thank you very much for your help!

Regards,
Attila



 I know you were not very happy using hand-compiled sources, but please
 give them at least a try.
 
 Thanks,
   Honza
 
  Thanks,
  Attila
 
 
 
 
  Regards,
Honza
 
 
  There are also a few things that might or might not be related:
 
  1) Whenever I want to edit the configuration with crm configure
  edit,
 
 ...
 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Attila Megyeri

 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 1:45 PM
 To: The Pacemaker cluster resource manager; Andrew Beekhof
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 Hello,
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Thursday, March 13, 2014 10:03 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  ...
 
  
   Also can you please try to set debug: on in corosync.conf and
   paste full corosync.log then?
  
   I set debug to on, and did a few restarts but could not reproduce
   the issue
   yet - will post the logs as soon as I manage to reproduce.
  
  
   Perfect.
  
   Another option you can try to set is netmtu (1200 is usually safe).
  
   Finally I was able to reproduce the issue.
    I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
   (not when the node was up again).
  
   The corosync log with debug on is available at:
   http://pastebin.com/kTpDqqtm
  
  
    To be honest, I had to wait much longer for this reproduction than
   before, even though there was no change in the corosync configuration -
   just potentially some system updates. But anyway, the issue is
   unfortunately still there.
    Previously, when this issue occurred, CPU was at 100% on all nodes -
   this time only on ctmgr, which was the DC...
  
   I hope you can find some useful details in the log.
  
 
  Attila,
  what seems to be interesting is
 
  Configuration ERRORs found during PE processing.  Please run crm_verify -
 L
  to identify issues.
 
   I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.
  
  Perhaps Andrew could comment on that. Any idea?
  
  
  
   Anyway, I have a theory about what may be happening, and it looks
   related to IPC (and probably not to the network). But to make sure we
   are not trying to fix an already-fixed bug, can you please build:
  - New libqb (0.17.0). There are plenty of fixes in IPC
  - Corosync 2.3.3 (already plenty IPC fixes)
  - And maybe also newer pacemaker
 
 
 I already use Corosync 2.3.3, built from source, and libqb-dev 0.16 from
 Ubuntu package.
 I am currently building libqb 0.17.0, will update you on the results.
 
 In the meantime we had another freeze, which did not seem to be related to
  any restarts, but brought all corosync processes to 100%.
 Please check out the corosync.log, perhaps it is a different cause:
 http://pastebin.com/WMwzv0Rr
 
 
 In the meantime I will install the new libqb and send logs if we have further
 issues.
 
 Thank you very much for your help!
 
 Regards,
 Attila
 

One more question:

If I install libqb 0.17.0 from source, do I need to rebuild corosync as well,
or will it be fine if it was built against libqb 0.16.0?

BTW, in the meantime I installed the new libqb on 3 of the 7 hosts, so I can 
see if it makes a difference. If I see crashes on the outdated ones, but not on 
the new ones, we are fine. :)

Thanks,

Attila
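
(Whether a rebuild is needed mostly depends on whether the libqb soname
changed between 0.16.0 and 0.17.0; one way to check, as a sketch with assumed
install paths:)

    # the libqb the corosync binary resolves at runtime
    ldd $(which corosync) | grep qb
    # the soname of the freshly installed library
    objdump -p /usr/lib/libqb.so | grep SONAME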







 
 
  I know you were not very happy using hand-compiled sources, but please
  give them at least a try.
 
  Thanks,
Honza
 
   Thanks,
   Attila
  
  
  
  
   Regards,
 Honza
  
  
   There are also a few things that might or might not be related:
  
   1) Whenever I want to edit the configuration with crm configure
   edit,
 
  ...
 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Attila Megyeri
Hi Honza,

What I also found in the log related to the freeze at 12:22:26:


Corosync main process was not scheduled for  ... Can it be the general
cause of the issue?



Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:58597-[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:47943-[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:47943-[10.9.1.3]:161
Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
[10.9.1.5]:59647-[10.9.1.3]:161


Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was not 
scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token timeout 
increase.


Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the 
OPERATIONAL state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming 
new configuration.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 
2(The token was lost in the OPERATIONAL state.).
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token because 
I am the rep.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high seq 
received 6a8c
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for 
ring 7dc
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:




Regards,
Attila

 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  -Original Message-
  From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
  Sent: Thursday, March 13, 2014 1:45 PM
  To: The Pacemaker cluster resource manager; Andrew Beekhof
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
  Hello,
 
   -Original Message-
   From: Jan Friesse [mailto:jfrie...@redhat.com]
   Sent: Thursday, March 13, 2014 10:03 AM
   To: The Pacemaker cluster resource manager
   Subject: Re: [Pacemaker] Pacemaker/corosync freeze
  
   ...
  
   
Also can you please try to set debug: on in corosync.conf and
paste full corosync.log then?
   
I set debug to on, and did a few restarts but could not
reproduce the issue
yet - will post the logs as soon as I manage to reproduce.
   
   
Perfect.
   
Another option you can try to set is netmtu (1200 is usually safe).
   
Finally I was able to reproduce the issue.
I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately
(not when the node was up again).
   
The corosync log with debug on is available at:
http://pastebin.com/kTpDqqtm
   
   
To be honest, I had to wait much longer for this reproduction than
before,
   even though there was no change in the corosync configuration - just
   potentially some system updates. But anyway, the issue is
   unfortunately still there.
Previously, when this issue occurred, CPU was at 100% on all nodes -
this time
   only on ctmgr, which was the DC...
   
I hope you can find some useful details in the log.
   
  
   Attila,
   what seems to be interesting is
  
   Configuration ERRORs found during PE processing.  Please run
   crm_verify -
  L
   to identify issues.
  
    I'm unsure how much of a problem this is, but I'm really not a
  pacemaker expert.
  
   Perhaps Andrew could comment on that. Any idea?
  
  
   
    Anyway, I have a theory about what may be happening, and it looks
    related to IPC (and probably not to the network). But to make sure we
    are not trying to fix an already-fixed bug, can you please build:
   - New libqb (0.17.0). There are plenty of fixes in IPC
   - Corosync 2.3.3 (already plenty IPC fixes)
   - And maybe also newer pacemaker
  
 
  I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
  from Ubuntu package.
  I am currently building libqb 0.17.0, will update you on the results.
 
  In the meantime we had another freeze, which did not seem to be
  related to any restarts, but brought all coroync processes to 100%.
  Please check out the corosync.log, perhaps it is a different cause:
  http://pastebin.com/WMwzv0Rr
 
 
  In the meantime I will install the new libqb and send logs if we have
  further issues.
 
  Thank you very much

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread David Vossel




- Original Message -
 From: Jan Friesse jfrie...@redhat.com
 To: The Pacemaker cluster resource manager pacemaker@oss.clusterlabs.org
 Sent: Thursday, March 13, 2014 4:03:28 AM
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 ...
 
 
  Also can you please try to set debug: on in corosync.conf and paste
  full corosync.log then?
 
  I set debug to on, and did a few restarts but could not reproduce the
  issue
  yet - will post the logs as soon as I manage to reproduce.
 
 
  Perfect.
 
  Another option you can try to set is netmtu (1200 is usually safe).
  
  Finally I was able to reproduce the issue.
  I restarted node ctsip2 at 21:10:14, and CPU went to 100% immediately (not
  when the node was up again).
  
  The corosync log with debug on is available at:
  http://pastebin.com/kTpDqqtm
  
  
  To be honest, I had to wait much longer for this reproduction than before,
  even though there was no change in the corosync configuration - just
  potentially some system updates. But anyway, the issue is unfortunately
  still there.
  Previously, when this issue occurred, CPU was at 100% on all nodes - this time
  only on ctmgr, which was the DC...
  
  I hope you can find some useful details in the log.
  
 
 Attila,
 what seems to be interesting is
 
 Configuration ERRORs found during PE processing.  Please run crm_verify
 -L to identify issues.
 
 I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.
 
 Anyway, I have a theory about what may be happening, and it looks related to
 IPC (and probably not to the network). But to make sure we are not trying to
 fix an already-fixed bug, can you please build:
 - New libqb (0.17.0). There are plenty of fixes in IPC
 - Corosync 2.3.3 (already plenty IPC fixes)

yes, there was a libqb/corosync interoperation problem that showed these same 
symptoms last year. Updating to the latest corosync and libqb will likely 
resolve this.

 - And maybe also newer pacemaker
 
 I know you were not very happy using hand-compiled sources, but please
 give them at least a try.
 
 Thanks,
   Honza
 
  Thanks,
  Attila
  
  
  
 
  Regards,
Honza
 
 
  There are also a few things that might or might not be related:
 
  1) Whenever I want to edit the configuration with crm configure edit,
 
 ...
 


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 12:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com
  wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and suddenly the
  crm_mon output remains static.
  When I check the CPU usage, I see that one of the cores is at 100%,
  but I cannot actually match it to either corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then start pacemaker. Stopping pacemaker and corosync does
  not work in most cases; usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
  Sent 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0
 CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0
 CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
  Sent 0
 CPG
  messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly
  confused
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
  Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install from
  sources,
  but that would be very difficult to maintain and I'm not sure I would
  get rid of this issue.
 
  What do you recommend?
 
  The same thing as Lars, or switch to a distro that stays current with
  upstream (git shows 5 newer releases for that branch since it was
  released 3 years ago).
  If you do build from source, it's probably best to go with v1.4.6
 
  Hm, I am a bit confused here. We are using 2.3.0,
 
 I swapped the 2 for a 1 somehow. A bit distracted, sorry.

I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but the
issue is still the same - after some time CPU gets to 100%, and the corosync
log is flooded with messages like:

Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 
CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 0 
CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 
CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 0 
CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 
CPG messages  (48 remaining, last=3671): Try again (6)
Mar 12 07:36:57 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 0 
CPG messages  (51 remaining, last=3995): Try again (6)
Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 0 
CPG messages  (48 remaining, last=3671): Try again (6)


Shall I try to downgrade to 1.4.6? What is the difference in that build? Or 
where should I start troubleshooting?

Thank you in advance.
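
(For context: the trailing "(6)" in these messages is corosync's
CS_ERR_TRY_AGAIN return code - pacemaker's crm_cs_flush keeps queueing CPG
messages while corosync keeps answering "try again". Two stock corosync 2.x
inspection tools can help narrow this down; the grep pattern is an
assumption:)

    corosync-cpgtool                   # list CPG groups and their members
    corosync-cmapctl | grep -i totem   # dump runtime totem settings/stats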






 
  which was released approx. a year ago (you mention 3

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri napsal(a):
 
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:


 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:

 Thanks for the quick response!

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:31 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Hello,

 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and suddenly the
 crm_mon output remains static.
 When I check the CPU usage, I see that one of the cores is at 100%,
 but I cannot actually match it to either corosync or one of the
 pacemaker processes.

 In such a case, this high CPU usage is happening on all 7 nodes.
 I have to manually go to each node, stop pacemaker, restart
 corosync, then start pacemaker. Stopping pacemaker and corosync does
 not work in most cases; usually a kill -9 is needed.

 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.

 Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.

 Logs are usually flooded with CPG related messages, such as:

 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0
 CPG
 messages  (1 remaining, last=8): Try again (6)

 OR

 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (

 That is usually a symptom of corosync getting into a horribly
 confused
 state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.



 As I wrote I use Ubuntu trusty, the exact package versions are:

 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2

 Ah sorry, I seem to have missed that part.


 There are no updates available. The only option is to install from
 sources,
 but that would be very difficult to maintain and I'm not sure I would
 get rid of this issue.

 What do you recommend?

 The same thing as Lars, or switch to a distro that stays current with
 upstream (git shows 5 newer releases for that branch since it was
 released 3 years ago).
 If you do build from source, it's probably best to go with v1.4.6

 Hm, I am a bit confused here. We are using 2.3.0,

 I swapped the 2 for a 1 somehow. A bit distracted, sorry.
 
 I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but the
 issue is still the same - after some time CPU gets to 100%, and the corosync
 log is flooded with messages like:
 
 Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent 
 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 
 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent 
 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 
 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 
 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:57 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 
 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 
 0 CPG messages  (48 remaining, last=3671): Try again (6)
 
 

Attila,

 Shall I try to downgrade to 1.4.6? What is the difference in that build? Or 
 where should I start troubleshooting?

First of all, the 1.x branch (flatiron) is maintained, so even if it looks like
an old version, it's quite new. It contains more or less only

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
Hello Jan,

Thank you very much for your help so far.

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 9:51 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 Attila Megyeri napsal(a):
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 10:27 PM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 12 Mar 2014, at 1:54 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 12:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:54 pm, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and suddenly
 the
  crm_mon output remains static.
  When I check the CPU usage, I see that one of the cores is at 100%,
  but I cannot actually match it to either corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then start pacemaker. Stopping pacemaker and corosync does
  not work in most cases; usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
Sent
 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
Sent
 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
Sent
 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
Sent
 0
  CPG
  messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
Sent 0
  CPG
  messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly
  confused
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
  Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install from
  sources,
  but that would be very difficult to maintain and I'm not sure I
  would get rid of this issue.
 
  What do you recommend?
 
  The same thing as Lars, or switch to a distro that stays current
  with upstream (git shows 5 newer releases for that branch since it
  was released 3 years ago).
  If you do build from source, it's probably best to go with v1.4.6
 
  Hm, I am a bit confused here. We are using 2.3.0,
 
  I swapped the 2 for a 1 somehow. A bit distracted, sorry.
 
  I upgraded all nodes to 2.3.3 and at first it seemed a bit better, but the
 issue is still the same - after some time CPU gets to 100%, and the corosync
 log is flooded with messages like:
 
  Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:57 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri wrote:
 Hello Jan,
 
 Thank you very much for your help so far.
 
 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 9:51 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

  Attila Megyeri wrote:

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 12 Mar 2014, at 1:54 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:


 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:54 pm, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Thanks for the quick response!

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:31 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Hello,

 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and suddenly
 the
 crm_mon output remains static.
 When I check the cpu usage, I see that one of the cores uses
 100% cpu, but
 cannot actually match it to either the corosync or one of the
 pacemaker processes.

 In such a case, this high CPU usage is happening on all 7 nodes.
 I have to manually go to each node, stop pacemaker, restart
 corosync, then
 start pacemaker. Stopping pacemaker and corosync does not work in most of
 the cases; usually a kill -9 is needed.

 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.

 Using udpu as transport, two rings on Gigabit ETH, rrp_mode
 passive.

 Logs are usually flooded with CPG related messages, such as:

 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)

 OR

 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (

 That is usually a symptom of corosync getting into a horribly
 confused
 state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.



 As I wrote I use Ubuntu trusty, the exact package versions are:

 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2

 Ah sorry, I seem to have missed that part.


 There are no updates available. The only option is to install from
 sources,
 but that would be very difficult to maintain and I'm not sure I
 would get rid of this issue.

 What do you recommend?

 The same thing as Lars, or switch to a distro that stays current
 with upstream (git shows 5 newer releases for that branch since it
 was released 3 years ago).
  If you do build from source, it's probably best to go with v1.4.6

 Hm, I am a bit confused here. We are using 2.3.0,

 I swapped the 2 for a 1 somehow. A bit distracted, sorry.

 I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still 
 the
 same issue - after some time CPU gets to 100%, and the corosync log is
 flooded with messages like:

 Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:57 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri
 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

  Attila Megyeri wrote:
  Hello Jan,
 
  Thank you very much for your help so far.
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Wednesday, March 12, 2014 9:51 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
   Attila Megyeri wrote:
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 10:27 PM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 12 Mar 2014, at 1:54 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 12:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:54 pm, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and
 suddenly
  the
  crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses
  100% cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then
  start pacemaker. Stopping pacemaker and corosync does not work in most of
  the cases; usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rrp_mode
  passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly
  confused
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
  Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install
  from sources,
  but that would be very difficult to maintain and I'm not sure I
  would get rid of this issue.
 
  What do you recommend?
 
  The same thing as Lars, or switch to a distro that stays current
  with upstream (git shows 5 newer releases for that branch since
  it was released 3 years ago).
   If you do build from source, it's probably best to go with v1.4.6
 
  Hm, I am a bit confused here. We are using 2.3.0,
 
  I swapped the 2 for a 1 somehow. A bit distracted, sorry.
 
  I upgraded all nodes to 2.3.3 and first it seemed a bit better, but
  still the
  same issue - after some time CPU gets to 100%, and the corosync log
  is flooded with messages like:
 
  Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:57 [4793] ctdb2cib: info

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri wrote:
 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

  Attila Megyeri wrote:
 Hello Jan,

 Thank you very much for your help so far.

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 9:51 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

  Attila Megyeri wrote:

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 12 Mar 2014, at 1:54 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:


 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:54 pm, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Thanks for the quick response!

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:31 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Hello,

 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and
 suddenly
 the
 crm_mon output remains static.
 When I check the cpu usage, I see that one of the cores uses
 100% cpu, but
 cannot actually match it to either the corosync or one of the
 pacemaker processes.

 In such a case, this high CPU usage is happening on all 7 nodes.
 I have to manually go to each node, stop pacemaker, restart
 corosync, then
 start pacemaker. Stopping pacemaker and corosync does not work in most of
 the cases; usually a kill -9 is needed.

 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.

 Using udpu as transport, two rings on Gigabit ETH, rrp_mode
 passive.

 Logs are usually flooded with CPG related messages, such as:

 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)

 OR

 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (

 That is usually a symptom of corosync getting into a horribly
 confused
 state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.



 As I wrote I use Ubuntu trusty, the exact package versions are:

 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2

 Ah sorry, I seem to have missed that part.


 There are no updates available. The only option is to install
 from sources,
 but that would be very difficult to maintain and I'm not sure I
 would get rid of this issue.

 What do you recommend?

 The same thing as Lars, or switch to a distro that stays current
 with upstream (git shows 5 newer releases for that branch since
 it was released 3 years ago).
  If you do build from source, it's probably best to go with v1.4.6

 Hm, I am a bit confused here. We are using 2.3.0,

 I swapped the 2 for a 1 somehow. A bit distracted, sorry.

 I upgraded all nodes to 2.3.3 and first it seemed a bit better, but
 still the
 same issue - after some time CPU gets to 100%, and the corosync log
 is flooded with messages like:

 Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:57 [4798] ctdb2   crmd

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Attila Megyeri


 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 4:31 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

  Attila Megyeri wrote:
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Wednesday, March 12, 2014 2:27 PM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
   Attila Megyeri wrote:
  Hello Jan,
 
  Thank you very much for your help so far.
 
  -Original Message-
  From: Jan Friesse [mailto:jfrie...@redhat.com]
  Sent: Wednesday, March 12, 2014 9:51 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
   Attila Megyeri wrote:
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 10:27 PM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 12 Mar 2014, at 1:54 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Tuesday, March 11, 2014 12:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:54 pm, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri
  amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and
  suddenly
  the
  crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses
  100% cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then
  start pacemaker. Stopping pacemaker and corosync does not work in most of
  the cases; usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rrp_mode
  passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly
  confused
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
  Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install
  from sources,
  but that would be very difficult to maintain and I'm not sure I
  would get rid of this issue.
 
  What do you recommend?
 
  The same thing as Lars, or switch to a distro that stays
  current with upstream (git shows 5 newer releases for that
  branch since it was released 3 years ago).
   If you do build from source, it's probably best to go with
  v1.4.6
 
  Hm, I am a bit confused here. We are using 2.3.0,
 
  I swapped the 2 for a 1 somehow. A bit distracted, sorry.
 
  I upgraded all nodes to 2.3.3 and first it seemed a bit better,
  but still the
  same issue - after some time CPU gets to 100%, and the corosync log
  is flooded with messages like:
 
  Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:   Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
  Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
  Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-11 Thread Attila Megyeri

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
  Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
  wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and suddenly the
  crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses 100%
  cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart
  corosync, then
  start pacemaker. Stopping pacemaker and corosync does not work in most of
  the cases; usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly confused
 state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
 
 
  As I wrote I use Ubuntu trusty, the exact package versions are:
 
  corosync 2.3.0-1ubuntu5
  pacemaker 1.1.10+git20130802-1ubuntu2
 
 Ah sorry, I seem to have missed that part.
 
 
  There are no updates available. The only option is to install from sources,
 but that would be very difficult to maintain and I'm not sure I would get rid 
 of
 this issue.
 
  What do you recommend?
 
 The same thing as Lars, or switch to a distro that stays current with upstream
 (git shows 5 newer releases for that branch since it was released 3 years
 ago).
  If you do build from source, it's probably best to go with v1.4.6

Hm, I am a bit confused here. We are using 2.3.0, which was released approx. a 
year ago (you mention 3 years) and you recommend 1.4.6, which is a rather old 
version.
Could you please clarify a bit? :)
Lars recommends 2.3.3 git tree.

I might end up trying both, but just want to make sure I am not 
misunderstanding something badly.

Thank you!

  HTOP shows something like this (sorted by TIME+ descending):
 
 
 
   1  [100.0%] Tasks: 59, 4
  thr; 2 running
   2  [| 0.7%] Load average: 
  1.00 0.99 1.02
   Mem[ 165/994MB] Uptime: 1
  day, 10:22:03
   Swp[   0/509MB]
 
   PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
   921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58
 /usr/sbin/corosync
  1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 
  /usr/sbin/snmpd
 -
  Lsd -Lf /dev/null -u snmp -g snm
  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71
  /usr/lib/pacemaker/cib
  1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06
  /usr/lib/pacemaker/stonithd
  1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 
  /usr/sbin/watchdog
  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62
  /usr/lib/pacemaker/crmd
  1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64
  /usr/lib/pacemaker/lrmd
  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01
  /usr/lib/pacemaker/attrd
  1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
  1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: 
  read
 process
  1315 hacluster  20   0 73892  2652

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-11 Thread Andrew Beekhof

On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com wrote:

 
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 Thanks for the quick response!
 
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 Hello,
 
 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and suddenly the
 crm_mon output remains static.
 When I check the cpu usage, I see that one of the cores uses 100%
 cpu, but
 cannot actually match it to either the corosync or one of the
 pacemaker processes.
 
 In such a case, this high CPU usage is happening on all 7 nodes.
 I have to manually go to each node, stop pacemaker, restart
 corosync, then
 start pacemaker. Stopping pacemaker and corosync does not work in most of
 the cases; usually a kill -9 is needed.
 
 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
 Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
 
 Logs are usually flooded with CPG related messages, such as:
 
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 
 OR
 
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 
 That is usually a symptom of corosync getting into a horribly confused
 state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.
 
 
 
 As I wrote I use Ubuntu trusty, the exact package versions are:
 
 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2
 
 Ah sorry, I seem to have missed that part.
 
 
 There are no updates available. The only option is to install from sources,
 but that would be very difficult to maintain and I'm not sure I would get 
 rid of
 this issue.
 
 What do you recommend?
 
 The same thing as Lars, or switch to a distro that stays current with 
 upstream
 (git shows 5 newer releases for that branch since it was released 3 years
 ago).
  If you do build from source, it's probably best to go with v1.4.6
 
 Hm, I am a bit confused here. We are using 2.3.0,

I swapped the 2 for a 1 somehow. A bit distracted, sorry.

 which was released approx. a year ago (you mention 3 years) and you recommend 
 1.4.6, which is a rather old version.
 Could you please clarify a bit? :)
 Lars recommends 2.3.3 git tree.
 
 I might end up trying both, but just want to make sure I am not 
 misunderstanding something badly.
 
 Thank you!
 
 HTOP shows something like this (sorted by TIME+ descending):
 
 
 
 1  [100.0%] Tasks: 59, 4
 thr; 2 running
 2  [| 0.7%] Load average: 
 1.00 0.99 1.02
 Mem[ 165/994MB] Uptime: 1
 day, 10:22:03
 Swp[   0/509MB]
 
 PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58
 /usr/sbin/corosync
 1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 
 /usr/sbin/snmpd
 -
 Lsd -Lf /dev/null -u snmp -g snm
 1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71
 /usr/lib/pacemaker/cib
 1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06
 /usr/lib/pacemaker/stonithd
 1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 
 /usr/sbin/watchdog
 1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62
 /usr/lib/pacemaker/crmd
 1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64
 /usr/lib/pacemaker/lrmd
 1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01
 /usr/lib/pacemaker/attrd
 1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
 1250 root   20   0 33000  1192   928 S  0.0  0.1

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-10 Thread Andrew Beekhof

On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com wrote:

 Thanks for the quick response!
 
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
 Hello,
 
 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and suddenly the
 crm_mon output remains static.
 When I check the cpu usage, I see that one of the cores uses 100% cpu, but
 cannot actually match it to either the corosync or one of the pacemaker
 processes.
 
 In such a case, this high CPU usage is happening on all 7 nodes.
 I have to manually go to each node, stop pacemaker, restart corosync, then
 start pacemaker. Stopping pacemaker and corosync does not work in most of
 the cases; usually a kill -9 is needed.
 
 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
 Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
 
 Logs are usually flooded with CPG related messages, such as:
 
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 
 OR
 
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 
 That is usually a symptom of corosync getting into a horribly confused state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.
 
 
 
 As I wrote I use Ubuntu trusty, the exact package versions are:
 
 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2

Ah sorry, I seem to have missed that part.

 
 There are no updates available. The only option is to install from sources, 
 but that would be very difficult to maintain and I'm not sure I would get rid 
 of this issue.
 
 What do you recommend?

The same thing as Lars, or switch to a distro that stays current with upstream 
(git shows 5 newer releases for that branch since it was released 3 years ago).
If you do build from source, it's probably best to go with v1.4.6

 
 
 
 HTOP shows something like this (sorted by TIME+ descending):
 
 
 
  1  [100.0%] Tasks: 59, 4
 thr; 2 running
  2  [| 0.7%] Load average: 1.00 
 0.99 1.02
  Mem[ 165/994MB] Uptime: 1
 day, 10:22:03
  Swp[   0/509MB]
 
  PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58 
 /usr/sbin/corosync
 1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 
 /usr/sbin/snmpd -
 Lsd -Lf /dev/null -u snmp -g snm
 1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71
 /usr/lib/pacemaker/cib
 1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06
 /usr/lib/pacemaker/stonithd
 1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 
 /usr/sbin/watchdog
 1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62
 /usr/lib/pacemaker/crmd
 1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64
 /usr/lib/pacemaker/lrmd
 1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01
 /usr/lib/pacemaker/attrd
 1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
 1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read 
 process
 1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25
 /usr/lib/pacemaker/pengine
 1252 root   20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: 
 write process
 1835 ntp20   0 27216  1980  1408 S  0.0  0.2  0:11.80 
 /usr/sbin/ntpd -p
 /var/run/ntpd.pid -g -u 105:112
  899 root   20   0 19168   700   488 S  0.0  0.1  0:09.75 
 /usr/sbin/irqbalance
 1642 root   20   0 30696  1556   912 S  0.0  0.2  0:06.49 
 /usr/bin/monit -c
 /etc/monit/monitrc
 4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 3079 root0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop 
 -a -w
 /var/log/atop/atop_20140306 6

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-09 Thread Lars Marowsky-Bree
On 2014-03-07T09:08:41, Attila Megyeri amegy...@minerva-soft.com wrote:

 One more thing to add. I did an apt-get upgrade on one of the nodes, and then 
 restarted the node. It resulted in this state on all other nodes again...

2.3.0 is not the most recent corosync version. 2.3.3 (and possibly the
git tree) contain quite a number of important fixes.

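For anyone taking the source route, the build is the usual autotools sequence. A
sketch only - it assumes the build dependencies (libqb-dev, libnss3-dev, pkg-config
and friends) are already installed, and that the tag follows corosync's usual
vX.Y.Z naming:

    git clone https://github.com/corosync/corosync.git
    cd corosync
    git checkout v2.3.3        # or stay on the current git tree for the latest fixes
    ./autogen.sh
    ./configure --prefix=/usr --sysconfdir=/etc --localstatedir=/var
    make && sudo make install
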
I'd suggest asking Ubuntu for an update - or submitting one yourself;
community distributions welcome contributors ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-07 Thread Attila Megyeri
One more thing to add. I did an apt-get upgrade on one of the nodes, and then 
restarted the node. It resulted in this state on all other nodes again...

 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Friday, March 07, 2014 7:54 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 Thanks for the quick response!
 
  -Original Message-
  From: Andrew Beekhof [mailto:and...@beekhof.net]
  Sent: Friday, March 07, 2014 3:48 AM
  To: The Pacemaker cluster resource manager
  Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
  On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
  wrote:
 
   Hello,
  
   We have a strange issue with Corosync/Pacemaker.
   From time to time, something unexpected happens and suddenly the
  crm_mon output remains static.
   When I check the cpu usage, I see that one of the cores uses 100%
   cpu, but
  cannot actually match it to either the corosync or one of the
  pacemaker processes.
  
   In such a case, this high CPU usage is happening on all 7 nodes.
   I have to manually go to each node, stop pacemaker, restart
   corosync, then
  start pacemaker. Stopping pacemaker and corosync does not work in most of
  the cases; usually a kill -9 is needed.
  
   Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
  
   Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
  
   Logs are usually flooded with CPG related messages, such as:
  
   Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
   Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
   Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
   Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  
   OR
  
   Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
   Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
   Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 
  That is usually a symptom of corosync getting into a horribly confused 
  state.
  Version? Distro? Have you checked for an update?
  Odd that the user of all that CPU isn't showing up though.
 
  
 
 As I wrote I use Ubuntu trusty, the exact package versions are:
 
 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2
 
 There are no updates available. The only option is to install from sources, 
 but
 that would be very difficult to maintain and I'm not sure I would get rid of 
 this
 issue.
 
 What do you recommend?
 
 
  
    HTOP shows something like this (sorted by TIME+ descending):
  
  
  
 1  [100.0%] Tasks: 59, 4
  thr; 2 running
 2  [| 0.7%] Load average: 
   1.00 0.99 1.02
 Mem[ 165/994MB] Uptime: 1
  day, 10:22:03
 Swp[   0/509MB]
  
 PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
 921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58
 /usr/sbin/corosync
   1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 
   /usr/sbin/snmpd -
  Lsd -Lf /dev/null -u snmp -g snm
   1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71
  /usr/lib/pacemaker/cib
   1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06
  /usr/lib/pacemaker/stonithd
   1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 
   /usr/sbin/watchdog
   1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62
  /usr/lib/pacemaker/crmd
   1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64
  /usr/lib/pacemaker/lrmd
   1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01
  /usr/lib/pacemaker/attrd
   1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
   1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: 
   read
 process
   1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25
  /usr/lib/pacemaker/pengine
   1252 root   20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: 
   write
 process
   1835 ntp20   0 27216  1980  1408 S  0.0  0.2  0:11.80 
   /usr/sbin/ntpd -p
  /var/run/ntpd.pid -g -u 105:112
 899 root   20   0 19168   700   488 S  0.0  0.1  0:09.75 
   /usr/sbin/irqbalance
   1642 root   20   0 30696  1556   912 S  0.0  0.2  0:06.49 
   /usr/bin/monit -c
  /etc/monit/monitrc
   4374 kamailio   20   0

[Pacemaker] Pacemaker/corosync freeze

2014-03-06 Thread Attila Megyeri
Hello,

We have a strange issue with Corosync/Pacemaker.
From time to time, something unexpected happens and suddenly the crm_mon 
output remains static.
When I check the cpu usage, I see that one of the cores uses 100% cpu, but 
cannot actually match it to either the corosync or one of the pacemaker 
processes.

In such a case, this high CPU usage is happening on all 7 nodes.
I have to manually go to each node, stop pacemaker, restart corosync, then 
start pacemaker. Stopping pacemaker and corosync does not work in most of the
cases; usually a kill -9 is needed.

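For reference, that manual recovery amounts to something like this on each node (a
sketch using the Ubuntu service names; the pkill patterns match the daemon paths
visible in the process list below):

    service pacemaker stop || { pkill -9 pacemakerd; pkill -9 -f /usr/lib/pacemaker; }
    service corosync stop  || pkill -9 corosync
    service corosync start
    service pacemaker start
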
Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.

Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.

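For reference, the totem section for this kind of setup looks roughly like the
sketch below; the bindnetaddr and node addresses are placeholders, not the networks
actually used here:

    totem {
        version: 2
        transport: udpu
        rrp_mode: passive
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.0.0    # placeholder: first Gigabit network
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.1.0    # placeholder: second Gigabit network
            mcastport: 5405
        }
    }

    nodelist {
        node {
            ring0_addr: 192.168.0.11    # placeholder; one node block per member
            ring1_addr: 192.168.1.11
            nodeid: 1
        }
    }

With rrp in use, corosync-cfgtool -s on each node shows whether both rings are
active and fault-free.
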
Logs are usually flooded with CPG related messages, such as:

Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 
CPG messages  (1 remaining, last=8): Try again (6)
Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 
CPG messages  (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 
CPG messages  (1 remaining, last=8): Try again (6)
Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 
CPG messages  (1 remaining, last=8): Try again (6)

OR

Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 0 
CPG messages  (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 0 
CPG messages  (1 remaining, last=10933): Try again (
Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 0 
CPG messages  (1 remaining, last=10933): Try again (

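Those entries come from the send path between the Pacemaker daemons and corosync's
CPG (Closed Process Group) service: the send call returns CS_ERR_TRY_AGAIN - numeric
value 6, hence the "Try again (6)" - whenever corosync cannot accept a message, e.g.
during a membership sync, so the sender keeps it queued and retries. In sketch form
(not the actual Pacemaker code):

    #include <errno.h>
    #include <sys/uio.h>
    #include <corosync/cpg.h>

    /* Send one queued message. CS_ERR_TRY_AGAIN means "corosync is busy,
     * keep it queued and call again later" - harmless for a moment, a
     * problem when it repeats forever as in the logs above. */
    static int try_send(cpg_handle_t handle, struct iovec *iov)
    {
        cs_error_t rc = cpg_mcast_joined(handle, CPG_TYPE_AGREED, iov, 1);

        if (rc == CS_OK) {
            return 0;           /* message handed to corosync */
        }
        if (rc == CS_ERR_TRY_AGAIN) {
            return -EAGAIN;     /* leave it on the queue, retry later */
        }
        return -EIO;            /* anything else is a real failure */
    }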

HTOP shows something like this (sorted by TIME+ descending):



  1  [100.0%] Tasks: 59, 4 thr; 2 
running
  2  [| 0.7%] Load average: 1.00 
0.99 1.02
  Mem[ 165/994MB] Uptime: 1 day, 
10:22:03
  Swp[   0/509MB]

  PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
  921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58 
/usr/sbin/corosync
1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd 
-Lsd -Lf /dev/null -u snmp -g snm
1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71 
/usr/lib/pacemaker/cib
1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06 
/usr/lib/pacemaker/stonithd
1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 /usr/sbin/watchdog
1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62 
/usr/lib/pacemaker/crmd
1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64 
/usr/lib/pacemaker/lrmd
1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01 
/usr/lib/pacemaker/attrd
1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read 
process
1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25 
/usr/lib/pacemaker/pengine
1252 root   20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write 
process
1835 ntp20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd -p 
/var/run/ntpd.pid -g -u 105:112
  899 root   20   0 19168   700   488 S  0.0  0.1  0:09.75 
/usr/sbin/irqbalance
1642 root   20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit -c 
/etc/monit/monitrc
4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
3079 root0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop -a 
-w /var/log/atop/atop_20140306 6
  445 syslog 20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
1 root   20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
  453 syslog 20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
  451 syslog 20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili
23315 root   20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
4367 kamailio   20   0  291M 1  4864 S  0.0  1.0  0:00.36 
/usr/local/sbin/kamailio -f /etc/kamailio/kamaili


My questions:

-   Is this a corosync or pacemaker issue?

-   What are the CPG messages? Is it possible that we have a firewall issue?

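On the CPG question: CPG (Closed Process Group) is the corosync service the
Pacemaker daemons use for ordered, cluster-wide messaging. The daemons reach
corosync over local IPC, and corosync carries the traffic between nodes on the
totem UDP ports, so a firewall that drops that UDP traffic can produce exactly
this kind of stall and is worth ruling out. With udpu, corosync uses the
configured mcastport (default 5405) and may also use mcastport - 1; rules along
these lines, with placeholder addresses, keep both ports open on both ring
networks:

    iptables -A INPUT -p udp --dport 5404:5405 -s 192.168.0.0/24 -j ACCEPT
    iptables -A INPUT -p udp --dport 5404:5405 -s 192.168.1.0/24 -j ACCEPT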

Any hints would be great!

Thanks,
Attila
___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: 

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-06 Thread Andrew Beekhof

On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com wrote:

 Hello,
  
 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and suddenly the crm_mon 
 output remains static.
 When I check the cpu usage, I see that one of the cores uses 100% cpu, but 
 cannot actually match it to either the corosync or one of the pacemaker 
 processes.
  
 In such a case, this high CPU usage is happening on all 7 nodes.
 I have to manually go to each node, stop pacemaker, restart corosync, then 
 start pacemaker. Stopping pacemaker and corosync does not work in most of the
 cases; usually a kill -9 is needed.
  
 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
  
 Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
  
 Logs are usually flooded with CPG related messages, such as:
  
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 
 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 
 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 
 0 CPG messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 
 0 CPG messages  (1 remaining, last=8): Try again (6)
  
 OR
  
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 
 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 
 0 CPG messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:Sent 
 0 CPG messages  (1 remaining, last=10933): Try again (

That is usually a symptom of corosync getting into a horribly confused state.  
Version? Distro? Have you checked for an update?
Odd that the user of all that CPU isn't showing up though.

  
  
 HTOP shows something like this (sorted by TIME+ descending):
  
  
  
   1  [100.0%] Tasks: 59, 4 thr; 2 
 running
   2  [| 0.7%] Load average: 1.00 
 0.99 1.02
   Mem[ 165/994MB] Uptime: 1 day, 
 10:22:03
   Swp[   0/509MB]
  
   PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
   921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58 
 /usr/sbin/corosync
 1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 /usr/sbin/snmpd 
 -Lsd -Lf /dev/null -u snmp -g snm
 1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71 
 /usr/lib/pacemaker/cib
 1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06 
 /usr/lib/pacemaker/stonithd
 1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 
 /usr/sbin/watchdog
 1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62 
 /usr/lib/pacemaker/crmd
 1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64 
 /usr/lib/pacemaker/lrmd
 1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01 
 /usr/lib/pacemaker/attrd
 1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
 1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read 
 process
 1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25 
 /usr/lib/pacemaker/pengine
 1252 root   20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: write 
 process
 1835 ntp20   0 27216  1980  1408 S  0.0  0.2  0:11.80 /usr/sbin/ntpd 
 -p /var/run/ntpd.pid -g -u 105:112
   899 root   20   0 19168   700   488 S  0.0  0.1  0:09.75 
 /usr/sbin/irqbalance
 1642 root   20   0 30696  1556   912 S  0.0  0.2  0:06.49 /usr/bin/monit 
 -c /etc/monit/monitrc
 4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77 
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 3079 root0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop 
 -a -w /var/log/atop/atop_20140306 6
   445 syslog 20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
 4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03 
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 1 root   20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
   453 syslog 20   0  249M  6276   976 S  0.0  0.6  0:00.63 rsyslogd
   451 syslog 20   0  249M  6276   976 S  0.0  0.6  0:00.53 rsyslogd
 4379 kamailio   20   0  291M  6224  1132 S  0.0  0.6  0:00.38 
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 4380 kamailio   20   0  291M  8516  3084 S  0.0  0.8  0:00.38 
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 4381 kamailio   20   0  291M  8252  2828 S  0.0  0.8  0:00.37 
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
 23315 root   20   0 24872  2476  1412 R  0.7  0.2  0:00.37 htop
 4367 kamailio   20   0  291M 1  4864 S  0.0  1.0  0:00.36 
 /usr/local/sbin/kamailio -f 

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-06 Thread Attila Megyeri
Thanks for the quick response!

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze
 
 
 On 7 Mar 2014, at 5:31 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:
 
  Hello,
 
  We have a strange issue with Corosync/Pacemaker.
  From time to time, something unexpected happens and suddenly the
 crm_mon output remains static.
  When I check the cpu usage, I see that one of the cores uses 100% cpu, but
 cannot actually match it to either the corosync or one of the pacemaker
 processes.
 
  In such a case, this high CPU usage is happening on all 7 nodes.
  I have to manually go to each node, stop pacemaker, restart corosync, then
 start pacemaker. Stopping pacemaker and corosync does not work in most of
 the cases; usually a kill -9 is needed.
 
  Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.
 
  Using udpu as transport, two rings on Gigabit ETH, rrp_mode passive.
 
  Logs are usually flooded with CPG related messages, such as:
 
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
  Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=8): Try again (6)
 
  OR
 
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
  Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:   Sent 0 CPG messages  (1 remaining, last=10933): Try again (
 
 That is usually a symptom of corosync getting into a horribly confused state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.
 
 

As I wrote I use Ubuntu trusty, the exact package versions are:

corosync 2.3.0-1ubuntu5
pacemaker 1.1.10+git20130802-1ubuntu2

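For what it's worth, the available and running versions can be double-checked on a
node like so:

    apt-cache policy corosync pacemaker    # candidate versions in the archives
    corosync -v                            # version of the running binary
    crm_mon --version
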
There are no updates available. The only option is to install from sources, but 
that would be very difficult to maintain and I'm not sure I would get rid of 
this issue.

What do you recommend?


 
  HTOP shows something like this (sorted by TIME+ descending):
 
 
 
1  [100.0%] Tasks: 59, 4
 thr; 2 running
2  [| 0.7%] Load average: 
  1.00 0.99 1.02
Mem[ 165/994MB] Uptime: 1
 day, 10:22:03
Swp[   0/509MB]
 
PID USER  PRI  NI  VIRT   RES   SHR S CPU% MEM%   TIME+  Command
921 root   20   0  188M 49220 33856 R  0.0  4.8  3h33:58 
  /usr/sbin/corosync
  1277 snmp   20   0 45708  4248  1472 S  0.0  0.4  1:33.07 
  /usr/sbin/snmpd -
 Lsd -Lf /dev/null -u snmp -g snm
  1311 hacluster  20   0  109M 16160  9640 S  0.0  1.6  1:12.71
 /usr/lib/pacemaker/cib
  1312 root   20   0  104M  7484  3780 S  0.0  0.7  0:38.06
 /usr/lib/pacemaker/stonithd
  1611 root   -2   0  4408  2356  2000 S  0.0  0.2  0:24.15 
  /usr/sbin/watchdog
  1316 hacluster  20   0  122M  9756  5924 S  0.0  1.0  0:22.62
 /usr/lib/pacemaker/crmd
  1313 root   20   0 81784  3800  2876 S  0.0  0.4  0:18.64
 /usr/lib/pacemaker/lrmd
  1314 hacluster  20   0 96616  4132  2604 S  0.0  0.4  0:16.01
 /usr/lib/pacemaker/attrd
  1309 root   20   0  104M  4804  2580 S  0.0  0.5  0:15.56 pacemakerd
  1250 root   20   0 33000  1192   928 S  0.0  0.1  0:13.59 ha_logd: read 
  process
  1315 hacluster  20   0 73892  2652  1952 S  0.0  0.3  0:13.25
 /usr/lib/pacemaker/pengine
  1252 root   20   0 33000   712   456 S  0.0  0.1  0:13.03 ha_logd: 
  write process
  1835 ntp20   0 27216  1980  1408 S  0.0  0.2  0:11.80 
  /usr/sbin/ntpd -p
 /var/run/ntpd.pid -g -u 105:112
899 root   20   0 19168   700   488 S  0.0  0.1  0:09.75 
  /usr/sbin/irqbalance
  1642 root   20   0 30696  1556   912 S  0.0  0.2  0:06.49 
  /usr/bin/monit -c
 /etc/monit/monitrc
  4374 kamailio   20   0  291M  7272  2188 S  0.0  0.7  0:02.77
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
  3079 root0 -20 16864  4592  3508 S  0.0  0.5  0:01.51 /usr/bin/atop 
  -a -w
 /var/log/atop/atop_20140306 6
445 syslog 20   0  249M  6276   976 S  0.0  0.6  0:01.16 rsyslogd
  4373 kamailio   20   0  291M  7492  2396 S  0.0  0.7  0:01.03
 /usr/local/sbin/kamailio -f /etc/kamailio/kamaili
  1 root   20   0 33376  2632  1404 S  0.0  0.3  0:00.63 /sbin/init
453 syslog 20   0  249M