Re: [Pacemaker] [corosync] active/active with Radius

2015-02-16 Thread Jan Friesse
This is really a question for the pacemaker list, so CCing.

Regards,
  Honza


 Hi,
 
 I would like Corosync to manage Radius in an active/active
 configuration, but I don't know how I should add this, so I was wondering
 if somebody could point me in the right direction.
 
 Thanks and kind regards,
 Soph.
 
 -- Details --
 
 So far I have this,
  # crm configure show
 node centos6-radius0-kawazu
 node centos6-radius1-yetti
 primitive failover-ip ocf:heartbeat:IPaddr \
 params ip=192.168.10.200 \
 op monitor interval=2s
 property $id=cib-bootstrap-options \
 dc-version=1.1.10-14.el6_5.2-368c726 \
 cluster-infrastructure="classic openais (with plugin)" \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 no-quorum-policy=ignore
 
 And wondered if I should add this:
 # crm configure primitive RADIUS lsb:radiusd op monitor interval=5s
 timeout=20s start-delay=0s
 Using an ocf:heartbeat agent might be better, but I read this may not
 work because of the way radiusd forks. (Reference:
 http://oss.clusterlabs.org/pipermail/pacemaker/2012-April/013790.html )
 
 If not then how should I configure this?
 
 My O/S is CentOS 6.
 
 -- End --
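A minimal sketch of one common active/active approach (untested; the primitive, clone, and colocation names below are placeholders) is to define the radiusd primitive and clone it so it runs on both nodes, alongside the existing floating IP:

# crm configure primitive p_radiusd lsb:radiusd \
    op monitor interval=5s timeout=20s
# crm configure clone cl_radiusd p_radiusd
# crm configure colocation ip_with_radiusd inf: failover-ip cl_radiusd

The clone keeps radiusd running on every online node, while the colocation ties the virtual IP to a node that actually has a working radiusd.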
 ___
 discuss mailing list
 disc...@corosync.org
 http://lists.corosync.org/mailman/listinfo/discuss
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Openais] Issues with a squid cluster.

2015-02-10 Thread Jan Friesse
This is really a question for the pacemaker list, so CCing.

Regards,
  Honza

Redeye wrote:
 I am not certain where I should post this, hopefully someone will point me in 
 the right direction.
 
 I have a two-node cluster on Ubuntu 12.04 with corosync, pacemaker, and squid.
 Squid does not start at boot; pacemaker controls that. The two servers are
 communicating just fine, and pacemaker starts, stops, and monitors the squid
 resources just fine too. My problem is that I am unable to do anything with
 the squid instances directly. For example, I want to update an ACL, and I want
 to bounce the squid service to load the new settings. "service squid3
 stop|start|status|restart" etc. does nothing; it returns "unknown instance".
 "ps -af | grep squid" shows two instances, one as user root and one as user
 proxy, and squid is doing what it is supposed to.
 
 What can I do to remedy this?
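When pacemaker owns the squid resource, the usual remedy is to bounce it through the cluster rather than the init/upstart job. A sketch (the resource name is a placeholder; use whatever crm_mon shows):

crm resource restart <squid-resource>    # restart squid via the cluster manager
crm resource cleanup <squid-resource>    # clear any failure entries afterwards
# to pick up an updated ACL without a restart, squid's own reload may also work:
squid3 -k reconfigure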
 ___
 Openais mailing list
 open...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/openais
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Openais] problem to delete resource

2015-02-04 Thread Jan Friesse

This is really a question for the pacemaker list, so CCing.

Regards,
  Honza

Vladimir Berezovski (vberezov) wrote:

Hi ,

I added a new resource like

crm(live)configure# primitive p_drbd_ora ocf:linbit:drbd params 
drbd_resource=clusterdb_res_ora op monitor interval=60s


but its status is FAILED (unmanaged). I tried to stop and delete it, but with
no result - it is still running. How can I resolve this issue?


[root@node1 ~]#  crm configure show
node node1 \
 attributes standby=off
node node2
primitive p_drbd_ora ocf:linbit:drbd \
 params drbd_resource=clusterdb_res_ora \
 op monitor interval=60s \
 meta target-role=Stopped is-managed=true
property cib-bootstrap-options: \
 dc-version=1.1.11-97629de \
 cluster-infrastructure="classic openais (with plugin)" \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 no-quorum-policy=ignore \
 last-lrm-refresh=1422887129
rsc_defaults rsc-options: \
 resource-stickiness=100


[root@node1 ~]# crm_mon -1
Last updated: Mon Feb  2 17:12:40 2015
Last change: Mon Feb  2 16:44:52 2015
Stack: classic openais (with plugin)
Current DC: node1 - partition WITHOUT quorum
Version: 1.1.11-97629de
2 Nodes configured, 2 expected votes
1 Resources configured


Online: [ node1 ]
OFFLINE: [ node2 ]

p_drbd_ora (ocf::linbit:drbd): FAILED node1 (unmanaged)

Failed actions:
 p_drbd_ora_stop_0 on node1 'not configured' (6): call=6, status=complete, 
last-rc-change='Mon Feb  2 16:54:19 2015', queued=0ms, exec=26ms
 p_drbd_ora_stop_0 on node1 'not configured' (6): call=6, status=complete, 
last-rc-change='Mon Feb  2 16:54:19 2015', queued=0ms, exec=26ms



#crm resource stop  p_drbd_ora

[root@node1 ~]# crm configure delete p_drbd_ora
ERROR: resource p_drbd_ora is running, can't delete it
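A sequence that usually clears this up (a sketch; the failed stop record has to be cleaned from the status section before the resource can be stopped and removed):

# crm resource cleanup p_drbd_ora     # clear the failed p_drbd_ora_stop_0 entries
# crm resource stop p_drbd_ora
# crm configure delete p_drbd_ora

The 'not configured' (6) stop failures are returned by the resource agent itself, so it is also worth checking that the drbd resource clusterdb_res_ora really exists on node1 before re-adding the primitive.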

Regards ,


Vladimir Berezovski



___
Openais mailing list
open...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-20 Thread Jan Friesse
Kostiantyn,


 One more thing to clarify.
 You said rebind can be avoided - what does it mean?

By that I mean that as long as you don't shut down the interface, everything
will work as expected. Shutting an interface down is an administrator decision;
the system doesn't do it automagically :)

Regards,
  Honza

 
 Thank you,
 Kostya
 
 On Wed, Jan 14, 2015 at 1:31 PM, Kostiantyn Ponomarenko 
 konstantin.ponomare...@gmail.com wrote:
 
 Thank you. Now I am aware of it.

 Thank you,
 Kostya

 On Wed, Jan 14, 2015 at 12:59 PM, Jan Friesse jfrie...@redhat.com wrote:

 Kostiantyn,

 Honza,

 Thank you for helping me.
 So, there is no defined behavior in case one of the interfaces is not in
 the system?

 You are right. There is no defined behavior.

 Regards,
   Honza




 Thank you,
 Kostya

 On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com
 wrote:

 Kostiantyn,


 According to the https://access.redhat.com/solutions/638843 , the
 interface, that is defined in the corosync.conf, must be present in
 the
 system (see at the bottom of the article, section ROOT CAUSE).
 To confirm that I made a couple of tests.

 Here is a part of the corosync.conf file (in a free-write form) (also
 attached the origin config file):
 ===
 rrp_mode: passive
 ring0_addr is defined in corosync.conf
 ring1_addr is defined in corosync.conf
 ===

 ---

 Two-node cluster

 ---

 Test #1:
 --
 IP for ring0 is not defined in the system:
 --
 Start Corosync simultaneously on both nodes.
 Corosync fails to start.
 From the logs:
 Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
 config: No interfaces defined
 Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync
 Cluster
 Engine exiting with status 8 at main.c:1343.
 Result: Corosync and Pacemaker are not running.

 Test #2:
 --
 IP for ring1 is not defined in the system:
 --
 Start Corosync simultaneously on both nodes.
 Corosync starts.
 Start Pacemaker simultaneously on both nodes.
 Pacemaker fails to start.
 From the logs, the last writes from the corosync:
 Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid
 0
 interface 169.254.1.3 FAULTY
 Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ]
 Automatically
 recovered ring 0
 Result: Corosync and Pacemaker are not running.


 Test #3:

 rrp_mode: active leads to the same result, except Corosync and
 Pacemaker
 init scripts return status running.
 But still vim /var/log/cluster/corosync.log shows a lot of errors
 like:
 Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch:
 Connection
 to the CPG API failed: Library error (2)

 Result: Corosync and Pacemaker show their statuses as running, but
 crm_mon cannot connect to the cluster database. And half of the
 Pacemaker's services are not running (including Cluster Information
 Base
 (CIB)).


 ---

 For a single node mode

 ---

 IP for ring0 is not defined in the system:

 Corosync fails to start.

 IP for ring1 is not defined in the system:

 Corosync and Pacemaker are started.

 It is possible that configuration will be applied successfully (50%),

 and it is possible that the cluster is not running any resources,

 and it is possible that the node cannot be put in a standby mode
 (shows:
 communication error),

 and it is possible that the cluster is running all resources, but
 applied
 configuration is not guaranteed to be fully loaded (some rules can be
 missed).


 ---

 Conclusions:

 ---

 It is possible that in some rare cases (see comments to the bug) the
 cluster will work, but in that case its working state is unstable and
 the
 cluster can stop working every moment.


 So, is this correct? Do my assumptions make any sense? I didn't find any
 other explanation on the net ...

 Corosync needs all interfaces during start and runtime. This doesn't
 mean they must be connected (that would make corosync unusable for a
 physical NIC/switch or cable failure), but they must be up and have a
 correct IP.

 When this is not the case, corosync rebinds to localhost and weird
 things happen. Removal of this rebinding has been a long-time TODO, but
 there are still more important bugs (especially because the rebind can
 be avoided).

 Regards,
   Honza




 Thank you,
 Kostya

 On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
 konstantin.ponomare...@gmail.com wrote:

 Hi guys,

 Corosync fails to start if there is no such network interface
 configured
 in the system.
 Even with rrp_mode: passive the problem is the same when at least
 one
 network interface is not configured in the system.

 Is this the expected behavior?
 I

Re: [Pacemaker] [corosync] CoroSync's UDPu transport for public IP addresses?

2015-01-19 Thread Jan Friesse

Dmitry,



Great, it works! Thank you.

It would be extremely helpful if this information were included in the
default corosync.conf as comments:
- regarding the allowed and even preferred absence of totem.interface in the
case of UDPu


Yep


- that the quorum section must not be empty, and that the default
quorum.provider could be corosync_votequorum (but not empty).


This is not entirely true. quorum.provider cannot be an empty string; it
generally must be a valid provider like corosync_votequorum. But an
unspecified quorum.provider works without any problem (as in the example
configuration file). The catch is that Pacemaker must then be configured in
a way that quorum is not required.
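On the Pacemaker side that typically means something like the property already shown in several configurations in this thread (a sketch, for clusters that should keep running without quorum):

# crm configure property no-quorum-policy=ignore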


Regards,
  Honza



It would help novices install and launch corosync quickly.


On Fri, Jan 16, 2015 at 7:31 PM, Jan Friesse jfrie...@redhat.com wrote:


Dmitry Koterov wrote:




  such messages (for now). But, anyway, DNS names in ringX_addr seem not
to work, and no relevant messages appear in the default logs. Maybe add some
validation for ringX_addr?
validations for ringX_addr?

I'm having resolvable DNS names:

root@node1:/etc/corosync# ping -c1 -W100 node1 | grep from
64 bytes from node1 (127.0.1.1): icmp_seq=1 ttl=64 time=0.039 ms



This is the problem. Resolving node1 to localhost (127.0.0.1) is simply
wrong. Names you want to use in corosync.conf should resolve to the
interface address. I believe the other nodes have a similar setting (so node2
resolved on node2 is again 127.0.0.1).



Wow! What a shame! How could I miss it... So you're absolutely right,
thanks: that was the cause, an entry in /etc/hosts. On some machines I
removed it manually, but on others I didn't. Now I do it automatically
with sed -i -r "/^.*[[:space:]]$host([[:space:]]|\$)/d" /etc/hosts in the
initialization script.

I apologize for the mess.

So now I have only one place in corosync.conf where I need to specify a
plain IP address for UDPu: totem.interface.bindnetaddr. If I specify
0.0.0.0 there, I get the message "Service engine 'corosync_quorum'
failed to load for reason 'configuration error: nodelist or
quorum.expected_votes must be configured!'" in the logs (BTW it does not
say that I made a mistake in bindnetaddr). Is there a way to completely untie
the configuration from IP addresses?



You can just remove the whole interface section. Corosync will find the
correct address from the nodelist.
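A minimal corosync 2.x UDPu configuration without any totem.interface section might look like this (a sketch; the addresses and nodeids are placeholders):

totem {
    version: 2
    transport: udpu
}

nodelist {
    node {
        ring0_addr: a.b.c.d
        nodeid: 1
    }
    node {
        ring0_addr: e.f.g.h
        nodeid: 2
    }
    node {
        ring0_addr: i.j.k.l
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

With a populated nodelist, corosync picks the local bind address by matching one of the ring0_addr entries against the addresses configured on the node, so no bindnetaddr is needed.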

Regards,
   Honza






  Please try to fix this problem first and let's see if this solves the
issue you are hitting.

Regards,
Honza

  root@node1:/etc/corosync# ping -c1 -W100 node2 | grep from

64 bytes from node2 (188.166.54.190): icmp_seq=1 ttl=55 time=88.3 ms

root@node1:/etc/corosync# ping -c1 -W100 node3 | grep from
64 bytes from node3 (128.199.116.218): icmp_seq=1 ttl=51 time=252 ms


With corosync.conf below, nothing works:
...
nodelist {
node {
  ring0_addr: node1
}
node {
  ring0_addr: node2
}
node {
  ring0_addr: node3
}
}
...
Jan 14 10:47:44 node1 corosync[15061]:  [MAIN  ] Corosync Cluster Engine
('2.3.3'): started and ready to provide service.
Jan 14 10:47:44 node1 corosync[15061]:  [MAIN  ] Corosync built-in
features: dbus testagents rdma watchdog augeas pie relro bindnow
Jan 14 10:47:44 node1 corosync[15062]:  [TOTEM ] Initializing transport
(UDP/IP Unicast).
Jan 14 10:47:44 node1 corosync[15062]:  [TOTEM ] Initializing
transmit/receive security (NSS) crypto: aes256 hash: sha1
Jan 14 10:47:44 node1 corosync[15062]:  [TOTEM ] The network interface
[a.b.c.d] is now up.
Jan 14 10:47:44 node1 corosync[15062]:  [SERV  ] Service engine loaded:
corosync configuration map access [0]
Jan 14 10:47:44 node1 corosync[15062]:  [QB] server name: cmap
Jan 14 10:47:44 node1 corosync[15062]:  [SERV  ] Service engine loaded:
corosync configuration service [1]
Jan 14 10:47:44 node1 corosync[15062]:  [QB] server name: cfg
Jan 14 10:47:44 node1 corosync[15062]:  [SERV  ] Service engine loaded:
corosync cluster closed process group service v1.01 [2]
Jan 14 10:47:44 node1 corosync[15062]:  [QB] server name: cpg
Jan 14 10:47:44 node1 corosync[15062]:  [SERV  ] Service engine loaded:
corosync profile loading service [4]
Jan 14 10:47:44 node1 corosync[15062]:  [WD] No Watchdog, try modprobe a watchdog
Jan 14 10:47:44 node1 corosync[15062]:  [WD] no resources
configured.
Jan 14 10:47:44 node1 corosync[15062]:  [SERV  ] Service engine loaded:
corosync watchdog service [7]
Jan 14 10:47:44 node1 corosync[15062]:  [QUORUM] Using quorum provider
corosync_votequorum
Jan 14 10:47:44 node1 corosync[15062]:  [QUORUM] Quorum provider:
corosync_votequorum failed to initialize.
Jan 14 10:47:44 node1 corosync[15062]:  [SERV  ] Service engine
'corosync_quorum' failed to load for reason 'configuration error: nodelist
or quorum.expected_votes must be configured!'
Jan 14 10:47:44 node1 corosync[15062]:  [MAIN  ] Corosync Cluster Engine
exiting with status 20 at service.c:356.


But with IP addresses specified in ringX_addr, everything works:
...
nodelist {
node

Re: [Pacemaker] [corosync] CoroSync's UDPu transport for public IP addresses?

2015-01-16 Thread Jan Friesse
:48:28 node1 corosync[15156]:  [SERV  ] Service engine loaded:
corosync configuration service [1]
Jan 14 10:48:28 node1 corosync[15156]:  [QB] server name: cfg
Jan 14 10:48:28 node1 corosync[15156]:  [SERV  ] Service engine loaded:
corosync cluster closed process group service v1.01 [2]
Jan 14 10:48:28 node1 corosync[15156]:  [QB] server name: cpg
Jan 14 10:48:28 node1 corosync[15156]:  [SERV  ] Service engine loaded:
corosync profile loading service [4]
Jan 14 10:48:28 node1 corosync[15156]:  [WD] No Watchdog, try modprobe a watchdog
Jan 14 10:48:28 node1 corosync[15156]:  [WD] no resources configured.
Jan 14 10:48:28 node1 corosync[15156]:  [SERV  ] Service engine loaded:
corosync watchdog service [7]
Jan 14 10:48:28 node1 corosync[15156]:  [QUORUM] Using quorum provider
corosync_votequorum
Jan 14 10:48:28 node1 corosync[15156]:  [SERV  ] Service engine loaded:
corosync vote quorum service v1.0 [5]
Jan 14 10:48:28 node1 corosync[15156]:  [QB] server name: votequorum
Jan 14 10:48:28 node1 corosync[15156]:  [SERV  ] Service engine loaded:
corosync cluster quorum service v0.1 [3]
Jan 14 10:48:28 node1 corosync[15156]:  [QB] server name: quorum
Jan 14 10:48:28 node1 corosync[15156]:  [TOTEM ] adding new UDPU member
{a.b.c.d}
Jan 14 10:48:28 node1 corosync[15156]:  [TOTEM ] adding new UDPU member
{e.f.g.h}
Jan 14 10:48:28 node1 corosync[15156]:  [TOTEM ] adding new UDPU member
{i.j.k.l}
Jan 14 10:48:28 node1 corosync[15156]:  [TOTEM ] A new membership
(m.n.o.p:80) was formed. Members joined: 1760315215
Jan 14 10:48:28 node1 corosync[15156]:  [QUORUM] Members[1]: 1760315215
Jan 14 10:48:28 node1 corosync[15156]:  [MAIN  ] Completed service
synchronization, ready to provide service.


On Mon, Jan 5, 2015 at 6:45 PM, Jan Friesse jfrie...@redhat.com wrote:


Dmitry,



Sure, in logs I see adding new UDPU member {IP_ADDRESS} (so DNS names
are definitely resolved), but in practice the cluster does not work, as I
said above. So validations of ringX_addr in corosync.conf would be very
helpful in corosync.


that's weird. Because as long as DNS is resolved, corosync works only
with IP. This means, code path is exactly same with IP or with DNS. Do
you have logs from corosync?

Honza




On Fri, Jan 2, 2015 at 2:49 PM, Jan Friesse jfrie...@redhat.com

wrote:



Dmitry,


  No, I meant that if you pass a domain name in ring0_addr, there are no
errors in logs, corosync even seems to find nodes (based on its logs), and
crm_node -l shows them, but in practice nothing really works. A verbose
error message would be very helpful in such a case.


This sounds weird. Are you sure that the DNS names really map to the correct
IP addresses? In logs there should be something like adding new UDPU member
{IP_ADDRESS}.

Regards,
   Honza



On Tuesday, December 30, 2014, Daniel Dehennin 
daniel.dehen...@baby-gnu.org
wrote:

  Dmitry Koterov dmitry.kote...@gmail.com writes:


  Oh, seems I've found the solution! At least two mistakes was in my
corosync.conf (BTW logs did not say about any errors, so my conclusion is
based on my experiments only).

1. nodelist.node MUST contain only IP addresses. No hostnames! They simply
do not work, crm status shows no nodes. And no warnings are in logs
regarding this.



You can add name like this:

  nodelist {
node {
  ring0_addr: public-ip-address-of-the-first-machine
  name: node1
}
node {
  ring0_addr: public-ip-address-of-the-second-machine
  name: node2
}
  }

I used it on Ubuntu Trusty with udpu.

Regards.

--
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started:

http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: http://bugs.clusterlabs.org




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started:

http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: http://bugs.clusterlabs.org





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started:

http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: http://bugs.clusterlabs.org




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started:

http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

Bugs: http://bugs.clusterlabs.org

Re: [Pacemaker] CoroSync's UDPu transport for public IP addresses?

2015-01-14 Thread Jan Friesse
  ] Service engine loaded:
 corosync watchdog service [7]
 Jan 14 10:48:28 node1 corosync[15156]:  [QUORUM] Using quorum provider
 corosync_votequorum
 Jan 14 10:48:28 node1 corosync[15156]:  [SERV  ] Service engine loaded:
 corosync vote quorum service v1.0 [5]
 Jan 14 10:48:28 node1 corosync[15156]:  [QB] server name: votequorum
 Jan 14 10:48:28 node1 corosync[15156]:  [SERV  ] Service engine loaded:
 corosync cluster quorum service v0.1 [3]
 Jan 14 10:48:28 node1 corosync[15156]:  [QB] server name: quorum
 Jan 14 10:48:28 node1 corosync[15156]:  [TOTEM ] adding new UDPU member
 {a.b.c.d}
 Jan 14 10:48:28 node1 corosync[15156]:  [TOTEM ] adding new UDPU member
 {e.f.g.h}
 Jan 14 10:48:28 node1 corosync[15156]:  [TOTEM ] adding new UDPU member
 {i.j.k.l}
 Jan 14 10:48:28 node1 corosync[15156]:  [TOTEM ] A new membership
 (m.n.o.p:80) was formed. Members joined: 1760315215
 Jan 14 10:48:28 node1 corosync[15156]:  [QUORUM] Members[1]: 1760315215
 Jan 14 10:48:28 node1 corosync[15156]:  [MAIN  ] Completed service
 synchronization, ready to provide service.
 
 
 On Mon, Jan 5, 2015 at 6:45 PM, Jan Friesse jfrie...@redhat.com wrote:
 
 Dmitry,


 Sure, in logs I see adding new UDPU member {IP_ADDRESS} (so DNS names
 are definitely resolved), but in practice the cluster does not work, as I
 said above. So validations of ringX_addr in corosync.conf would be very
 helpful in corosync.

 that's weird. Because as long as DNS is resolved, corosync works only
 with IP. This means, code path is exactly same with IP or with DNS. Do
 you have logs from corosync?

 Honza



 On Fri, Jan 2, 2015 at 2:49 PM, Jan Friesse jfrie...@redhat.com wrote:

 Dmitry,


  No, I meant that if you pass a domain name in ring0_addr, there are no
 errors in logs, corosync even seems to find nodes (based on its logs),
 And
 crm_node -l shows them, but in practice nothing really works. A verbose
 error message would be very helpful in such case.


 This sounds weird. Are you sure that DNS names really maps to correct IP
 address? In logs there should be something like adding new UDPU member
 {IP_ADDRESS}.

 Regards,
   Honza


 On Tuesday, December 30, 2014, Daniel Dehennin 
 daniel.dehen...@baby-gnu.org
 wrote:

  Dmitry Koterov dmitry.kote...@gmail.com writes:

  Oh, seems I've found the solution! At least two mistakes was in my
 corosync.conf (BTW logs did not say about any errors, so my
 conclusion
 is
 based on my experiments only).

 1. nodelist.node MUST contain only IP addresses. No hostnames! They

 simply

 do not work, crm status shows no nodes. And no warnings are in logs
 regarding this.


 You can add name like this:

  nodelist {
node {
  ring0_addr: public-ip-address-of-the-first-machine
  name: node1
}
node {
  ring0_addr: public-ip-address-of-the-second-machine
  name: node2
}
  }

 I used it on Ubuntu Trusty with udpu.

 Regards.

 --
 Daniel Dehennin
 Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
 Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF




 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-14 Thread Jan Friesse
Kostiantyn,

 Honza,
 
 Thank you for helping me.
 So, there is no defined behavior in case one of the interfaces is not in
 the system?

You are right. There is no defined behavior.

Regards,
  Honza


 
 
 Thank you,
 Kostya
 
 On Tue, Jan 13, 2015 at 12:01 PM, Jan Friesse jfrie...@redhat.com wrote:
 
 Kostiantyn,


 According to the https://access.redhat.com/solutions/638843 , the
 interface, that is defined in the corosync.conf, must be present in the
 system (see at the bottom of the article, section ROOT CAUSE).
 To confirm that I made a couple of tests.

 Here is a part of the corosync.conf file (in a free-write form) (also
 attached the origin config file):
 ===
 rrp_mode: passive
 ring0_addr is defined in corosync.conf
 ring1_addr is defined in corosync.conf
 ===

 ---

 Two-node cluster

 ---

 Test #1:
 --
 IP for ring0 is not defined in the system:
 --
 Start Corosync simultaneously on both nodes.
 Corosync fails to start.
 From the logs:
 Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
 config: No interfaces defined
 Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
 Engine exiting with status 8 at main.c:1343.
 Result: Corosync and Pacemaker are not running.

 Test #2:
 --
 IP for ring1 is not defined in the system:
 --
 Start Corosync simultaneously on both nodes.
 Corosync starts.
 Start Pacemaker simultaneously on both nodes.
 Pacemaker fails to start.
 From the logs, the last writes from the corosync:
 Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0
 interface 169.254.1.3 FAULTY
 Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically
 recovered ring 0
 Result: Corosync and Pacemaker are not running.


 Test #3:

 rrp_mode: active leads to the same result, except Corosync and
 Pacemaker
 init scripts return status running.
 But still vim /var/log/cluster/corosync.log shows a lot of errors like:
 Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
 to the CPG API failed: Library error (2)

 Result: Corosync and Pacemaker show their statuses as running, but
 crm_mon cannot connect to the cluster database. And half of the
 Pacemaker's services are not running (including Cluster Information Base
 (CIB)).


 ---

 For a single node mode

 ---

 IP for ring0 is not defined in the system:

 Corosync fails to start.

 IP for ring1 is not defined in the system:

 Corosync and Pacemaker are started.

 It is possible that configuration will be applied successfully (50%),

 and it is possible that the cluster is not running any resources,

 and it is possible that the node cannot be put in a standby mode (shows:
 communication error),

 and it is possible that the cluster is running all resources, but applied
 configuration is not guaranteed to be fully loaded (some rules can be
 missed).


 ---

 Conclusions:

 ---

 It is possible that in some rare cases (see comments to the bug) the
 cluster will work, but in that case its working state is unstable and the
 cluster can stop working every moment.


 So, is this correct? Do my assumptions make any sense? I didn't find any
 other explanation on the net ...

 Corosync needs all interfaces during start and runtime. This doesn't
 mean they must be connected (that would make corosync unusable for a
 physical NIC/switch or cable failure), but they must be up and have a
 correct IP.

 When this is not the case, corosync rebinds to localhost and weird
 things happen. Removal of this rebinding has been a long-time TODO, but
 there are still more important bugs (especially because the rebind can be avoided).

 Regards,
   Honza




 Thank you,
 Kostya

 On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
 konstantin.ponomare...@gmail.com wrote:

 Hi guys,

 Corosync fails to start if there is no such network interface configured
 in the system.
 Even with rrp_mode: passive the problem is the same when at least one
 network interface is not configured in the system.

 Is this the expected behavior?
 I thought that when you use redundant rings, it is enough to have at
 least
 one NIC configured in the system. Am I wrong?

 Thank you,
 Kostya




 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http

Re: [Pacemaker] Corosync fails to start when NIC is absent

2015-01-13 Thread Jan Friesse
Kostiantyn,


 According to the https://access.redhat.com/solutions/638843 , the
 interface, that is defined in the corosync.conf, must be present in the
 system (see at the bottom of the article, section ROOT CAUSE).
 To confirm that I made a couple of tests.
 
 Here is a part of the corosync.conf file (in a free-write form) (also
 attached the origin config file):
 ===
 rrp_mode: passive
 ring0_addr is defined in corosync.conf
 ring1_addr is defined in corosync.conf
 ===
 
 ---
 
 Two-node cluster
 
 ---
 
 Test #1:
 --
 IP for ring0 is not defined in the system:
 --
 Start Corosync simultaneously on both nodes.
 Corosync fails to start.
 From the logs:
 Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] parse error in
 config: No interfaces defined
 Jan 08 09:43:56 [2992] A6-402-2 corosync error [MAIN ] Corosync Cluster
 Engine exiting with status 8 at main.c:1343.
 Result: Corosync and Pacemaker are not running.
 
 Test #2:
 --
 IP for ring1 is not defined in the system:
 --
 Start Corosync simultaneously on both nodes.
 Corosync starts.
 Start Pacemaker simultaneously on both nodes.
 Pacemaker fails to start.
 From the logs, the last writes from the corosync:
 Jan 8 16:31:29 daemon.err27 corosync[3728]: [TOTEM ] Marking ringid 0
 interface 169.254.1.3 FAULTY
 Jan 8 16:31:30 daemon.notice29 corosync[3728]: [TOTEM ] Automatically
 recovered ring 0
 Result: Corosync and Pacemaker are not running.
 
 
 Test #3:
 
 rrp_mode: active leads to the same result, except Corosync and Pacemaker
 init scripts return status running.
 But still vim /var/log/cluster/corosync.log shows a lot of errors like:
 Jan 08 16:30:47 [4067] A6-402-1 cib: error: pcmk_cpg_dispatch: Connection
 to the CPG API failed: Library error (2)
 
 Result: Corosync and Pacemaker show their statuses as running, but
 crm_mon cannot connect to the cluster database. And half of the
 Pacemaker's services are not running (including Cluster Information Base
 (CIB)).
 
 
 ---
 
 For a single node mode
 
 ---
 
 IP for ring0 is not defined in the system:
 
 Corosync fails to start.
 
 IP for ring1 is not defined in the system:
 
 Corosync and Pacemaker are started.
 
 It is possible that configuration will be applied successfully (50%),
 
 and it is possible that the cluster is not running any resources,
 
 and it is possible that the node cannot be put in a standby mode (shows:
 communication error),
 
 and it is possible that the cluster is running all resources, but applied
 configuration is not guaranteed to be fully loaded (some rules can be
 missed).
 
 
 ---
 
 Conclusions:
 
 ---
 
 It is possible that in some rare cases (see comments to the bug) the
 cluster will work, but in that case its working state is unstable and the
 cluster can stop working every moment.
 
 
 So, is this correct? Do my assumptions make any sense? I didn't find any
 other explanation on the net ...

Corosync needs all interfaces during start and runtime. This doesn't
mean they must be connected (that would make corosync unusable for a
physical NIC/switch or cable failure), but they must be up and have a
correct IP.

When this is not the case, corosync rebinds to localhost and weird
things happen. Removal of this rebinding has been a long-time TODO, but
there are still more important bugs (especially because the rebind can be avoided).
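In practice that means keeping the address assigned and the link administratively up on every ring interface, even one without a cable plugged in. A sketch (the device name and address are placeholders; use whatever corosync.conf expects for that ring):

ip addr add 169.254.1.3/24 dev eth1    # the address configured for this ring
ip link set eth1 up                    # keep the link up even without carrier
ip -o addr show dev eth1               # verify before starting corosync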

Regards,
  Honza

 
 
 
 Thank you,
 Kostya
 
 On Fri, Jan 9, 2015 at 11:10 AM, Kostiantyn Ponomarenko 
 konstantin.ponomare...@gmail.com wrote:
 
 Hi guys,

 Corosync fails to start if there is no such network interface configured
 in the system.
 Even with rrp_mode: passive the problem is the same when at least one
 network interface is not configured in the system.

 Is this the expected behavior?
 I thought that when you use redundant rings, it is enough to have at least
 one NIC configured in the system. Am I wrong?

 Thank you,
 Kostya

 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CoroSync's UDPu transport for public IP addresses?

2015-01-05 Thread Jan Friesse
Dmitry,


 Sure, in logs I see adding new UDPU member {IP_ADDRESS} (so DNS names
 are definitely resolved), but in practice the cluster does not work, as I
 said above. So validations of ringX_addr in corosync.conf would be very
 helpful in corosync.

that's weird, because as long as DNS is resolved, corosync works only
with IP. This means the code path is exactly the same with IP or with DNS. Do
you have logs from corosync?

Honza


 
 On Fri, Jan 2, 2015 at 2:49 PM, Jan Friesse jfrie...@redhat.com wrote:
 
 Dmitry,


  No, I meant that if you pass a domain name in ring0_addr, there are no
 errors in logs, corosync even seems to find nodes (based on its logs), And
 crm_node -l shows them, but in practice nothing really works. A verbose
 error message would be very helpful in such case.


 This sounds weird. Are you sure that DNS names really maps to correct IP
 address? In logs there should be something like adding new UDPU member
 {IP_ADDRESS}.

 Regards,
   Honza


 On Tuesday, December 30, 2014, Daniel Dehennin 
 daniel.dehen...@baby-gnu.org
 wrote:

  Dmitry Koterov dmitry.kote...@gmail.com writes:

  Oh, seems I've found the solution! At least two mistakes was in my
 corosync.conf (BTW logs did not say about any errors, so my conclusion
 is
 based on my experiments only).

 1. nodelist.node MUST contain only IP addresses. No hostnames! They

 simply

 do not work, crm status shows no nodes. And no warnings are in logs
 regarding this.


 You can add name like this:

  nodelist {
node {
  ring0_addr: public-ip-address-of-the-first-machine
  name: node1
}
node {
  ring0_addr: public-ip-address-of-the-second-machine
  name: node2
}
  }

 I used it on Ubuntu Trusty with udpu.

 Regards.

 --
 Daniel Dehennin
 Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
 Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF




 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CoroSync's UDPu transport for public IP addresses?

2015-01-02 Thread Jan Friesse

Dmitry,



No, I meant that if you pass a domain name in ring0_addr, there are no
errors in logs, corosync even seems to find nodes (based on its logs), And
crm_node -l shows them, but in practice nothing really works. A verbose
error message would be very helpful in such case.


This sounds weird. Are you sure that the DNS names really map to the correct IP
addresses? In logs there should be something like adding new UDPU member
{IP_ADDRESS}.


Regards,
  Honza



On Tuesday, December 30, 2014, Daniel Dehennin daniel.dehen...@baby-gnu.org
wrote:


Dmitry Koterov dmitry.kote...@gmail.com writes:


Oh, seems I've found the solution! At least two mistakes was in my
corosync.conf (BTW logs did not say about any errors, so my conclusion is
based on my experiments only).

1. nodelist.node MUST contain only IP addresses. No hostnames! They

simply

do not work, crm status shows no nodes. And no warnings are in logs
regarding this.


You can add name like this:

 nodelist {
   node {
 ring0_addr: public-ip-address-of-the-first-machine
 name: node1
   }
   node {
 ring0_addr: public-ip-address-of-the-second-machine
 name: node2
   }
 }

I used it on Ubuntu Trusty with udpu.

Regards.

--
Daniel Dehennin
Récupérer ma clef GPG: gpg --recv-keys 0xCC1E9E5B7A6FE2DF
Fingerprint: 3E69 014E 5C23 50E8 9ED6  2AAD CC1E 9E5B 7A6F E2DF





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CMAN and Pacemaker with IPv6

2014-07-16 Thread Jan Friesse
Teerapatr

 Dear Honza,
 
 Sorry to say this, but I found a new error again. LOL
 
 This time I already installed 1.4.1-17 as you advised,
 and the nodename, without altname, is mapped to IPv6 using the hosts file.
 Everything is fine, but the two nodes can't communicate with each other.
 So I added the multicast address manually, using the command `ccs -f
 /etc/cluster/cluster.conf --setmulticast ff::597` on both nodes.
 After that CMAN cannot start.

ff:: is not a valid ipv6 multicast address. Use something like ff3e::597.
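For example, re-running the same ccs command with a valid address on both nodes (a sketch mirroring the command used above):

ccs -f /etc/cluster/cluster.conf --setmulticast ff3e::597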


 
 Starting cluster:
Checking if cluster has been disabled at boot...[  OK  ]
Checking Network Manager... [  OK  ]
Global setup... [  OK  ]
Loading kernel modules...   [  OK  ]
Mounting configfs...[  OK  ]
Starting cman... Timed-out waiting for cluster Check cluster logs for 
 details
[FAILED]
 
 I also found a lot of log messages, but I think this is where the problem
 occurs.
 
 Jul 15 13:36:14 corosync [MAIN  ] Corosync Cluster Engine ('1.4.1'):
 started and ready to provide service.
 Jul 15 13:36:14 corosync [MAIN  ] Corosync built-in features: nss dbus rdma 
 snmp
 Jul 15 13:36:14 corosync [MAIN  ] Successfully read config from
 /etc/cluster/cluster.conf
 Jul 15 13:36:14 corosync [MAIN  ] Successfully parsed cman config
 Jul 15 13:36:14 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
 Jul 15 13:36:14 corosync [TOTEM ] Initializing transmit/receive
 security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
 Jul 15 13:36:14 corosync [TOTEM ] Unable to bind the socket to receive
 multicast packets: Cannot assign requested address (99)
 Jul 15 13:36:14 corosync [TOTEM ] Could not set traffic priority:
 Socket operation on non-socket (88)
 Jul 15 13:36:14 corosync [TOTEM ] The network interface
 [2001:db8::151] is now up.
 Jul 15 13:36:14 corosync [QUORUM] Using quorum provider quorum_cman
 Jul 15 13:36:14 corosync [SERV  ] Service engine loaded: corosync
 cluster quorum service v0.1
 Jul 15 13:36:14 corosync [CMAN  ] CMAN 3.0.12.1 (built Apr 14 2014
 09:36:10) started
 Jul 15 13:36:14 corosync [SERV  ] Service engine loaded: corosync CMAN
 membership service 2.90
 Jul 15 13:36:14 corosync [SERV  ] Service engine loaded: openais
 checkpoint service B.01.01
 Jul 15 13:36:14 corosync [SERV  ] Service engine loaded: corosync
 extended virtual synchrony service
 Jul 15 13:36:14 corosync [SERV  ] Service engine loaded: corosync
 configuration service
 Jul 15 13:36:14 corosync [SERV  ] Service engine loaded: corosync
 cluster closed process group service v1.01
 Jul 15 13:36:14 corosync [SERV  ] Service engine loaded: corosync
 cluster config database access v1.01
 Jul 15 13:36:14 corosync [SERV  ] Service engine loaded: corosync
 profile loading service
 Jul 15 13:36:14 corosync [QUORUM] Using quorum provider quorum_cman
 Jul 15 13:36:14 corosync [SERV  ] Service engine loaded: corosync
 cluster quorum service v0.1
 Jul 15 13:36:14 corosync [MAIN  ] Compatibility mode set to whitetank.
 Using V1 and V2 of the synchronization engine.
 Jul 15 13:36:17 corosync [MAIN  ] Totem is unable to form a cluster
 because of an operating system or network fault. The most common cause
 of this message is that the local firewall is configured improperly.
 Jul 15 13:36:19 corosync [MAIN  ] Totem is unable to form a cluster
 because of an operating system or network fault. The most common cause
 of this message is that the local firewall is configured improperly.
 Jul 15 13:36:20 corosync [MAIN  ] Totem is unable to form a cluster
 because of an operating system or network fault. The most common cause
 of this message is that the local firewall is configured improperly.
 
 I cannot find a solution on the Internet for "[TOTEM ] Unable to bind
 the socket to receive multicast packets: Cannot assign requested
 address (99)".
 Do you have any idea?
 
 Teenigma
 
 On Tue, Jul 15, 2014 at 10:02 AM, Teerapatr Kittiratanachai
 maillist...@gmail.com wrote:
 Honza

 Great, Thank you very much.

 But the awkward thing for me is that I'm using the package from the OpenSUSE
 repo. When I switch back to the CentOS repo, which carries a lower version,
 a dependency problem occurs.

 Anyway, thank you for your help.

 Teenigma

 On Mon, Jul 14, 2014 at 8:51 PM, Jan Friesse jfrie...@redhat.com wrote:
 Honza,

 How do I include the patch with my CentOS package?
 Do I need to compile them manually?


 Yes. Also, the official CentOS version was never 1.4.5. If you are using CentOS,
 just use the stock 1.4.1-17.1. The patch is included there.

 Honza



 TeEniGMa

 On Mon, Jul 14, 2014 at 3:21 PM, Jan Friesse jfrie...@redhat.com wrote:

 Teerapatr,


 For more information,


 these are LOG from /var/log/messages
 ...
 Jul 14 10:28:07 wh00 kernel: : DLM (built Mar 25 2014 20:01:13)
 installed
 Jul 14 10:28:07 wh00 corosync[2716]:   [MAIN

Re: [Pacemaker] CMAN and Pacemaker with IPv6

2014-07-14 Thread Jan Friesse

Teerapatr,

 For more information,


these are LOG from /var/log/messages
...
Jul 14 10:28:07 wh00 kernel: : DLM (built Mar 25 2014 20:01:13) installed
Jul 14 10:28:07 wh00 corosync[2716]:   [MAIN  ] Corosync Cluster
Engine ('1.4.5'): started and ready to provide service.
Jul 14 10:28:07 wh00 corosync[2716]:   [MAIN  ] Corosync built-in features: nss
Jul 14 10:28:07 wh00 corosync[2716]:   [MAIN  ] Successfully read
config from /etc/cluster/cluster.conf
Jul 14 10:28:07 wh00 corosync[2716]:   [MAIN  ] Successfully parsed cman config
Jul 14 10:28:07 wh00 corosync[2716]:   [TOTEM ] Initializing transport
(UDP/IP Multicast).
Jul 14 10:28:07 wh00 corosync[2716]:   [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul 14 10:28:07 wh00 corosync[2716]:   [TOTEM ] The network interface is down.
^^^ This line is important. It means corosync was unable to find an
interface with the given IPv6 address. There was a regression in v1.4.5
causing this behavior. It's fixed in v1.4.6 (the patch is
https://github.com/corosync/corosync/commit/d76759ec26ecaeb9cc01f49e9eb0749b61454d27),
so you can either apply the patch or (recommended) upgrade to 1.4.7.


Regards,
  Honza



Jul 14 10:28:10 wh00 pacemaker: Aborting startup of Pacemaker Cluster Manager
...

Te

On Mon, Jul 14, 2014 at 10:07 AM, Teerapatr Kittiratanachai
maillist...@gmail.com wrote:

Dear Honza,

Sorry for the late reply.
I have now tested with an all-new configuration:
IPv6 only, and with no altname.

I face with error below,

Starting cluster:
Checking if cluster has been disabled at boot...[  OK  ]
Checking Network Manager... [  OK  ]
Global setup... [  OK  ]
Loading kernel modules...   [  OK  ]
Mounting configfs...[  OK  ]
Starting cman... corosync died with signal: 6 Check cluster logs for details
[FAILED]

And there is definitely no firewall enabled; I also configured the
multicast address manually.
Could you advise me on a solution?

Many thanks in advance.
Te

On Thu, Jul 10, 2014 at 6:14 PM, Jan Friesse jfrie...@redhat.com wrote:

Teerapatr,


Hi Honza,

As you said I use the nodename identify by hostname (which be accessed
via IPv6) and the node also has the altname (which be IPv4 address).



This doesn't work. Both hostname and altname have to be same IP version.


Now, I configure the mcast address for both nodename and altname
manually. CMAN and Pacemaker can start as well, but they don't
communicate with the other node.


PLease make sure (as I've wrote in previous email) your firewall doesn't
block mcast and corosync traffic (just disable it) and switch doesn't
block multicast (this is very often the case). If these are VMs, make
sure to properly configure bridge (just disable firewall) and allow
mcast_querier.

Honza


On node0, crm_mon show node1 offline. In the same way, node one show
node0 is down. So the split brain problem occur here.

Regards,
Te

On Thu, Jul 10, 2014 at 2:50 PM, Jan Friesse jfrie...@redhat.com wrote:

Teerapatr,


OK, some problems are solved.
I use the incorrect hostname.

For now, the new problem has occured.

   Starting cman... Node address family does not match multicast address family
Unable to get the configuration
Node address family does not match multicast address family
cman_tool: corosync daemon didn't start Check cluster logs for details
[FAILED]



This looks like one of your node is also reachable via ipv4 and ipv4
resolving is proffered. Please make sure to set only ipv6 address and
try it again. Of course set mcast addr by hand maybe helpful (even-tho I
don't believe it will solve problem you are hitting)).

Also make sure ip6tables are properly configured and your switch is able
to pass ipv6 mcast traffic.

Regards,
   Honza


How can i fix it? Or just assigned the multicast address in the configuration?

Regards,
Te

On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai
maillist...@gmail.com wrote:

I not found any LOG message

/var/log/messages
...
Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed
Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster Manager
...

and this is what display when I try to start pacemaker

# /etc/init.d/pacemaker start
Starting cluster:
Checking if cluster has been disabled at boot...[  OK  ]
Checking Network Manager... [  OK  ]
Global setup... [  OK  ]
Loading kernel modules...   [  OK  ]
Mounting configfs...[  OK  ]
Starting cman... Cannot find node name in cluster.conf
Unable to get the configuration
Cannot find node name in cluster.conf
cman_tool

Re: [Pacemaker] CMAN and Pacemaker with IPv6

2014-07-14 Thread Jan Friesse

Honza,

How do I include the patch with my CentOS package?
Do I need to compile them manually?


Yes. Also, the official CentOS version was never 1.4.5. If you are using
CentOS, just use the stock 1.4.1-17.1. The patch is included there.


Honza



TeEniGMa

On Mon, Jul 14, 2014 at 3:21 PM, Jan Friesse jfrie...@redhat.com wrote:

Teerapatr,



For more information,


these are LOG from /var/log/messages
...
Jul 14 10:28:07 wh00 kernel: : DLM (built Mar 25 2014 20:01:13) installed
Jul 14 10:28:07 wh00 corosync[2716]:   [MAIN  ] Corosync Cluster
Engine ('1.4.5'): started and ready to provide service.
Jul 14 10:28:07 wh00 corosync[2716]:   [MAIN  ] Corosync built-in
features: nss
Jul 14 10:28:07 wh00 corosync[2716]:   [MAIN  ] Successfully read
config from /etc/cluster/cluster.conf
Jul 14 10:28:07 wh00 corosync[2716]:   [MAIN  ] Successfully parsed cman
config
Jul 14 10:28:07 wh00 corosync[2716]:   [TOTEM ] Initializing transport
(UDP/IP Multicast).
Jul 14 10:28:07 wh00 corosync[2716]:   [TOTEM ] Initializing
transmit/receive security: libtomcrypt SOBER128/SHA1HMAC (mode 0).
Jul 14 10:28:07 wh00 corosync[2716]:   [TOTEM ] The network interface is
down.


^^^ This line is important. This means, corosync was unable to find
interface with given IPv6 address. There was regression in v1.4.5 causing
this behavior. It's fixed in v1.4.6 (patch is
https://github.com/corosync/corosync/commit/d76759ec26ecaeb9cc01f49e9eb0749b61454d27).
So you can either apply the patch or (recommended) upgrade to 1.4.7.

Regards,
   Honza




Jul 14 10:28:10 wh00 pacemaker: Aborting startup of Pacemaker Cluster
Manager
...

Te

On Mon, Jul 14, 2014 at 10:07 AM, Teerapatr Kittiratanachai
maillist...@gmail.com wrote:


Dear Honza,

Sorry for late reply.
After I have tested with all new configuration.
On IPv6 only, and with no altname.

I face with error below,

Starting cluster:
 Checking if cluster has been disabled at boot...[  OK  ]
 Checking Network Manager... [  OK  ]
 Global setup... [  OK  ]
 Loading kernel modules...   [  OK  ]
 Mounting configfs...[  OK  ]
 Starting cman... corosync died with signal: 6 Check cluster logs for
details
 [FAILED]

And, exactly, there are no any enabled firewall, I also configure the
Multicast address as manual.
Could you advise me the solution?

Many thanks in advance.
Te

On Thu, Jul 10, 2014 at 6:14 PM, Jan Friesse jfrie...@redhat.com wrote:


Teerapatr,


Hi Honza,

As you said I use the nodename identify by hostname (which be accessed
via IPv6) and the node also has the altname (which be IPv4 address).



This doesn't work. Both hostname and altname have to be same IP version.


Now, I configure the mcast address for both nodename and altname
manually. CMAN and Pacemaker can start as well, but they don't
communicate with the other node.



PLease make sure (as I've wrote in previous email) your firewall doesn't
block mcast and corosync traffic (just disable it) and switch doesn't
block multicast (this is very often the case). If these are VMs, make
sure to properly configure bridge (just disable firewall) and allow
mcast_querier.

Honza


On node0, crm_mon show node1 offline. In the same way, node one show
node0 is down. So the split brain problem occur here.

Regards,
Te

On Thu, Jul 10, 2014 at 2:50 PM, Jan Friesse jfrie...@redhat.com
wrote:


Teerapatr,


OK, some problems are solved.
I use the incorrect hostname.

For now, the new problem has occured.

Starting cman... Node address family does not match multicast
address family
Unable to get the configuration
Node address family does not match multicast address family
cman_tool: corosync daemon didn't start Check cluster logs for
details
 [FAILED]



This looks like one of your node is also reachable via ipv4 and ipv4
resolving is proffered. Please make sure to set only ipv6 address and
try it again. Of course set mcast addr by hand maybe helpful (even-tho
I
don't believe it will solve problem you are hitting)).

Also make sure ip6tables are properly configured and your switch is
able
to pass ipv6 mcast traffic.

Regards,
Honza


How can i fix it? Or just assigned the multicast address in the
configuration?

Regards,
Te

On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai
maillist...@gmail.com wrote:


I not found any LOG message

/var/log/messages
...
Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01)
installed
Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker
Cluster Manager
...

and this is what display when I try to start pacemaker

# /etc/init.d/pacemaker start
Starting cluster:
 Checking if cluster has been disabled at boot...[  OK  ]
 Checking Network Manager... [  OK  ]
 Global setup

Re: [Pacemaker] CMAN and Pacemaker with IPv6

2014-07-10 Thread Jan Friesse
Teerapatr,

 OK, some problems are solved.
 I use the incorrect hostname.
 
 For now, the new problem has occured.
 
   Starting cman... Node address family does not match multicast address family
 Unable to get the configuration
 Node address family does not match multicast address family
 cman_tool: corosync daemon didn't start Check cluster logs for details
[FAILED]
 

This looks like one of your nodes is also reachable via ipv4 and ipv4
resolving is preferred. Please make sure to set only the ipv6 address and
try it again. Of course, setting the mcast addr by hand may be helpful (even
though I don't believe it will solve the problem you are hitting).

Also make sure ip6tables are properly configured and your switch is able
to pass ipv6 mcast traffic.
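A quick way to check and open the relevant ports (a sketch; 5404-5405 are the default corosync/cman UDP ports, adjust if mcastport was changed):

ip6tables -L -n                                        # inspect the current IPv6 rules
ip6tables -I INPUT -p udp --dport 5404:5405 -j ACCEPT  # allow corosync traffic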

Regards,
  Honza

 How can I fix it? Or should I just assign the multicast address in the configuration?
 
 Regards,
 Te
 
 On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai
 maillist...@gmail.com wrote:
 I not found any LOG message

 /var/log/messages
 ...
 Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed
 Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster 
 Manager
 ...

 and this is what display when I try to start pacemaker

 # /etc/init.d/pacemaker start
 Starting cluster:
Checking if cluster has been disabled at boot...[  OK  ]
Checking Network Manager... [  OK  ]
Global setup... [  OK  ]
Loading kernel modules...   [  OK  ]
Mounting configfs...[  OK  ]
Starting cman... Cannot find node name in cluster.conf
 Unable to get the configuration
 Cannot find node name in cluster.conf
 cman_tool: corosync daemon didn't start Check cluster logs for details
[FAILED]
 Stopping cluster:
Leaving fence domain... [  OK  ]
Stopping gfs_controld...[  OK  ]
Stopping dlm_controld...[  OK  ]
Stopping fenced...  [  OK  ]
Stopping cman...[  OK  ]
Unloading kernel modules... [  OK  ]
Unmounting configfs...  [  OK  ]
 Aborting startup of Pacemaker Cluster Manager

 One other thing: because of this problem, I removed the AAAA record from
 DNS for now and mapped it in the /etc/hosts file instead, as shown below.

 /etc/hosts
 ...
 2001:db8:0:1::1   node0.example.com
 2001:db8:0:1::2   node1.example.com
 ...

 Is there any configure that help me to got more log ?

 On Thu, Jul 10, 2014 at 5:06 AM, Andrew Beekhof and...@beekhof.net wrote:

 On 9 Jul 2014, at 9:15 pm, Teerapatr Kittiratanachai 
 maillist...@gmail.com wrote:

 Dear All,

  I have implemented HA on dual-stack servers.
  At first I did not deploy the IPv6 (AAAA) record in DNS, and CMAN and
  PACEMAKER worked as normal.
  But after I created the AAAA record on the DNS server, I found the error that
  CMAN can't start.

  Do CMAN and PACEMAKER support IPv6?

 I don't think pacemaker cares.
 What errors did you get?

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org

 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] CMAN and Pacemaker with IPv6

2014-07-10 Thread Jan Friesse
Teerapatr,

 Hi Honza,
 
 As you said, I use the nodename identified by hostname (which is accessed
 via IPv6), and the node also has an altname (which is an IPv4 address).
 

This doesn't work. Both the hostname and the altname have to be the same IP version.
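In cluster.conf terms that means both names must resolve to the same address family, for example (a rough sketch using the usual altname element; the names are placeholders and should both resolve to IPv6 addresses):

<clusternodes>
  <clusternode name="node0.example.com" nodeid="1">
    <altname name="node0-backup.example.com"/>
  </clusternode>
  <clusternode name="node1.example.com" nodeid="2">
    <altname name="node1-backup.example.com"/>
  </clusternode>
</clusternodes>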

 Now, I configure the mcast address for both nodename and altname
 manually. CMAN and Pacemaker can start as well, but they don't
 communicate with the other node.

Please make sure (as I wrote in the previous email) that your firewall doesn't
block mcast and corosync traffic (just disable it) and that your switch doesn't
block multicast (this is very often the case). If these are VMs, make
sure to properly configure the bridge (just disable the firewall) and allow the
mcast_querier.
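
As a quick test, a sketch for a RHEL/CentOS 6 style setup (br0 is an assumed bridge name on the VM host, and the querier knob is only there if the host kernel exposes it):

# on the cluster nodes: temporarily disable the IPv6 firewall
service ip6tables stop

# on the VM host: let the bridge handle multicast group membership
echo 1 > /sys/class/net/br0/bridge/multicast_querier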

Honza

 On node0, crm_mon show node1 offline. In the same way, node one show
 node0 is down. So the split brain problem occur here.
 
 Regards,
 Te
 
 On Thu, Jul 10, 2014 at 2:50 PM, Jan Friesse jfrie...@redhat.com wrote:
 Teerapatr,

 OK, some problems are solved.
 I use the incorrect hostname.

 For now, the new problem has occured.

   Starting cman... Node address family does not match multicast address 
 family
 Unable to get the configuration
 Node address family does not match multicast address family
 cman_tool: corosync daemon didn't start Check cluster logs for details
[FAILED]


 This looks like one of your node is also reachable via ipv4 and ipv4
 resolving is proffered. Please make sure to set only ipv6 address and
 try it again. Of course set mcast addr by hand maybe helpful (even-tho I
 don't believe it will solve problem you are hitting)).

 Also make sure ip6tables are properly configured and your switch is able
 to pass ipv6 mcast traffic.

 Regards,
   Honza

 How can i fix it? Or just assigned the multicast address in the 
 configuration?

 Regards,
 Te

 On Thu, Jul 10, 2014 at 7:52 AM, Teerapatr Kittiratanachai
 maillist...@gmail.com wrote:
 I not found any LOG message

 /var/log/messages
 ...
 Jul 10 07:44:19 nwh00 kernel: : DLM (built Jun 19 2014 21:16:01) installed
 Jul 10 07:44:22 nwh00 pacemaker: Aborting startup of Pacemaker Cluster 
 Manager
 ...

 and this is what display when I try to start pacemaker

 # /etc/init.d/pacemaker start
 Starting cluster:
Checking if cluster has been disabled at boot...[  OK  ]
Checking Network Manager... [  OK  ]
Global setup... [  OK  ]
Loading kernel modules...   [  OK  ]
Mounting configfs...[  OK  ]
Starting cman... Cannot find node name in cluster.conf
 Unable to get the configuration
 Cannot find node name in cluster.conf
 cman_tool: corosync daemon didn't start Check cluster logs for details
[FAILED]
 Stopping cluster:
Leaving fence domain... [  OK  ]
Stopping gfs_controld...[  OK  ]
Stopping dlm_controld...[  OK  ]
Stopping fenced...  [  OK  ]
Stopping cman...[  OK  ]
Unloading kernel modules... [  OK  ]
Unmounting configfs...  [  OK  ]
 Aborting startup of Pacemaker Cluster Manager

 another one thing, according to the happened problem, I remove the
  record from DNS for now and map it in to /etc/hosts files
 instead, as shown below.

 /etc/hosts
 ...
 2001:db8:0:1::1   node0.example.com
 2001:db8:0:1::2   node1.example.com
 ...

 Is there any configure that help me to got more log ?

 On Thu, Jul 10, 2014 at 5:06 AM, Andrew Beekhof and...@beekhof.net wrote:

 On 9 Jul 2014, at 9:15 pm, Teerapatr Kittiratanachai 
 maillist...@gmail.com wrote:

 Dear All,

 I has implemented the HA on dual stack servers,
 Firstly, I doesn't deploy IPv6 record on DNS yet. The CMAN and
 PACEMAKER can work as normal.
 But, after I create  record on DNS server, i found the error that
 cann't start CMAN.

 Are CMAN and PACEMAKER  support the IPv6?

 I don;t think pacemaker cares.
 What errors did you get?

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org

Re: [Pacemaker] [Openais] unmanaged resource failed - how to get back?

2014-06-30 Thread Jan Friesse

Stefan,
sending this to the Pacemaker list because your question seems not to be Corosync
related.


Regards,
  Honza

Senftleben, Stefan (itsc) wrote:

Hello,

I set the cluster in a maintainance mode with: crm configure property 
maintenance-mode=true .
Afterwards I did stop one resource manually, but after turning of the maintainance mode, 
the resource is in status unmanaged FAILED.
But the resource is running already.
What shoud I do now, to get the resource managed by pacemaker?

Greetings
Stefan



Last updated: Mon Jun 30 12:42:45 2014
Last change: Mon Jun 30 12:41:33 2014
Stack: openais
Current DC: lxds05 - partition with quorum
Version: 1.1.6-9971ebba4494012a93c03b40a2c58ec0eb60f50c
2 Nodes configured, 2 expected votes
10 Resources configured.


Online: [ lxds05 lxds07 ]

Full list of resources:

Resource Group: group_omd
  pri_fs_omd (ocf::heartbeat:Filesystem):Started lxds05
  pri_apache2(ocf::heartbeat:apache):Started lxds05
  pri_nagiosIP   (ocf::heartbeat:IPaddr2):   Started lxds05
Master/Slave Set: ms_drbd_omd [pri_drbd_omd]
  Masters: [ lxds05 ]
  Slaves: [ lxds07 ]
Clone Set: clone_ping [pri_ping]
  Started: [ lxds07 lxds05 ]
res_MailTo_omd_group(ocf::heartbeat:MailTo):Stopped
omd_itsc(ocf::omd:omdnagios):   Started lxds05 (unmanaged) FAILED
res_MailTo_omd_itsc (ocf::heartbeat:MailTo):Stopped

Node Attributes:
* Node lxds05:
 + master-pri_drbd_omd:0 : 1
 + pingd : 3000
* Node lxds07:
 + master-pri_drbd_omd:1 : 1
 + pingd : 3000

Migration summary:
* Node lxds07:
* Node lxds05:
omd_itsc: migration-threshold=100 fail-count=2 last-failure='Mon Jun 30 
12:39:03 2014'

Failed actions:
 omd_itsc_stop_0 (node=lxds05, call=49, rc=1, status=complete): unknown 
error



___
Openais mailing list
open...@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/openais




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Openais] Filesystem vs. Master-Slave MySQL resource

2014-06-03 Thread Jan Friesse
Matej,
this is really question for pacemaker mailing list.

 Hello,
 
 I have the following setup:
 
 2 nodes: db-01, db-02
 Groups of resources:
 fs-01: iscsi+lvm+fs at db-01
 fs-02: iscsi+lvm+fs at db-02
 
 fs-01 is for mounting data files for MySQL at db-01, fs-02 for db-02
 
 MySQL resources:
 
 primitive p_mysql mysql \
 params binary=/usr/bin/mysqld_safe config=/etc/my.cnf 
 datadir=/var/lib/mysql/db replication_user=replicant 
 replication_passwd= test_user=test test_passwd= \
 op start timeout=120 interval=0 \
 op stop timeout=120 interval=0 \
 op promote timeout=120 interval=0 \
 op demote timeout=120 interval=0 \
 op monitor role=Master timeout=30 interval=5 \
 op monitor role=Slave timeout=30 interval=8
 
 ms ms_mysql p_mysql \
 meta notify=true master-max=1 clone-max=2 target-role=Started 
 is-managed=true
 
 
 To force groups at right nodes I have following:
 
 location loc_mysql-1 fs-01 inf: db-01
 location loc_mysql-1n fs-01 -inf: db-02
 location loc_mysql-2 fs-02 inf: db-02
 location loc_mysql-2n fs-02 -inf: db-01
 
 I have troubles with order.
 
 I need to configure startup of ms_mysql after FS mounts.
 There are several scenarios:
 
 1) Both nodes online
 - start both fs-01 and fs-02
 - start ms_mysql, one node as Master, other as Slave
 
 2) Only one node online
 - start related fs-0x
 - start ms_mysql at one node
 
 3) Running both nodes, standby slave
 - stop ms_mysql:Slave
 - stop related fs
 
 4) Running both nodes, standby master
 - demote master
 - promote slave to became master
 - stop slave (ex master)
 - stop related fs
 
 I have troubles to configure the right dependecies betwen fs-01, fs-02, 
 ms_mysql:start, ms_mysql:promote, etc...
 
 I can provide more detais as needed.
 
 Thanks for your help.
 
 Best regards
 Matej Gajdos
 
 — 
 e-mail: matej.gaj...@digmia.com
 
 DIGMIA s.r.o.
 Lazaretská 12
 81108 Bratislava
 
 
 
 
 ___
 Openais mailing list
 open...@lists.linux-foundation.org
 https://lists.linuxfoundation.org/mailman/listinfo/openais
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] auto_tie_breaker in two node cluster

2014-05-21 Thread Jan Friesse
 I am not quite understand how auto_tie_breaker works.
 Say we have a cluster with 2 nodes and enabled auto_tie_breaker feature.
 Each node has 2 NICs. One NIC is used for cluster communication and another
 one is used for providing some services from the cluster.
 So the question is how the nodes will distinguish between two possible
 situations:
 1) connection between the nodes are lost, but the both nodes remain working;
 2) power supply on the node 1 (has the lowest node-id) broke down and node
 2 remain working;
 
 In 1st case, according to the description of the auto_tie_breaker, the node
 with the lowest node-id in the cluster will remain working.
 And in that particular situation it is good result because the both nodes
 are in good state (the both can remain working).
 In 2nd case the only working node is #2 and the node-id of that node is not
 the lowest one. So what will be in this case? What logic will work, because
 we have lost the node with the lowest node id in 2-node cluster?
 
 there is no qdiskd for votequorum yet
 Is there plans to implement it?
 

Kostya,
yes, there are plans to implement a qdisk (a network-based one).
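
For reference, a minimal votequorum sketch for the two-node auto_tie_breaker setup discussed above (the values are assumptions):

quorum {
    provider: corosync_votequorum
    expected_votes: 2
    auto_tie_breaker: 1
}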

Regards,
  Honza


 Many thanks,
 Kostya
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] pacemaker not started by corosync on ubuntu 14.04

2014-05-12 Thread Jan Friesse

Vladimir,

Vladimir wrote:

Hello everyone,

I'm trying to get corosync/pacemaker run on Ubuntu 14.04. In my Ubuntu
12.04 setups pacemaker was started by corosync. Actually I thought the


Yes. 12.04 used corosync 1.x with pacemaker plugin.


service {...} section in the corosync.conf is specified for this
purpose. Of course I could put pacemaker into the runlevel but I asked
myself if the behaviour was just changed or if I maybe have a mistake in
my corosync.conf.



The behavior just changed. 14.04 uses corosync 2.x and there are no plugins
(so no service section). Pacemaker is therefore no longer started by corosync and
you have to start both corosync and pacemaker (I believe upstart can
handle dependencies, so starting only pacemaker is probably enough).
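
For example, a sketch assuming the packaged init/upstart scripts on 14.04:

sudo service corosync start
sudo service pacemaker start

# and to have them started at boot:
sudo update-rc.d corosync defaults
sudo update-rc.d pacemaker defaults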


Regards,
  Honza


I started with this minimal corosync.conf:

totem {
 version: 2
 secauth: off
 interface {
 ringnumber: 0
 bindnetaddr: 172.16.100.0
 mcastaddr: 239.255.42.1
 mcastport: 5405
 }
}

service {
name: pacemaker
ver: 1
}

quorum {
 provider: corosync_votequorum
expected_votes: 2
}

aisexec {
 user:   root
 group:  root
}

logging {
 fileline: off
 to_stderr: yes
 to_logfile: no
 to_syslog: yes
 syslog_facility: daemon
 debug: on
 timestamp: on
 logger_subsys {
 subsys: AMF
 debug: off
 tags: enter|leave|trace1|trace2|trace3|trace4|trace6
 }
}

Thanks in advance.

Kind regards
Vladimir

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org





___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] corosync [TOTEM ] Process pause detected for 577 ms

2014-05-05 Thread Jan Friesse
Emmanuel,

emmanuel segura wrote:
 Helllo Jan,
 
 I'm using corosync+pacemaker on Sles 11 Sp1 and this is a critical system,

Oh, ok.

 i don't think i'll get the authorization for upgrade system, but i would
 like to know if there is any bug about this issue in my current corosync
 release.

This is hard to say. The SUSE guys probably included many patches, so it
would make sense to try to contact SUSE support.

After a very, very quick look at git, the following patches may be related:
559d4083ed8355fe83f275e53b9c8f52a91694b2,
02c5dffa5bb8579c223006fa1587de9ba7409a3d,
64d0e5ace025cc929e42896c5d6beb3ef75b8244,
6fae42ba72006941c1fde99616ea30f4f10ebb38,
c7e686181bcd0e975b09725502bef02c7d0c338a.

But still keep in mind that between the latest 1.3.6 (which I believe is more
or less what you are using) and the current origin/flatiron there are 118 patches...
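
If you have the upstream corosync git tree handy, a quick way to check (a sketch; the v-prefixed tag name is an assumption about the upstream tagging scheme):

git log --oneline v1.3.6..origin/flatiron | wc -l   # roughly the 118 patches mentioned above
git show 559d4083ed8355fe83f275e53b9c8f52a91694b2   # inspect one of the candidate fixes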

Regards,
  Honza

 
 Thanks
 Emmanuel
 
 
 2014-04-30 17:07 GMT+02:00 Jan Friesse jfrie...@redhat.com:
 
 Emmanuel,

 emmanuel segura napsal(a):
 Hello Jan,

 Thanks for the explanation, but i saw this in my log.


 ::

 corosync [TOTEM ] Process pause detected for 577 ms, flushing membership
 messages.
 corosync [TOTEM ] Process pause detected for 538 ms, flushing membership
 messages.
 corosync [TOTEM ] A processor failed, forming new configuration.
 corosync [CLM   ] CLM CONFIGURATION CHANGE
 corosync [CLM   ] New Configuration:
 corosync [CLM   ]   r(0) ip(10.xxx.xxx.xxx)
 corosync [CLM   ] Members Left:
 corosync [CLM   ]   r(0) ip(10.xxx.xxx.xxx)
 corosync [CLM   ] Members Joined:
 corosync [pcmk  ] notice: pcmk_peer_update: Transitional membership event
 on ring 6904: memb=1, new=0, lost=1
 corosync [pcmk  ] info: pcmk_peer_update: memb: node01 891257354
 corosync [pcmk  ] info: pcmk_peer_update: lost: node02 874480


 :

 when this happen, corosync needs to retransmit the toten?
 from what i understood the toten need to be retransmit, but in my case a
 new configuration was formed

 This my corosync version

 corosync-1.3.3-0.3.1


 1.3.3 is unsupported for ages. Please upgrade to newest 1.4.6 (if you
 are using cman) or 2.3.3 (if you are not using cman). Also please change
 your pacemaker to not use plugin (upgrade to 2.3.3 will solve it
 automatically, because plugins in corosync 2.x are no longer support).

 Regards,
   Honza


 Thanks


 2014-04-30 9:42 GMT+02:00 Jan Friesse jfrie...@redhat.com:

 Emmanuel,
 there is no need to trigger fencing on Process pause detected

 Also fencing is not triggered if membership didn't changed. So let's say
 token was lost but during gather state all nodes replied, then there is
 no change of membership and no need to fence.

 I believe your situation was:
 - one node is little overloaded
 - token lost
 - overload over
 - gather state
 - every node is alive
 - no fencing

 Regards,
   Honza

 emmanuel segura napsal(a):
 Hello Jan,

 Forget the last mail:

 Hello Jan,

 I found this problem in two hp blade system and the strange thing is
 the
 fencing was not triggered :(, but it's enabled


 2014-04-25 18:36 GMT+02:00 emmanuel segura emi2f...@gmail.com:

 Hello Jan,

 I found this problem in two hp blade system and the strange thing is
 the
 fencing was triggered :(


 2014-04-25 9:27 GMT+02:00 Jan Friesse jfrie...@redhat.com:

 Emanuel,

 emmanuel segura napsal(a):

  Hello List,

 I have this two lines in my cluster logs, somebody can help to know
 what
 this means.

 
 
 ::

 corosync [TOTEM ] Process pause detected for 577 ms, flushing
 membership
 messages.
 corosync [TOTEM ] Process pause detected for 538 ms, flushing
 membership
 messages.


 Corosync internally checks gap between member join messages. If such
 gap
 is  token/2, it means, that corosync was not scheduled to run by
 kernel
 for too long, and it should discard membership messages.

 Original intend was to detect paused process. If pause is detected,
 it's
 better to discard old membership messages and initiate new query then
 sending outdated view.

 So there are various reasons why this is triggered, but today it's
 usually VM with overloaded host machine.



  corosync [TOTEM ] A processor failed, forming new configuration.

 
 
 ::

 I know the corosync [TOTEM ] A processor failed, forming new
 configuration message is when the toten package is definitely lost.

 Thanks


 Regards,
   Honza



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman

Re: [Pacemaker] corosync [TOTEM ] Process pause detected for 577 ms

2014-04-30 Thread Jan Friesse
Emmanuel,
there is no need to trigger fencing on "Process pause detected".

Also, fencing is not triggered if the membership didn't change. So let's say
the token was lost but during the gather state all nodes replied; then there is
no change of membership and no need to fence.

I believe your situation was:
- one node was a little overloaded
- token lost
- overload over
- gather state
- every node is alive
- no fencing

Regards,
  Honza

emmanuel segura wrote:
 Hello Jan,
 
 Forget the last mail:
 
 Hello Jan,
 
 I found this problem in two hp blade system and the strange thing is the
 fencing was not triggered :(, but it's enabled
 
 
 2014-04-25 18:36 GMT+02:00 emmanuel segura emi2f...@gmail.com:
 
 Hello Jan,

 I found this problem in two hp blade system and the strange thing is the
 fencing was triggered :(


 2014-04-25 9:27 GMT+02:00 Jan Friesse jfrie...@redhat.com:

 Emanuel,

 emmanuel segura napsal(a):

  Hello List,

 I have this two lines in my cluster logs, somebody can help to know what
 this means.

 
 
 ::

 corosync [TOTEM ] Process pause detected for 577 ms, flushing membership
 messages.
 corosync [TOTEM ] Process pause detected for 538 ms, flushing membership
 messages.


 Corosync internally checks gap between member join messages. If such gap
 is  token/2, it means, that corosync was not scheduled to run by kernel
 for too long, and it should discard membership messages.

 Original intend was to detect paused process. If pause is detected, it's
 better to discard old membership messages and initiate new query then
 sending outdated view.

 So there are various reasons why this is triggered, but today it's
 usually VM with overloaded host machine.



  corosync [TOTEM ] A processor failed, forming new configuration.

 
 
 ::

 I know the corosync [TOTEM ] A processor failed, forming new
 configuration message is when the toten package is definitely lost.

 Thanks


 Regards,
   Honza



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




 --
 esta es mi vida e me la vivo hasta que dios quiera

 
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] corosync [TOTEM ] Process pause detected for 577 ms

2014-04-30 Thread Jan Friesse
Emmanuel,

emmanuel segura wrote:
 Hello Jan,
 
 Thanks for the explanation, but i saw this in my log.
 
 ::
 
 corosync [TOTEM ] Process pause detected for 577 ms, flushing membership
 messages.
 corosync [TOTEM ] Process pause detected for 538 ms, flushing membership
 messages.
 corosync [TOTEM ] A processor failed, forming new configuration.
 corosync [CLM   ] CLM CONFIGURATION CHANGE
 corosync [CLM   ] New Configuration:
 corosync [CLM   ]   r(0) ip(10.xxx.xxx.xxx)
 corosync [CLM   ] Members Left:
 corosync [CLM   ]   r(0) ip(10.xxx.xxx.xxx)
 corosync [CLM   ] Members Joined:
 corosync [pcmk  ] notice: pcmk_peer_update: Transitional membership event
 on ring 6904: memb=1, new=0, lost=1
 corosync [pcmk  ] info: pcmk_peer_update: memb: node01 891257354
 corosync [pcmk  ] info: pcmk_peer_update: lost: node02 874480
 
 :
 
 when this happen, corosync needs to retransmit the toten?
 from what i understood the toten need to be retransmit, but in my case a
 new configuration was formed
 
 This my corosync version
 
 corosync-1.3.3-0.3.1
 

1.3.3 has been unsupported for ages. Please upgrade to the newest 1.4.6 (if you
are using cman) or 2.3.3 (if you are not using cman). Also please change
your pacemaker to not use the plugin (upgrading to 2.3.3 will solve that
automatically, because plugins in corosync 2.x are no longer supported).

Regards,
  Honza


 Thanks
 
 
 2014-04-30 9:42 GMT+02:00 Jan Friesse jfrie...@redhat.com:
 
 Emmanuel,
 there is no need to trigger fencing on Process pause detected

 Also fencing is not triggered if membership didn't changed. So let's say
 token was lost but during gather state all nodes replied, then there is
 no change of membership and no need to fence.

 I believe your situation was:
 - one node is little overloaded
 - token lost
 - overload over
 - gather state
 - every node is alive
 - no fencing

 Regards,
   Honza

 emmanuel segura napsal(a):
 Hello Jan,

 Forget the last mail:

 Hello Jan,

 I found this problem in two hp blade system and the strange thing is the
 fencing was not triggered :(, but it's enabled


 2014-04-25 18:36 GMT+02:00 emmanuel segura emi2f...@gmail.com:

 Hello Jan,

 I found this problem in two hp blade system and the strange thing is the
 fencing was triggered :(


 2014-04-25 9:27 GMT+02:00 Jan Friesse jfrie...@redhat.com:

 Emanuel,

 emmanuel segura napsal(a):

  Hello List,

 I have this two lines in my cluster logs, somebody can help to know
 what
 this means.

 
 
 ::

 corosync [TOTEM ] Process pause detected for 577 ms, flushing
 membership
 messages.
 corosync [TOTEM ] Process pause detected for 538 ms, flushing
 membership
 messages.


 Corosync internally checks gap between member join messages. If such
 gap
 is  token/2, it means, that corosync was not scheduled to run by
 kernel
 for too long, and it should discard membership messages.

 Original intend was to detect paused process. If pause is detected,
 it's
 better to discard old membership messages and initiate new query then
 sending outdated view.

 So there are various reasons why this is triggered, but today it's
 usually VM with overloaded host machine.



  corosync [TOTEM ] A processor failed, forming new configuration.

 
 
 ::

 I know the corosync [TOTEM ] A processor failed, forming new
 configuration message is when the toten package is definitely lost.

 Thanks


 Regards,
   Honza



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org




 --
 esta es mi vida e me la vivo hasta que dios quiera






 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



 ___
 Pacemaker mailing list: Pacemaker

Re: [Pacemaker] corosync [TOTEM ] Process pause detected for 577 ms

2014-04-25 Thread Jan Friesse

Emanuel,

emmanuel segura wrote:

Hello List,

I have this two lines in my cluster logs, somebody can help to know what
this means.

::

corosync [TOTEM ] Process pause detected for 577 ms, flushing membership
messages.
corosync [TOTEM ] Process pause detected for 538 ms, flushing membership
messages.


Corosync internally checks the gap between member join messages. If such a gap
is > token/2, it means that corosync was not scheduled to run by the kernel
for too long, and it should discard the membership messages.

The original intent was to detect a paused process. If a pause is detected, it's
better to discard the old membership messages and initiate a new query than to
send an outdated view.

So there are various reasons why this is triggered, but today it's
usually a VM with an overloaded host machine.




corosync [TOTEM ] A processor failed, forming new configuration.

::

I know the corosync [TOTEM ] A processor failed, forming new
configuration message is when the toten package is definitely lost.

Thanks



Regards,
  Honza




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] corosync does not reflect the node status correctly

2014-03-31 Thread Jan Friesse

Michael,

Michael Schwartzkopff wrote:

Hi,

we just upgraded to corosync-1.4.5-2.5 from the suse build server. On one
cluster we have the problem, that corosync-objctl does not reflect the status


So if I understand it correctly, you have multiple clusters, all of
them were upgraded, and only on one of them does this bug appear?



of nodes properly. Even when the other node stops corosync we still see:

runtime.totem.mrp.srp.members.ID.status=joined



Is this consistent between nodes? I mean, do ALL nodes see the already stopped
node as joined, or do some of them see it as left?
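
For a quick cross-check, run this on every node and compare the output (a sketch using the corosync 1.x corosync-objctl tool you already mentioned):

corosync-objctl | grep runtime.totem.mrp.srp.members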


Regards,
  Honza


But the log says:

[TOTEM] A processor joined or left the membership and a new membership was
formed.

Any ideas?

Mit freundlichen Grüßen,

Michael Schwartzkopff



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Errors while compiling

2014-03-19 Thread Jan Friesse
Stephan Buchner wrote:
 Hm, i tried recompiling all three packages (libqb, corosync and
 pacemaker), using versions which have been marked stable by the gentoo
 project.
 
 I used the following versions: libqb   = 0.14.4
 corosync= 1.4.5
 pacemaker = 1.1.11
 
 Now i get this error, which seems at least related to the last one i got:
 
 
 CC corosync.lo
 corosync.c:38:27: fatal error: corosync/cmap.h: No such file or directory
 compilation terminated.
 make[2]: *** [corosync.lo] Fehler 1
 make[2]: Leaving directory
 `/opt/srccluster/pacemaker-Pacemaker-1.1.11/lib/cluster'
 make[1]: *** [all-recursive] Fehler 1
 make[1]: Leaving directory `/opt/srccluster/pacemaker-Pacemaker-1.1.11/lib'
 make: *** [core] Fehler 1
 
 
 Am i missing something here? I loosely followed this guide:

cmap is included only in corosync 2.x; it does not exist in corosync 1.4.5.
Also, libqb 0.14.4 is known to be buggy; please use 0.17.0.
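
A few quick checks you can run (a sketch; the header path assumes a default install location):

corosync -v                          # version of the installed corosync; cmap needs 2.x
pkg-config --modversion libqb        # libqb version the build will pick up
ls /usr/include/corosync/cmap.h      # only present with corosync 2.x headers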


 http://clusterlabs.org/wiki/SourceInstall
 
 Am 17.03.2014 06:11, schrieb Andrew Beekhof:
 Its looking for cmap_handle_t which will be in one of the corosync
 headers.
 What version of corosync have you got installed?

 On 15 Mar 2014, at 12:18 am, Stephan Buchner
 buch...@linux-systeme.de wrote:

 Hm, i installed libcrmcluster1-dev and libcrmcommon2-dev on my
 debian system, still the same error :/

 Am 14.03.2014 14:07, schrieb emmanuel segura:
 maybe you are missing crm dev library


 2014-03-14 13:39 GMT+01:00 Stephan Buchner buch...@linux-systeme.de:
 Hey everyone!
 I am trying to compile pacemaker from source for some time - but i
 keep getting the same errors, despite using different versions.

 I did the following to get this:

 1. ./autogen.sh
 2. ./configure --prefix=/opt/cluster/ --disable-fatal-warnings
 3. make

 After that step i always get this error:

 http://pastebin.com/eXFmhUUD

 I get this on version 1.10, as on 1.11

 Any ideas?

 -- 

 Stephan Buchner
 buch...@linux-systeme.de

 +49 201 - 29 88 319
 +49 172 - 7 222 333

 Linux-Systeme GmbH
 Langenbergerstr. 179, 45277 Essen
 www.linux-systeme.de +49 201 - 29 88 30
 Amtsgericht Essen, HRB 14729
 Geschäftsführer Jörg Hinz


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org



 -- 
 esta es mi vida e me la vivo hasta que dios quiera


 ___
 Pacemaker mailing list:
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


 Project Home:
 http://www.clusterlabs.org

 Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

 Bugs:
 http://bugs.clusterlabs.org

 -- 

 Stephan Buchner

 buch...@linux-systeme.de


 +49 201 - 29 88 319
 +49 172 - 7 222 333

 Linux-Systeme GmbH
 Langenbergerstr. 179, 45277 Essen

 www.linux-systeme.de
   +49 201 - 29 88 30
 Amtsgericht Essen, HRB 14729
 Geschäftsführer Jörg Hinz

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-14 Thread Jan Friesse
Attila Megyeri wrote:
 Hi Honza,
 
 What I also found in the log related to the freeze at 12:22:26:
 
 
 Corosync main process was not scheduled for  ... Can It be the general 
 cause of the issue?
 
 
 
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:58597-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:47943-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:47943-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:59647-[10.9.1.3]:161
 
 
 Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was 
 not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token 
 timeout increase.
 
 
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the 
 OPERATIONAL state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming 
 new configuration.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 
 2(The token was lost in the OPERATIONAL state.).
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token 
 because I am the rep.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high 
 seq received 6a8c
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for 
 ring 7dc
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:
 
 
 
 
 Regards,
 Attila
 
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 1:45 PM
 To: The Pacemaker cluster resource manager; Andrew Beekhof
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Hello,

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Thursday, March 13, 2014 10:03 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 ...


 Also can you please try to set debug: on in corosync.conf and
 paste full corosync.log then?

 I set debug to on, and did a few restarts but could not
 reproduce the issue
 yet - will post the logs as soon as I manage to reproduce.


 Perfect.

 Another option you can try to set is netmtu (1200 is usually safe).

 Finally I was able to reproduce the issue.
 I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
 (not
 when node was up again).

 The corosync log with debug on is available at:
 http://pastebin.com/kTpDqqtm


 To be honest, I had to wait much longer for this reproduction as
 before,
 even though there was no change in the corosync configuration - just
 potentially some system updates. But anyway, the issue is
 unfortunately still there.
 Previously, when this issue came, cpu was at 100% on all nodes -
 this time
 only on ctmgr, which was the DC...

 I hope you can find some useful details in the log.


 Attila,
 what seems to be interesting is

 Configuration ERRORs found during PE processing.  Please run
 crm_verify -
 L
 to identify issues.

 I'm unsure how much is this problem but I'm really not pacemaker
 expert.

 Perhaps Andrew could comment on that. Any idea?



 Anyway, I have theory what may happening and it looks like related
 with IPC (and probably not related to network). But to make sure we
 will not try fixing already fixed bug, can you please build:
 - New libqb (0.17.0). There are plenty of fixes in IPC
 - Corosync 2.3.3 (already plenty IPC fixes)
 - And maybe also newer pacemaker


 I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
 from Ubuntu package.
 I am currently building libqb 0.17.0, will update you on the results.

 In the meantime we had another freeze, which did not seem to be
 related to any restarts, but brought all coroync processes to 100%.
 Please check out the corosync.log, perhaps it is a different cause:
 http://pastebin.com/WMwzv0Rr


 In the meantime I will install the new libqb and send logs if we have
 further issues.

 Thank you very much for your help!

 Regards,
 Attila


 One more question:

 If I install libqb 0.17.0 from

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-14 Thread Jan Friesse
Attila Megyeri wrote:
 Hi Honza,
 
 What I also found in the log related to the freeze at 12:22:26:
 
 
 Corosync main process was not scheduled for  ... Can It be the general 
 cause of the issue?
 

I don't think it will cause the issue you are hitting, BUT keep in mind that
if corosync is not scheduled for a long time, the node will probably be fenced by
another node. So increasing the token timeout is vital.
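
For example, a sketch of raising it in corosync.conf on all nodes (keep your existing totem settings and just add/raise token; 10000 ms is an assumed value to tune, and corosync needs a restart afterwards):

totem {
    token: 10000
}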

Honza

 
 
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:58597-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:47943-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:47943-[10.9.1.3]:161
 Mar 13 12:22:14 ctmgr snmpd[1247]: Connection from UDP: 
 [10.9.1.5]:59647-[10.9.1.3]:161
 
 
 Mar 13 12:22:26 ctmgr corosync[3024]:   [MAIN  ] Corosync main process was 
 not scheduled for 6327.5918 ms (threshold is 4000. ms). Consider token 
 timeout increase.
 
 
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] The token was lost in the 
 OPERATIONAL state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] A processor failed, forming 
 new configuration.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering GATHER state from 
 2(The token was lost in the OPERATIONAL state.).
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Creating commit token 
 because I am the rep.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Saving state aru 6a8c high 
 seq received 6a8c
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] Storing new sequence id for 
 ring 7dc
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering COMMIT state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] got commit token
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] entering RECOVERY state.
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [0] member 10.9.1.3:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [1] member 10.9.1.41:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [2] member 10.9.1.42:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [3] member 10.9.1.71:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [4] member 10.9.1.72:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [5] member 10.9.2.11:
 Mar 13 12:22:26 ctmgr corosync[3024]:   [TOTEM ] TRANS [6] member 10.9.2.12:
 
 
 
 
 Regards,
 Attila
 
 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 -Original Message-
 From: Attila Megyeri [mailto:amegy...@minerva-soft.com]
 Sent: Thursday, March 13, 2014 1:45 PM
 To: The Pacemaker cluster resource manager; Andrew Beekhof
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Hello,

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Thursday, March 13, 2014 10:03 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 ...


 Also can you please try to set debug: on in corosync.conf and
 paste full corosync.log then?

 I set debug to on, and did a few restarts but could not
 reproduce the issue
 yet - will post the logs as soon as I manage to reproduce.


 Perfect.

 Another option you can try to set is netmtu (1200 is usually safe).

 Finally I was able to reproduce the issue.
 I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately
 (not
 when node was up again).

 The corosync log with debug on is available at:
 http://pastebin.com/kTpDqqtm


 To be honest, I had to wait much longer for this reproduction as
 before,
 even though there was no change in the corosync configuration - just
 potentially some system updates. But anyway, the issue is
 unfortunately still there.
 Previously, when this issue came, cpu was at 100% on all nodes -
 this time
 only on ctmgr, which was the DC...

 I hope you can find some useful details in the log.


 Attila,
 what seems to be interesting is

 Configuration ERRORs found during PE processing.  Please run
 crm_verify -
 L
 to identify issues.

 I'm unsure how much is this problem but I'm really not pacemaker
 expert.

 Perhaps Andrew could comment on that. Any idea?



 Anyway, I have theory what may happening and it looks like related
 with IPC (and probably not related to network). But to make sure we
 will not try fixing already fixed bug, can you please build:
 - New libqb (0.17.0). There are plenty of fixes in IPC
 - Corosync 2.3.3 (already plenty IPC fixes)
 - And maybe also newer pacemaker


 I already use Corosync 2.3.3, built from source, and libqb-dev 0.16
 from Ubuntu package.
 I am currently building libqb 0.17.0, will update you on the results.

 In the meantime we had another freeze, which did not seem to be
 related to any restarts, but brought all coroync processes to 100%.
 Please check out the corosync.log, perhaps it is a different cause:
 http://pastebin.com/WMwzv0Rr

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-13 Thread Jan Friesse
...


 Also can you please try to set debug: on in corosync.conf and paste
 full corosync.log then?

 I set debug to on, and did a few restarts but could not reproduce the issue
 yet - will post the logs as soon as I manage to reproduce.


 Perfect.

 Another option you can try to set is netmtu (1200 is usually safe).
 
 Finally I was able to reproduce the issue.
 I restarted node ctsip2 at 21:10:14, and CPU went 100% immediately (not when 
 node was up again).
 
 The corosync log with debug on is available at: http://pastebin.com/kTpDqqtm
 
 
 To be honest, I had to wait much longer for this reproduction as before, even 
 though there was no change in the corosync configuration - just potentially 
 some system updates. But anyway, the issue is unfortunately still there.
 Previously, when this issue came, cpu was at 100% on all nodes - this time 
 only on ctmgr, which was the DC...
 
 I hope you can find some useful details in the log.
 

Attila,
what seems to be interesting is

Configuration ERRORs found during PE processing.  Please run crm_verify
-L to identify issues.

I'm unsure how much of a problem this is, but I'm really not a pacemaker expert.

Anyway, I have a theory about what may be happening, and it looks like it is related
to IPC (and probably not related to the network). But to make sure we will not
try fixing an already fixed bug, can you please build:
- New libqb (0.17.0). There are plenty of fixes in IPC
- Corosync 2.3.3 (already plenty IPC fixes)
- And maybe also newer pacemaker

I know you were not very happy using hand-compiled sources, but please
give them at least a try.
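
A minimal build sketch, in the order listed above (prefixes and configure options are assumptions; adjust them to your layout):

# libqb 0.17.0
./autogen.sh && ./configure && make && sudo make install
# corosync 2.3.3
./autogen.sh && ./configure && make && sudo make install
# pacemaker (current 1.1.x)
./autogen.sh && ./configure --disable-fatal-warnings && make && sudo make install
sudo ldconfig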

Thanks,
  Honza

 Thanks,
 Attila
 
 
 

 Regards,
   Honza


 There are also a few things that might or might not be related:

 1) Whenever I want to edit the configuration with crm configure edit,

...

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri wrote:
 
 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 12 Mar 2014, at 1:54 am, Attila Megyeri amegy...@minerva-soft.com
 wrote:


 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:54 pm, Attila Megyeri amegy...@minerva-soft.com
 wrote:

 Thanks for the quick response!

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:31 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Hello,

 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and suddenly the
 crm_mon output remains static.
 When I check the cpu usage, I see that one of the cores uses 100%
 cpu, but
 cannot actually match it to either the corosync or one of the
 pacemaker processes.

 In such a case, this high CPU usage is happening on all 7 nodes.
 I have to manually go to each node, stop pacemaker, restart
 corosync, then
 start pacemeker. Stoping pacemaker and corosync does not work in
 most of the cases, usually a kill -9 is needed.

 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.

 Using udpu as transport, two rings on Gigabit ETH, rro_mode passive.

 Logs are usually flooded with CPG related messages, such as:

 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:   
 Sent 0
 CPG
 messages  (1 remaining, last=8): Try again (6)

 OR

 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (

 That is usually a symptom of corosync getting into a horribly
 confused
 state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.



 As I wrote I use Ubuntu trusty, the exact package versions are:

 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2

 Ah sorry, I seem to have missed that part.


 There are no updates available. The only option is to install from
 sources,
 but that would be very difficult to maintain and I'm not sure I would
 get rid of this issue.

 What do you recommend?

 The same thing as Lars, or switch to a distro that stays current with
 upstream (git shows 5 newer releases for that branch since it was
 released 3 years ago).
 If you do build from source, its probably best to go with v1.4.6

 Hm, I am a bit confused here. We are using 2.3.0,

 I swapped the 2 for a 1 somehow. A bit distracted, sorry.
 
 I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still the 
 same issue - after some time CPU gets to 100%, and the corosync log is 
 flooded with messages like:
 
 Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:Sent 
 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 
 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:Sent 
 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 
 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 
 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:57 [4798] ctdb2   crmd: info: crm_cs_flush:Sent 
 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:Sent 
 0 CPG messages  (48 remaining, last=3671): Try again (6)
 
 

Attila,

 Shall I try to downgrade to 1.4.6? What is the difference in that build? Or 
 where should I start troubleshooting?

First of all, the 1.x branch (flatiron) is maintained, so even though it looks like
an old version, it's quite new. It contains more or less only 

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri wrote:
 Hello Jan,
 
 Thank you very much for your help so far.
 
 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 9:51 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Attila Megyeri napsal(a):

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 12 Mar 2014, at 1:54 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:


 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:54 pm, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Thanks for the quick response!

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:31 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Hello,

 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and suddenly
 the
 crm_mon output remains static.
 When I check the cpu usage, I see that one of the cores uses
 100% cpu, but
 cannot actually match it to either the corosync or one of the
 pacemaker processes.

 In such a case, this high CPU usage is happening on all 7 nodes.
 I have to manually go to each node, stop pacemaker, restart
 corosync, then
 start pacemeker. Stoping pacemaker and corosync does not work in
 most of the cases, usually a kill -9 is needed.

 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.

 Using udpu as transport, two rings on Gigabit ETH, rro_mode
 passive.

 Logs are usually flooded with CPG related messages, such as:

 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
   Sent
 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush: 
   Sent
 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
   Sent
 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush: 
   Sent
 0
 CPG
 messages  (1 remaining, last=8): Try again (6)

 OR

 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
   Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
   Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:  
   Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (

 That is usually a symptom of corosync getting into a horribly
 confused
 state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.



 As I wrote I use Ubuntu trusty, the exact package versions are:

 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2

 Ah sorry, I seem to have missed that part.


 There are no updates available. The only option is to install from
 sources,
 but that would be very difficult to maintain and I'm not sure I
 would get rid of this issue.

 What do you recommend?

 The same thing as Lars, or switch to a distro that stays current
 with upstream (git shows 5 newer releases for that branch since it
 was released 3 years ago).
 If you do build from source, its probably best to go with v1.4.6

 Hm, I am a bit confused here. We are using 2.3.0,

 I swapped the 2 for a 1 somehow. A bit distracted, sorry.

 I upgraded all nodes to 2.3.3 and first it seemed a bit better, but still 
 the
 same issue - after some time CPU gets to 100%, and the corosync log is
 flooded with messages like:

 Mar 12 07:36:55 [4793] ctdb2cib: info: crm_cs_flush:
 Sent 0 CPG
 messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush:
 Sent 0 CPG
 messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:56 [4793] ctdb2cib: info: crm_cs_flush:
 Sent 0 CPG
 messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush:
 Sent 0 CPG
 messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:
 Sent 0 CPG
 messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:57 [4798] ctdb2   crmd: info: crm_cs_flush:
 Sent 0 CPG
 messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2cib: info: crm_cs_flush:
 Sent 0 CPG
 messages

Re: [Pacemaker] Pacemaker/corosync freeze

2014-03-12 Thread Jan Friesse
Attila Megyeri wrote:
 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 2:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Attila Megyeri napsal(a):
 Hello Jan,

 Thank you very much for your help so far.

 -Original Message-
 From: Jan Friesse [mailto:jfrie...@redhat.com]
 Sent: Wednesday, March 12, 2014 9:51 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze

 Attila Megyeri napsal(a):

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 10:27 PM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 12 Mar 2014, at 1:54 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:


 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Tuesday, March 11, 2014 12:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:54 pm, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Thanks for the quick response!

 -Original Message-
 From: Andrew Beekhof [mailto:and...@beekhof.net]
 Sent: Friday, March 07, 2014 3:48 AM
 To: The Pacemaker cluster resource manager
 Subject: Re: [Pacemaker] Pacemaker/corosync freeze


 On 7 Mar 2014, at 5:31 am, Attila Megyeri
 amegy...@minerva-soft.com
 wrote:

 Hello,

 We have a strange issue with Corosync/Pacemaker.
 From time to time, something unexpected happens and
 suddenly
 the
 crm_mon output remains static.
 When I check the cpu usage, I see that one of the cores uses
 100% cpu, but
 cannot actually match it to either the corosync or one of the
 pacemaker processes.

 In such a case, this high CPU usage is happening on all 7 nodes.
 I have to manually go to each node, stop pacemaker, restart
 corosync, then
 start pacemeker. Stoping pacemaker and corosync does not work
 in most of the cases, usually a kill -9 is needed.

 Using corosync 2.3.0, pacemaker 1.1.10 on Ubuntu trusty.

 Using udpu as transport, two rings on Gigabit ETH, rro_mode
 passive.

 Logs are usually flooded with CPG related messages, such as:

 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
 Sent
 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:49 [1316] ctsip1   crmd: info: crm_cs_flush:
 Sent
 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
 Sent
 0
 CPG
 messages  (1 remaining, last=8): Try again (6)
 Mar 06 18:10:50 [1316] ctsip1   crmd: info: crm_cs_flush:
 Sent
 0
 CPG
 messages  (1 remaining, last=8): Try again (6)

 OR

 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (
 Mar 06 17:46:24 [1341] ctdb1cib: info: crm_cs_flush:
 Sent 0
 CPG
 messages  (1 remaining, last=10933): Try again (

 That is usually a symptom of corosync getting into a horribly
 confused
 state.
 Version? Distro? Have you checked for an update?
 Odd that the user of all that CPU isn't showing up though.



 As I wrote I use Ubuntu trusty, the exact package versions are:

 corosync 2.3.0-1ubuntu5
 pacemaker 1.1.10+git20130802-1ubuntu2

 Ah sorry, I seem to have missed that part.


 There are no updates available. The only option is to install
 from sources,
 but that would be very difficult to maintain and I'm not sure I
 would get rid of this issue.

 What do you recommend?

 The same thing as Lars, or switch to a distro that stays current
 with upstream (git shows 5 newer releases for that branch since
 it was released 3 years ago).
 If you do build from source, its probably best to go with v1.4.6

 Hm, I am a bit confused here. We are using 2.3.0,

 I swapped the 2 for a 1 somehow. A bit distracted, sorry.

 I upgraded all nodes to 2.3.3 and first it seemed a bit better, but
 still the
 same issue - after some time CPU gets to 100%, and the corosync log
 is flooded with messages like:

 Mar 12 07:36:55 [4793] ctdb2    cib: info: crm_cs_flush: Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:55 [4798] ctdb2   crmd: info: crm_cs_flush: Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:56 [4793] ctdb2    cib: info: crm_cs_flush: Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:56 [4798] ctdb2   crmd: info: crm_cs_flush: Sent 0 CPG messages  (51 remaining, last=3995): Try again (6)
 Mar 12 07:36:57 [4793] ctdb2    cib: info: crm_cs_flush: Sent 0 CPG messages  (48 remaining, last=3671): Try again (6)
 Mar 12 07:36:57 [4798] ctdb2   crmd

Re: [Pacemaker] [corosync] corosync Segmentation fault.

2014-02-26 Thread Jan Friesse
Andrey,
what version of corosync and libqb are you using?

Can you please attach output from valgrind (and gdb backtrace)?

Thanks,
  Honza
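
For reference, a minimal sketch of such a valgrind run, assuming corosync is installed as /usr/sbin/corosync and run in the foreground; the exact options are only suggestions:

  # errors and the leak report end up in the per-PID log file
  valgrind --leak-check=full --track-origins=yes \
           --log-file=/tmp/corosync-valgrind.%p.log \
           /usr/sbin/corosync -f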

Andrey Groshev napsal(a):
 Hi, ALL.
 I'm a bit confused; either a package update or something I did has broken it,
 but corosync is now being killed by a segmentation fault.
 Am I right that the library libqb is not linked properly?
 
 .
 
 (gdb) n
 [New Thread 0x74b2b700 (LWP 9014)]
 1266if ((flock_err = corosync_flock (corosync_lock_file, getpid 
 ())) != COROSYNC_DONE_EXIT) {
 (gdb) n
 1280totempg_initialize (
 (gdb) n
 1284totempg_service_ready_register (
 (gdb) n
 1287totempg_groups_initialize (
 (gdb) n
 1292totempg_groups_join (
 (gdb) n
 1307schedwrk_init (
 (gdb) n
 1314qb_loop_run (corosync_poll_handle);
 (gdb) n
 
 Program received signal SIGSEGV, Segmentation fault.
 0x771e581c in free () from /lib64/libc.so.6
 (gdb) 
 ___
 discuss mailing list
 disc...@corosync.org
 http://lists.corosync.org/mailman/listinfo/discuss
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [corosync] corosync Segmentation fault.

2014-02-26 Thread Jan Friesse
Andrey,
can you please try the patch [PATCH] votequorum: Properly
initialize atb and atb_string, which I've sent to the ML (it should be
there soon)?

Thanks,
  Honza

Andrey Groshev napsal(a):
 
 
 26.02.2014, 12:11, Jan Friesse jfrie...@redhat.com:
 Andrey,
 what version of corosync and libqb are you using?

 Can you please attach output from valgrind (and gdb backtrace)?
 ,,,
 1314qb_loop_run (corosync_poll_handle);
 (gdb) n
 
 Program received signal SIGSEGV, Segmentation fault.
 0x771e581c in free () from /lib64/libc.so.6
 (gdb) bt
 #0  0x771e581c in free () from /lib64/libc.so.6
 #1  0x77fe77ec in votequorum_readconfig (runtime=value optimized 
 out) at votequorum.c:1293
 #2  0x77fe8300 in votequorum_exec_init_fn (api=value optimized out) 
 at votequorum.c:2115
 #3  0x77feeb7b in corosync_service_link_and_init 
 (corosync_api=0x78200980, service=0x78200760) at service.c:139
 #4  0x77fe4197 in votequorum_init (api=0x78200980, 
 q_set_quorate_fn=0x77fda5b0 quorum_api_set_quorum) at votequorum.c:2255
 #5  0x77fda42f in quorum_exec_init_fn (api=0x78200980) at 
 vsf_quorum.c:280
 #6  0x77feeb7b in corosync_service_link_and_init 
 (corosync_api=0x78200980, service=0x78200c40) at service.c:139
 #7  0x77feede9 in corosync_service_defaults_link_and_init 
 (corosync_api=0x78200980) at service.c:348
 #8  0x77fe9621 in main_service_ready () at main.c:978
 #9  0x77b90b0f in main_iface_change_fn (context=0x77f73010, 
 iface_addr=value optimized out, iface_no=0) at totemsrp.c:4672
 #10 0x77b8a734 in timer_function_netif_check_timeout 
 (data=0x78304f10) at totemudp.c:672
 #11 0x777289f8 in ?? () from /usr/lib64/libqb.so.0
 #12 0x77727016 in qb_loop_run () from /usr/lib64/libqb.so.0
 #13 0x77fea930 in main (argc=value optimized out, argv=value 
 optimized out, envp=value optimized out) at main.c:1314
 
 Unfortunately, I have not yet used a valgrind. 
 Or hangs, or fast end with :
 
 # valgrind /usr/sbin/corosync -f
 ==2137== Memcheck, a memory error detector
 ==2137== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
 ==2137== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
 ==2137== Command: /usr/sbin/corosync -f
 ==2137== 
 ==2137== 
 ==2137== HEAP SUMMARY:
 ==2137== in use at exit: 29,876 bytes in 193 blocks
 ==2137==   total heap usage: 890 allocs, 697 frees, 100,824 bytes allocated
 ==2137== 
 ==2137== LEAK SUMMARY:
 ==2137==definitely lost: 0 bytes in 0 blocks
 ==2137==indirectly lost: 0 bytes in 0 blocks
 ==2137==  possibly lost: 539 bytes in 22 blocks
 ==2137==still reachable: 29,337 bytes in 171 blocks
 ==2137== suppressed: 0 bytes in 0 blocks
 ==2137== Rerun with --leak-check=full to see details of leaked memory
 ==2137== 
 ==2137== For counts of detected and suppressed errors, rerun with: -v
 ==2137== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 12 from 6)
 
 Now read manual about valgrind.
 

 Thanks,
   Honza

 Andrey Groshev napsal(a):

  Hi, ALL.
  Something I already confused, or after updating any package or myself 
 something broke,
  but call corosycn killed by segmentation fault signal.
  I correctly understood that does not link the library libqb ?

  .

  (gdb) n
  [New Thread 0x74b2b700 (LWP 9014)]
  1266if ((flock_err = corosync_flock (corosync_lock_file, 
 getpid ())) != COROSYNC_DONE_EXIT) {
  (gdb) n
  1280totempg_initialize (
  (gdb) n
  1284totempg_service_ready_register (
  (gdb) n
  1287totempg_groups_initialize (
  (gdb) n
  1292totempg_groups_join (
  (gdb) n
  1307schedwrk_init (
  (gdb) n
  1314qb_loop_run (corosync_poll_handle);
  (gdb) n

  Program received signal SIGSEGV, Segmentation fault.
  0x771e581c in free () from /lib64/libc.so.6
  (gdb)
  ___
  discuss mailing list
  disc...@corosync.org
  http://lists.corosync.org/mailman/listinfo/discuss


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [corosync] corosync Segmentation fault.

2014-02-26 Thread Jan Friesse
Andrey Groshev napsal(a):
 
 
 26.02.2014, 16:11, Jan Friesse jfrie...@redhat.com:
 Andrey,
 can you please give a try to patch [PATCH] votequorum: Properly
 initialize atb and atb_string which I've sent to ML (it should be there
 soon)?
 
 Yes. Service is running. Thanks.
 
 # corosync-quorumtool -l
 
 Membership information
 --
 Nodeid  Votes Name
  172793104  1 dev-cluster2-node1 (local)
 
 
 Continue tests.
 In messages logs I see
 
 Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15480]: [error] trying to recv 
 chunk of size 1024 but 4030249 available
 Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15497]: [error] trying to recv 
 chunk of size 1024 but 40489 available
 Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15514]: [error] Corrupt 
 blackbox: File header hash (436212587) does not match calculated hash 
 (-1660939413)
 Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15531]: [error] Corrupt 
 blackbox: File header hash (8328043) does not match calculated hash 
 (-905964693)
 Feb 26 17:33:55 dev-cluster2-node1 qb_blackbox[15548]: [error] Corrupt 
 blackbox: File header hash (12651) does not match calculated hash (21972)
 .
 
 At this time build libqb. It tests or real errors?
 

Looks more like build tests.

Honza

 
 Thanks,
   Honza

 Andrey Groshev napsal(a):

  26.02.2014, 12:11, Jan Friesse jfrie...@redhat.com:
  Andrey,
  what version of corosync and libqb are you using?

  Can you please attach output from valgrind (and gdb backtrace)?
  ,,,
  1314qb_loop_run (corosync_poll_handle);
  (gdb) n

  Program received signal SIGSEGV, Segmentation fault.
  0x771e581c in free () from /lib64/libc.so.6
  (gdb) bt
  #0  0x771e581c in free () from /lib64/libc.so.6
  #1  0x77fe77ec in votequorum_readconfig (runtime=value optimized 
 out) at votequorum.c:1293
  #2  0x77fe8300 in votequorum_exec_init_fn (api=value optimized 
 out) at votequorum.c:2115
  #3  0x77feeb7b in corosync_service_link_and_init 
 (corosync_api=0x78200980, service=0x78200760) at service.c:139
  #4  0x77fe4197 in votequorum_init (api=0x78200980, 
 q_set_quorate_fn=0x77fda5b0 quorum_api_set_quorum) at 
 votequorum.c:2255
  #5  0x77fda42f in quorum_exec_init_fn (api=0x78200980) at 
 vsf_quorum.c:280
  #6  0x77feeb7b in corosync_service_link_and_init 
 (corosync_api=0x78200980, service=0x78200c40) at service.c:139
  #7  0x77feede9 in corosync_service_defaults_link_and_init 
 (corosync_api=0x78200980) at service.c:348
  #8  0x77fe9621 in main_service_ready () at main.c:978
  #9  0x77b90b0f in main_iface_change_fn (context=0x77f73010, 
 iface_addr=value optimized out, iface_no=0) at totemsrp.c:4672
  #10 0x77b8a734 in timer_function_netif_check_timeout 
 (data=0x78304f10) at totemudp.c:672
  #11 0x777289f8 in ?? () from /usr/lib64/libqb.so.0
  #12 0x77727016 in qb_loop_run () from /usr/lib64/libqb.so.0
  #13 0x77fea930 in main (argc=value optimized out, argv=value 
 optimized out, envp=value optimized out) at main.c:1314

  Unfortunately, I have not yet used a valgrind.
  Or hangs, or fast end with :

  # valgrind /usr/sbin/corosync -f
  ==2137== Memcheck, a memory error detector
  ==2137== Copyright (C) 2002-2012, and GNU GPL'd, by Julian Seward et al.
  ==2137== Using Valgrind-3.8.1 and LibVEX; rerun with -h for copyright info
  ==2137== Command: /usr/sbin/corosync -f
  ==2137==
  ==2137==
  ==2137== HEAP SUMMARY:
  ==2137== in use at exit: 29,876 bytes in 193 blocks
  ==2137==   total heap usage: 890 allocs, 697 frees, 100,824 bytes allocated
  ==2137==
  ==2137== LEAK SUMMARY:
  ==2137==definitely lost: 0 bytes in 0 blocks
  ==2137==indirectly lost: 0 bytes in 0 blocks
  ==2137==  possibly lost: 539 bytes in 22 blocks
  ==2137==still reachable: 29,337 bytes in 171 blocks
  ==2137== suppressed: 0 bytes in 0 blocks
  ==2137== Rerun with --leak-check=full to see details of leaked memory
  ==2137==
  ==2137== For counts of detected and suppressed errors, rerun with: -v
  ==2137== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 12 from 6)

  Now read manual about valgrind.
  Thanks,
Honza

  Andrey Groshev napsal(a):
   Hi, ALL.
   Something I already confused, or after updating any package or myself 
 something broke,
   but call corosycn killed by segmentation fault signal.
   I correctly understood that does not link the library libqb ?

   .

   (gdb) n
   [New Thread 0x74b2b700 (LWP 9014)]
   1266if ((flock_err = corosync_flock (corosync_lock_file, 
 getpid ())) != COROSYNC_DONE_EXIT) {
   (gdb) n
   1280totempg_initialize (
   (gdb) n
   1284totempg_service_ready_register (
   (gdb) n
   1287totempg_groups_initialize (
   (gdb) n
   1292totempg_groups_join (
   (gdb) n
   1307schedwrk_init

Re: [Pacemaker] Multicast pitfalls? corosync [TOTEM ] Retransmit List:

2014-02-14 Thread Jan Friesse

Beo,
Are you experiencing a cluster split? If the answer is no, then you don't
need to do anything; maybe the network buffer is just full. But if the
answer is yes, try reducing the MTU size (netmtu in the configuration) to
a value like 1000.


Regards,
  Honza
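
For illustration, netmtu belongs in the totem section of corosync.conf; a minimal sketch, where only the netmtu line is the suggested change and everything else stands for the existing configuration:

  totem {
          version: 2
          # smaller totem packets, as suggested above
          netmtu: 1000
          # ... keep the existing interface/transport settings ...
  }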

Beo Banks napsal(a):

Hi,

i have a fresh 2 node cluster (kvm host1 - guest = nodeA | kvm host2 -
guest = NodeB) and it seems to work but from time to time i have a lot of
errors like

Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 196 198 184 185 186 187
188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 197 199
Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 197 199 184 185 186 187
188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 196 198
Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 196 198 184 185 186 187
188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 197 199
Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 197 199 184 185 186 187
188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 196 198
Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 196 198 184 185 186 187
188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 197 199
Feb 13 13:41:04 corosync [TOTEM ] Retransmit List: 197 199 184 185 186 187
188 189 18a 18b 18c 18d 18e 18f 190 191 192 193 194 195 196 198
I used the newest RHEL 6.5 version.

I have also already tried to solve the issue with
echo 1 > /sys/class/net/virbr0/bridge/multicast_querier (on the host system)
but no luck...

I have disabled iptables and SELinux... same issue.

How can I solve it?

thanks beo



___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org




___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again

2013-12-09 Thread Jan Friesse
Brian J. Murrell (brian) napsal(a):
 I seem to have another instance where pacemaker fails to exit at the end
 of a shutdown.  Here's the log from the start of the service pacemaker
 stop:
 
 Dec  3 13:00:39 wtm-60vm8 crmd[14076]:   notice: do_state_transition: State 
 transition S_POLICY_ENGINE - S_TRANSITION_ENGINE [ input=I_PE_SUCCESS 
 cause=C_IPC_MESSAGE origin=handle_response ]
 Dec  3 13:00:39 wtm-60vm8 crmd[14076]: info: do_te_invoke: Processing 
 graph 19 (ref=pe_calc-dc-1386093636-83) derived from 
 /var/lib/pengine/pe-input-40.bz2

...

 Dec  3 13:05:08 wtm-60vm8 pacemakerd[14067]:error: send_cpg_message: 
 Sending message via cpg FAILED: (rc=6) Try again
 Dec  3 13:05:08 wtm-60vm8 pacemakerd[14067]:   notice: pcmk_shutdown_worker: 
 Shutdown complete
 Dec  3 13:05:08 wtm-60vm8 pacemakerd[14067]: info: main: Exiting 
 pacemakerd
 
 These types of shutdown failure issues seem to always end up with the series 
 of:
 
 error: send_cpg_message: Sending message via cpg FAILED: (rc=6) Try again
 
 Even though the above messages seem to indicate that pacemaker did
 finally exit it did not as can be seen looking at the process table:
 
 14032 ?Ssl0:01 corosync
 14067 ?S  0:00 pacemakerd
 14071 ?Ss 0:00  \_ /usr/libexec/pacemaker/cib
 
 So what does this sending message via cpg FAILED: (rc=6) mean exactly?
 

Error 6 means try again. This happens either when corosync is
overloaded or when it is creating a new membership. Please take a look at
/var/log/cluster/corosync.log to see if there is anything strange there
(and make sure you have the newest corosync).

Regards,
  Honza
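
If it helps, a quick way to scan that log for the usual suspects; the path is the default cman/corosync log location and may differ on other setups:

  # membership changes, retransmits and cpg flush messages around the hang
  grep -E 'Retransmit|membership|crm_cs_flush' /var/log/cluster/corosync.log | tail -n 100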

 Or any other ideas what happened to this shutdown to cause it to fail/hang 
 ultimately?
 
 Cheers,
 b.
 
 
 
 
 
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Network outage debugging

2013-11-13 Thread Jan Friesse
Andrew Beekhof napsal(a):
 
 On 13 Nov 2013, at 11:49 am, Sean Lutner s...@rentul.net wrote:
 
 
 
 On Nov 12, 2013, at 7:33 PM, Andrew Beekhof
 and...@beekhof.net wrote:
 
 
 On 13 Nov 2013, at 11:22 am, Sean Lutner s...@rentul.net
 wrote:
 
 
 
 On Nov 12, 2013, at 6:01 PM, Andrew Beekhof
 and...@beekhof.net wrote:
 
 
 On 13 Nov 2013, at 6:10 am, Sean Lutner s...@rentul.net
 wrote:
 
 The folks testing the cluster I've been building have run
 a script which blocks all traffic except SSH on one node
 of the cluster for 15 seconds to mimic a network failure.
 During this time, the network being down seems to cause
 some odd behavior from pacemaker resulting in it dying.
 
 The cluster is two nodes and running four custom
 resources on EC2 instances. The OS is CentOS 6.4 with the
 config below:
 
 I've attached the /var/log/messages and
 /var/log/cluster/corosync.log from the time period during
 the test. I've having some difficulty in piecing together
 what happened and am hoping someone can shed some light
 on the problem. Any indications why pacemaker is dying on
 that node?
 
 Because corosync is dying underneath it:
 
  Nov 09 14:51:49 [942] ip-10-50-3-251    cib:    error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
  Nov 09 14:51:49 [942] ip-10-50-3-251    cib:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
  Nov 09 14:51:49 [942] ip-10-50-3-251    cib:    error: cib_ais_destroy: Corosync connection lost!  Exiting.
  Nov 09 14:51:49 [942] ip-10-50-3-251    cib:     info: terminate_cib: cib_ais_destroy: Exiting fast...
 
 Is that the expected behavior?
 
 It is expected behaviour when corosync dies.  Ideally corosync
 wouldn't die though.
 
 What other debugging can I do to try to find out why corosync
 died?
 
 There are various logging setting that may help. CC'ing Jan to see
 if he has any suggestions.
 

If corosync really died corosync-fplay output (right after corosync
death) and coredump are most useful.

Regards,
  Honza

 
 Thanks
 
 
 Is it because the DC was the other node?
 
 No.
 
 
 I did notice that there was an attempted fence operation but
 it didn't look successful.
 
 
 
 
 
 [root@ip-10-50-3-122 ~]# pcs config Corosync Nodes:
 
 Pacemaker Nodes: ip-10-50-3-122 ip-10-50-3-251
 
 Resources: Resource: ClusterEIP_54.215.143.166
 (provider=pacemaker type=EIP class=ocf) Attributes:
 first_network_interface_id=eni-e4e0b68c
 second_network_interface_id=eni-35f9af5d
 first_private_ip=10.50.3.191 second_private_ip=10.50.3.91
 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
  Operations: monitor interval=5s Clone:
 EIP-AND-VARNISH-clone Group: EIP-AND-VARNISH Resource:
 Varnish (provider=redhat type=varnish.sh class=ocf) 
 Operations: monitor interval=5s Resource: Varnishlog
 (provider=redhat type=varnishlog.sh class=ocf) 
 Operations: monitor interval=5s Resource: Varnishncsa
 (provider=redhat type=varnishncsa.sh class=ocf) 
 Operations: monitor interval=5s Resource: ec2-fencing
 (type=fence_ec2 class=stonith) Attributes:
 ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list
 pcmk_host_list=HA01 HA02 Operations: monitor
 start-delay=30s interval=0 timeout=150s
 
 Location Constraints: Ordering Constraints: 
 ClusterEIP_54.215.143.166 then Varnish Varnish then
 Varnishlog Varnishlog then Varnishncsa Colocation
 Constraints: Varnish with ClusterEIP_54.215.143.166 
 Varnishlog with Varnish Varnishncsa with Varnishlog
 
 Cluster Properties: dc-version: 1.1.8-7.el6-394e906 
 cluster-infrastructure: cman last-lrm-refresh:
 1384196963 no-quorum-policy: ignore stonith-enabled:
 true
 
 net-failure-messages-110913.outnet-failure-corosync-110913.out

 
___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
 Bugs: http://bugs.clusterlabs.org
 
 ___ Pacemaker
 mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf 
 Bugs: http://bugs.clusterlabs.org
 
 ___ Pacemaker
 mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
 http://bugs.clusterlabs.org
 
 ___ Pacemaker
 mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker
 
 Project Home: http://www.clusterlabs.org Getting started:
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs:
 

Re: [Pacemaker] Network outage debugging

2013-11-13 Thread Jan Friesse
Sean Lutner napsal(a):
 
 On Nov 13, 2013, at 3:15 AM, Jan Friesse jfrie...@redhat.com wrote:
 
 Andrew Beekhof napsal(a):

 On 13 Nov 2013, at 11:49 am, Sean Lutner s...@rentul.net wrote:



 On Nov 12, 2013, at 7:33 PM, Andrew Beekhof
 and...@beekhof.net wrote:


 On 13 Nov 2013, at 11:22 am, Sean Lutner s...@rentul.net
 wrote:



 On Nov 12, 2013, at 6:01 PM, Andrew Beekhof
 and...@beekhof.net wrote:


 On 13 Nov 2013, at 6:10 am, Sean Lutner s...@rentul.net
 wrote:

 The folks testing the cluster I've been building have run
 a script which blocks all traffic except SSH on one node
 of the cluster for 15 seconds to mimic a network failure.
 During this time, the network being down seems to cause
 some odd behavior from pacemaker resulting in it dying.

 The cluster is two nodes and running four custom
 resources on EC2 instances. The OS is CentOS 6.4 with the
 config below:

 I've attached the /var/log/messages and
 /var/log/cluster/corosync.log from the time period during
 the test. I've having some difficulty in piecing together
 what happened and am hoping someone can shed some light
 on the problem. Any indications why pacemaker is dying on
 that node?

 Because corosync is dying underneath it:

  Nov 09 14:51:49 [942] ip-10-50-3-251    cib:    error: send_ais_text: Sending message 28 via cpg: FAILED (rc=2): Library error: Connection timed out (110)
  Nov 09 14:51:49 [942] ip-10-50-3-251    cib:    error: pcmk_cpg_dispatch: Connection to the CPG API failed: 2
  Nov 09 14:51:49 [942] ip-10-50-3-251    cib:    error: cib_ais_destroy: Corosync connection lost!  Exiting.
  Nov 09 14:51:49 [942] ip-10-50-3-251    cib:     info: terminate_cib: cib_ais_destroy: Exiting fast...

 Is that the expected behavior?

 It is expected behaviour when corosync dies.  Ideally corosync
 wouldn't die though.

 What other debugging can I do to try to find out why corosync
 died?

 There are various logging setting that may help. CC'ing Jan to see
 if he has any suggestions.


 If corosync really died corosync-fplay output (right after corosync
 death) and coredump are most useful.

 Regards,
  Honza
 
 So the process to collect this would be:
 
 - Run the test
 - Watch the logs for corosync to die

 - Run corosync-fplay and capture the output (will corosync-fplay > file.out suffice?)

Yes. Usually the file is quite large, so gzip/xz is a good idea.

 - Capture a core dump from corosync 
 
 How do I capture the core dump? Is it something that has to be enabled in the 
 /etc/corosync/corosync.conf file first and then run the tests? I've not done 
 this in the past.

This really depends. Do you have abrt enabled? If so, the core is processed
via abrt. (One way to find out whether abrt is running is to look at the
kernel.core_pattern sysctl; with abrt it contains something other than the
classic value core.)

If you do not have abrt enabled, you must make sure core dumps are enabled.
When executing corosync via cman, this should happen automatically (the
start_global function does ulimit -c unlimited). If you are running corosync
itself, create the file /etc/default/corosync with the content
ulimit -c unlimited.
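
A short sketch of those two steps; the core_pattern value varies by distribution, and the echo line only applies to standalone corosync:

  # with abrt enabled this prints a pipe to an abrt hook instead of the plain value core
  sysctl kernel.core_pattern

  # standalone corosync: do not limit core dump size
  echo 'ulimit -c unlimited' > /etc/default/corosync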

Coredumps are stored in /var/lib/corosync/core.* (you may already have some
of them there, so just take a look).

Now, please install the corosynclib-devel package and follow
http://stackoverflow.com/questions/5115613/core-dump-file-analysis

The important part is to execute bt (or, even better, thread apply all bt)
and send the output of that command.
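
A minimal session along those lines; the PID in the core file name is a placeholder:

  # load the binary together with its core file
  gdb /usr/sbin/corosync /var/lib/corosync/core.<PID>
  (gdb) thread apply all bt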

Regards,
  Honza


 Thanks
 


 Thanks


 Is it because the DC was the other node?

 No.


 I did notice that there was an attempted fence operation but
 it didn't look successful.





 [root@ip-10-50-3-122 ~]# pcs config Corosync Nodes:

 Pacemaker Nodes: ip-10-50-3-122 ip-10-50-3-251

 Resources: Resource: ClusterEIP_54.215.143.166
 (provider=pacemaker type=EIP class=ocf) Attributes:
 first_network_interface_id=eni-e4e0b68c
 second_network_interface_id=eni-35f9af5d
 first_private_ip=10.50.3.191 second_private_ip=10.50.3.91
 eip=54.215.143.166 alloc_id=eipalloc-376c3c5f interval=5s
 Operations: monitor interval=5s Clone:
 EIP-AND-VARNISH-clone Group: EIP-AND-VARNISH Resource:
 Varnish (provider=redhat type=varnish.sh class=ocf) 
 Operations: monitor interval=5s Resource: Varnishlog
 (provider=redhat type=varnishlog.sh class=ocf) 
 Operations: monitor interval=5s Resource: Varnishncsa
 (provider=redhat type=varnishncsa.sh class=ocf) 
 Operations: monitor interval=5s Resource: ec2-fencing
 (type=fence_ec2 class=stonith) Attributes:
 ec2-home=/opt/ec2-api-tools pcmk_host_check=static-list
 pcmk_host_list=HA01 HA02 Operations: monitor
 start-delay=30s interval=0 timeout=150s

 Location Constraints: Ordering Constraints: 
 ClusterEIP_54.215.143.166 then Varnish Varnish then
 Varnishlog Varnishlog then Varnishncsa Colocation
 Constraints: Varnish with ClusterEIP_54.215.143.166 
 Varnishlog with Varnish Varnishncsa with Varnishlog

 Cluster Properties: dc-version: 1.1.8-7.el6-394e906 
 cluster

Re: [Pacemaker] Simple installation Pacemaker + CMAN + fence-agents

2013-11-10 Thread Jan Friesse
Andrew Beekhof napsal(a):
 Something seems very wrong with this at the corosync level.
 Even fenced and the dlm are having issues.
 
 Jan: Could this be firewall related?

Yes. This can be either a firewall or a multicast issue. I would recommend
turning the firewall off completely (for testing). If this doesn't help, try
omping to test multicast.

Honza
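
A sketch of the multicast test; 10.0.0.34/10.0.0.35 are the ring addresses from the log quoted below, and the same command has to be started on every node at roughly the same time:

  # continuous multicast replies mean multicast works;
  # unicast-only replies point at a multicast or firewall problem
  omping 10.0.0.34 10.0.0.35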

 
 On 27 Sep 2013, at 10:44 pm, Bartłomiej Wójcik 
 bartlomiej.woj...@turbineam.com wrote:
 
 W dniu 2013-09-27 04:26, Andrew Beekhof pisze:
 On 26/09/2013, at 8:35 PM, Bartłomiej Wójcik 
 bartlomiej.woj...@turbineam.com
  wrote:


 Hello,

 I install Pacemaker in accordance with 
 http://clusterlabs.org/quickstart-ubuntu.html
  on Ubuntu 13.04 two nodes changing only the IP addresses.

 /etc/cluster/cluster.conf:

 <?xml version="1.0"?>
 <cluster config_version="1" name="pacemaker1">
   <logging debug="off"/>
   <clusternodes>
     <clusternode name="fmpgpool4" nodeid="1">
       <fence>
         <method name="pcmk-redirect">
           <device name="pcmk" port="fmpgpool4"/>
         </method>
       </fence>
     </clusternode>
     <clusternode name="fmpgpool5" nodeid="2">
       <fence>
         <method name="pcmk-redirect">
           <device name="pcmk" port="fmpgpool5"/>
         </method>
       </fence>
     </clusternode>
   </clusternodes>
   <fencedevices>
     <fencedevice name="pcmk" agent="fence_pcmk"/>
   </fencedevices>
 </cluster>
 

 gets only the server: 
ps -ef|grep pacemaker


 pacemakerd 

 What do the logs from pacemakerd say?



and nothing more


 I try to do:
crm configure property stonith-enabled=false

 and gets:
Signon to CIB failed: connection failed
Init failed, could not perform requested operations
ERROR: cannot parse xml: no element found: line 1, column 0
ERROR: No CIB!


 I don't know what could be wrong.


 Regards!



 ___
 Pacemaker mailing list: 
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


 Project Home: 
 http://www.clusterlabs.org

 Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

 Bugs: 
 http://bugs.clusterlabs.org


 ___
 Pacemaker mailing list: 
 Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker


 Project Home: 
 http://www.clusterlabs.org

 Getting started: 
 http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf

 Bugs: 
 http://bugs.clusterlabs.org

 Hello,

 corosync.log:

 Sep 26 11:14:50 corosync [MAIN  ] Corosync Cluster Engine ('1.4.4'): started 
 and ready to provide service.
 Sep 26 11:14:50 corosync [MAIN  ] Corosync built-in features: nss
 Sep 26 11:14:50 corosync [MAIN  ] Successfully read config from 
 /etc/cluster/cluster.conf
 Sep 26 11:14:50 corosync [MAIN  ] Successfully parsed cman config
 Sep 26 11:14:50 corosync [MAIN  ] Successfully configured openais services 
 to load
 Sep 26 11:14:50 corosync [TOTEM ] Initializing transport (UDP/IP Multicast).
 Sep 26 11:14:50 corosync [TOTEM ] Initializing transmit/receive security: 
 libtomcrypt SOBER128/SHA1HMAC (mode 0).
 Sep 26 11:14:50 corosync [TOTEM ] The network interface [10.0.0.34] is now 
 up.
 Sep 26 11:14:50 corosync [QUORUM] Using quorum provider quorum_cman
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: corosync cluster 
 quorum service v0.1
 Sep 26 11:14:50 corosync [CMAN  ] CMAN 3.1.8 (built Jan 17 2013 06:24:33) 
 started
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: corosync CMAN 
 membership service 2.90
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: openais cluster 
 membership service B.01.01
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: openais event 
 service B.01.01
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: openais checkpoint 
 service B.01.01
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: openais message 
 service B.03.01
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: openais distributed 
 locking service B.03.01
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: openais timer 
 service A.01.01
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: corosync extended 
 virtual synchrony service
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: corosync 
 configuration service
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: corosync cluster 
 closed process group service v1.01
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: corosync cluster 
 config database access v1.01
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: corosync profile 
 loading service
 Sep 26 11:14:50 corosync [QUORUM] Using quorum provider quorum_cman
 Sep 26 11:14:50 corosync [SERV  ] Service engine loaded: corosync cluster 
 quorum service v0.1
 Sep 26 11:14:56 corosync [CLM   ] Members Left:
 Sep 26 11:14:56 corosync [CLM   ] Members Joined:
 Sep 26 11:14:56 corosync [CLM   ]   r(0) ip(10.0.0.35)
 Sep 26 11:14:56 corosync [TOTEM ] A processor joined or left the membership 
 and a new membership was formed.
 Set r/w permissions for uid=108, gid=0 on /var/log/cluster/corosync.log

Re: [Pacemaker] Could not initialize corosync configuration API error 2

2013-10-31 Thread Jan Friesse
Andrew,
this problem was already discussed on corosync-ml.

Andrew Beekhof napsal(a):
 Jan: not sure if you're on the pacemaker list
 
 On 29 Oct 2013, at 6:43 pm, Bauer, Stefan (IZLBW Extern) 
 stefan.ba...@iz.bwl.de wrote:
 
 Dear Developers/Users,
  
 we’re using Pacemaker 1.1.7 and Corosync Cluster Engine 1.4.2 with Debian 6 
 and a recent vanilla Kernel (3.10).
  
 On quite a lot of our clusters we can not check the ring status anymore:
  
 corosync-cfgtool –s returns:
  
 Could not initialize corosync configuration API error 2
  
 A reboot is fixing the problem.
  
 Even though the status is not returned, i see traffic on the ring interfaces 
 and the cluster is operational.
  
 We’re using rrp_mode: active with 2 ring interfaces with multicast.
  
 Is this a known problem?
 
 Not that I know of.  CC'ing Jan (corosync maintainer)

Please try upgrading from 1.4.2 to 1.4.6. There are about 105 patches and
(according to git) 83 files changed, 2623 insertions(+), 652 deletions(-).
There are no new features, only fixes.
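
A quick way to confirm which version is installed after the upgrade (a sketch; on Debian the package query would differ if corosync was built from source):

  # version of the installed corosync binary
  corosync -v
  # packaged version, if installed via dpkg
  dpkg -l corosync | tail -n 1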

 
 Does a workaround exist to not force us to reboot the machines regularly ?
  
 Any help is greatly appreciated.
  
 Regards
  
 Stefan Bauer
  
 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 


Regards,
  Honza

___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker 1.1.8 and corosync's cpg service?

2013-05-22 Thread Jan Friesse
Mike,
did you enter the local node in the nodelist? That may explain the
behavior you were describing.

Honza

Mike Edwards napsal(a):
 On Tue, May 21, 2013 at 11:15:56AM +1000, Andrew Beekhof babbled thus:
 cpg_join() is returning CS_ERR_TRY_AGAIN here.

 Jan: Any idea why this might happen?  Thats a fair time to be blocked for.
 
 Looks like the problem was with the udpu transport.  Switching to udp
 let pacemaker start.
 
 I've also noticed that multicast fails to work in this environment,
 though whether the issue lies with our switches, Vmware, or CentOS 6
 itself, I'm unsure as of yet.
 
 Thanks Andrew.
 
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker 1.1.8 and corosync's cpg service?

2013-05-22 Thread Jan Friesse
Mike Edwards napsal(a):
 Which would be the recommended trqansport?  I'm not tied to any
 particular method.
 

As long as UDP (multicast) works for you, it is the better solution (better
tested, faster, ...). UDPU is targeted at deployments where multicast is a
problem.

Regards,
  Honza
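
For illustration, a minimal multicast interface block; bindnetaddr and mcastaddr here are placeholders to adapt to the local network:

  totem {
          version: 2
          # no transport line needed: UDP multicast is the default
          interface {
                  ringnumber: 0
                  bindnetaddr: 10.10.23.0
                  mcastaddr: 239.255.1.1
                  mcastport: 5405
          }
  }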

 
 On Wed, May 22, 2013 at 10:01:37AM +1000, Andrew Beekhof babbled thus:
 I think nodelist only works for corosync 2.x
 So if you want to use udpu you might need to look up the corosync 1.x syntax.
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] Pacemaker 1.1.8 and corosync's cpg service?

2013-05-22 Thread Jan Friesse
Actually,
I've reviewed that config file again and it looks like you are using
corosync 1.x. There, nodelist is really not supported; what is supported is
the member object inside the interface section (see
corosync.conf.example.udpu). For corosync 2.x, the member object inside the
interface object also works, but it is internally converted to the
recommended nodelist form (which is what you sent).

Regards,
  Honza
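
To make the two syntaxes concrete, a sketch of both forms; the 10.10.23.x addresses follow the example in this thread and the node IDs are illustrative:

  # corosync 1.x with udpu: member entries inside the interface section
  totem {
          version: 2
          transport: udpu
          interface {
                  ringnumber: 0
                  bindnetaddr: 10.10.23.0
                  member {
                          memberaddr: 10.10.23.50
                  }
                  member {
                          memberaddr: 10.10.23.51
                  }
          }
  }

  # corosync 2.x: the same membership expressed as a nodelist
  nodelist {
          node {
                  ring0_addr: 10.10.23.50
                  nodeid: 1
          }
          node {
                  ring0_addr: 10.10.23.51
                  nodeid: 2
          }
  }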

Mike Edwards napsal(a):
 Yep.  The config I pasted has the bindnetaddr set to 10.10.23.50, which
 also happens to be defined as node 1.
 
 
 On Wed, May 22, 2013 at 09:28:13AM +0200, Jan Friesse babbled thus:
 Mike,
 did you entered local node in nodelist? Because this may explain
 behavior you were describing.

 Honza
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [Openais] Hawk 0.5.2 Debian packages

2013-02-26 Thread Jan Friesse
Great news!

Regards,
  Honza

Charles Williams napsal(a):
 Hey all,
 
 I recently got a chance to finally build Debian packages for the
 0.5.2 version of ClusterLabs Hawk GUI. These are Squeeze packages
 ATM (Wheezy to come next week dependent upon testing of the
 current packages) and I am looking for people interested in
 testing.
 
 If so. just head over to 
 http://wiki.itadmins.net/doku.php?id=high_availability:hawk0.5.2
 
 if you have any problems or such just let me know. I would like to
 be able to get Wheezy packages finished in the next couple of
 weeks.
 
 Thanks for your time, Chuck 
 ___ Openais mailing
 list open...@lists.linux-foundation.org 
 https://lists.linuxfoundation.org/mailman/listinfo/openais
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [corosync] Corosync memory usage rising

2013-02-04 Thread Jan Friesse
Andrew Beekhof napsal(a):
 On Thu, Jan 31, 2013 at 8:10 AM, Yves Trudeau y.trud...@videotron.ca wrote:
 Hi,
Is there any known memory leak issue corosync 1.4.1.  I have a setup here
 where corosync eats memory at a few kB a minute:

In 1.4.1 itself, yes, for sure. But it looks like you are using 1.4.1-7
(EL 6.3.z), and I must say no, there is no known bug like this.

Are you running pacemaker (and if so, the plugin or the CPG version)? Are
any OpenAIS services loaded? Is it plain corosync, or corosync executed via
cman?

Honza
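
A couple of quick checks that answer those questions; the log path assumes the cman-style setup and may differ here:

  # which service engines did this corosync load (pacemaker plugin, openais, ...)?
  grep 'Service engine loaded' /var/log/cluster/corosync.log

  # started standalone or via cman?
  service cman status
  service corosync status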


 [root@mys002 mysql]# while [ 1 ]; do ps faxu | grep corosync | grep -v grep;
 sleep 60; done
 root 11071  0.2  0.0 624256  8840 ?Ssl  09:14   0:02 corosync
 root 11071  0.2  0.0 624344  9144 ?Ssl  09:14   0:02 corosync
 root 11071  0.2  0.0 624344  9424 ?Ssl  09:14   0:02 corosync

 It goes on like that until no more memory which is still a long time.
 Another has corosync running for a long time:

 [root@mys001 mysql]# ps faxu | grep corosync | grep -v grep
 root 15735  0.2 21.5 4038664 3429592 ? Ssl   2012 184:19 corosync

 which is nearly 3.4GB.
 
 Holy heck!
 Bouncing to the corosync ML for comment.
 

 [root@mys002 mysql]# rpm -qa | grep -i coro
 corosynclib-1.4.1-7.el6_3.1.x86_64
 corosync-1.4.1-7.el6_3.1.x86_64
 [root@mys002 mysql]# uname -a
 Linux mys002 2.6.32-220.el6.x86_64 #1 SMP Tue Dec 6 19:48:22 GMT 2011 x86_64
 x86_64 x86_64 GNU/Linux

 looking at smaps of the process, I found this:

 020b6000-d2b34000 rw-p  00:00 0
 Size:3418616 kB
 Rss: 3417756 kB
 Pss: 3417756 kB
 Shared_Clean:  0 kB
 Shared_Dirty:  0 kB
 Private_Clean: 0 kB
 Private_Dirty:   3417756 kB
 Referenced:  3417064 kB
 Anonymous:   3417756 kB
 AnonHugePages:   3416064 kB
 Swap:  0 kB
 KernelPageSize:4 kB
 MMUPageSize:   4 kB


 this setup is using udpu

 totem {
 version: 2
 secauth: on
 threads: 0

  window_size: 5
  max_messages: 5
  netmtu: 1000

  token: 5000
  join: 1000
  consensus: 5000

 interface {
  member {
 memberaddr: 10.103.7.91
 }
 member {
 memberaddr: 10.103.7.92
 }
 ringnumber: 0
 bindnetaddr: 10.103.7.91
 mcastport: 5405
 ttl: 1
 }
  transport: udpu
 }

 with special timings because of issues with the vmware setup.

 Any idea of what could be causing this?

 Regards,

 Yves

 ___
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
 http://oss.clusterlabs.org/mailman/listinfo/pacemaker

 Project Home: http://www.clusterlabs.org
 Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
 Bugs: http://bugs.clusterlabs.org
 ___
 discuss mailing list
 disc...@corosync.org
 http://lists.corosync.org/mailman/listinfo/discuss


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org


Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-08 Thread Jan Friesse
: 11,075,087 bytes in 1,613 blocks 
 ==5453== suppressed: 0 bytes in 0 blocks 
 ==5453== Rerun with --leak-check=full to see details of leaked memory 
 ==5453== 
 ==5453== For counts of detected and suppressed errors, rerun with: -v 
 ==5453== Use --track-origins=yes to see where uninitialised values come from 
 ==5453== ERROR SUMMARY: 715 errors from 3 contexts (suppressed: 2 from 2) 
 Bus error (core dumped) 
 
 
 I was also able to capture non-truncated fdata: 
 http://sources.xes-inc.com/downloads/fdata-20121107 
 
 
 Here is the coredump: 
 http://sources.xes-inc.com/downloads/vgcore.5453 
 
 
 I was not able to get corosync to crash without pacemaker also running, 
 though I was not able to test for a long period of time. 
 
 
 Another thing I discovered tonight was that the 127.0.1.1 entry in /etc/hosts 
 (on both storage0 and storage1) was the source of the extra localhost entry 
 in the cluster. I have removed this extraneous node so now only the 3 real 
 nodes remain and commented out this line in /etc/hosts on all nodes in the 
 cluster. 
 http://burning-midnight.blogspot.com/2012/07/cluster-building-ubuntu-1204-revised.html
  
 
 
 Thanks, 
 
 
 Andrew 
 - Original Message -
 
 From: Jan Friesse jfrie...@redhat.com 
 To: Andrew Martin amar...@xes-inc.com 
 Cc: Angus Salkeld asalk...@redhat.com, disc...@corosync.org, 
 pacemaker@oss.clusterlabs.org 
 Sent: Wednesday, November 7, 2012 2:00:20 AM 
 Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
 cluster 
 
 Andrew, 
 
 Andrew Martin napsal(a): 
 A bit more data on this problem: I was doing some maintenance and had to 
 briefly disconnect storagequorum's connection to the STONITH network 
 (ethernet cable #7 in this diagram): 
 http://sources.xes-inc.com/downloads/storagecluster.png 


 Since corosync has two rings (and is in active mode), this should cause no 
 disruption to the cluster. However, as soon as I disconnected cable #7, 
 corosync on storage0 died (corosync was already stopped on storage1), which 
 caused pacemaker on storage0 to also shutdown. I was not able to obtain a 
 coredump this time as apport is still running on storage0. 
 
 I strongly believe corosync fault is because of original problem you 
 have. Also I would recommend you to try passive mode. Passive mode is 
 better, because if one link fails, passive mode make progress (delivers 
 messages), where active mode doesn't (up to moment, when ring is marked 
 as failed. After that, passive/active behaves same). Also passive mode 
 is much better tested. 
 


 What else can I do to debug this problem? Or, should I just try to downgrade 
 to corosync 1.4.2 (the version available in the Ubuntu repositories)? 
 
 I would really like to find main issue (which looks like libqb one, 
 rather then corosync). But if you decide to downgrade, please downgrade 
 to latest 1.4.x series (1.4.4 for now). 1.4.2 has A LOT of known bugs. 
 


 Thanks, 


 Andrew 
 
 Regards, 
 Honza 
 

 - Original Message - 

 From: Andrew Martin amar...@xes-inc.com 
 To: Angus Salkeld asalk...@redhat.com 
 Cc: disc...@corosync.org, pacemaker@oss.clusterlabs.org 
 Sent: Tuesday, November 6, 2012 2:01:17 PM 
 Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
 cluster 


 Hi Angus, 


 I recompiled corosync with the changes you suggested in exec/main.c to 
 generate fdata when SIGBUS is triggered. Here 's the corresponding coredump 
 and fdata files: 
 http://sources.xes-inc.com/downloads/core.13027 
 http://sources.xes-inc.com/downloads/fdata.20121106 



 (gdb) thread apply all bt 


 Thread 1 (Thread 0x77fec700 (LWP 13027)): 
 #0 0x7775bda3 in qb_rb_chunk_alloc () from /usr/lib/libqb.so.0 
 #1 0x777656b9 in ?? () from /usr/lib/libqb.so.0 
 #2 0x777637ba in qb_log_real_va_ () from /usr/lib/libqb.so.0 
 #3 0x55571700 in ?? () 
 #4 0x77bc7df6 in ?? () from /usr/lib/libtotem_pg.so.5 
 #5 0x77bc1a6d in rrp_deliver_fn () from /usr/lib/libtotem_pg.so.5 
 #6 0x77bbc8e2 in ?? () from /usr/lib/libtotem_pg.so.5 
 #7 0x7775d46f in ?? () from /usr/lib/libqb.so.0 
 #8 0x7775cfe7 in qb_loop_run () from /usr/lib/libqb.so.0 
 #9 0x55560945 in main () 




 I've also been doing some hardware tests to rule it out as the cause of this 
 problem: mcelog has found no problems and memtest finds the memory to be 
 healthy as well. 


 Thanks, 


 Andrew 
 - Original Message - 

 From: Angus Salkeld asalk...@redhat.com 
 To: pacemaker@oss.clusterlabs.org, disc...@corosync.org 
 Sent: Friday, November 2, 2012 8:18:51 PM 
 Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
 cluster 

 On 02/11/12 13:07 -0500, Andrew Martin wrote: 
 Hi Angus, 


 Corosync died again while using libqb 0.14.3. Here is the coredump from 
 today: 
 http://sources.xes-inc.com/downloads/corosync.nov2.coredump 



 # corosync -f 
 notice [MAIN ] Corosync Cluster Engine ('2.1.0

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-08 Thread Jan Friesse
Andrew,
good news. I believe that I've found a reproducer for the problem you are
facing. Now, to be sure it's really the same one, can you please run:
df (the interesting entry is /dev/shm)
and send the output of ls -la /dev/shm?

I believe /dev/shm is full.

Now, as a quick workaround, just delete all qb-* files from /dev/shm and the
cluster should work. There are basically two problems:
- ipc_shm is leaking memory
- if there is no memory left, libqb mmaps non-allocated memory and receives SIGBUS

Angus is working on both issues.

Regards,
  Honza
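
A sketch of the checks and of the workaround; the exact qb-* file names under /dev/shm will differ between systems:

  df -h /dev/shm        # Use% at or near 100% matches the leak described above
  ls -la /dev/shm       # the leaked IPC buffers show up as qb-* files
  rm -f /dev/shm/qb-*   # quick workaround until the libqb fixes are in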

Jan Friesse napsal(a):
 Andrew,
 thanks for valgrind report (even it didn't showed anything useful) and
 blackbox.
 
 We believe that problem is because of access to invalid memory mapped by
 mmap operation. There are basically 3 places where we are doing mmap.
 1.) corosync cpg_zcb functions (I don't believe this is the case)
 2.) LibQB IPC
 3.) LibQB blackbox
 
 Now, because nether me nor Angus are able to reproduce the bug, can you
 please:
 - apply patches Check successful initialization of IPC and Add
 support for selecting IPC type (later versions), or use corosync from
 git (ether needle or master branch, they are same)
 - compile corosync
 - Add
 
 qb {
 ipc_type: socket
 }
 
 to corosync.conf
 - Try running corosync
 
 This may, but may not help solve problem, but it should help us to
 diagnose if problem is or isn't IPC one.
 
 Thanks,
   Honza
 
 Andrew Martin napsal(a):
 Angus and Honza, 


 I recompiled corosync with --enable-debug. Below is a capture of the 
 valgrind output when corosync dies, after switching rrp_mode to passive: 

 # valgrind corosync -f 
 ==5453== Memcheck, a memory error detector 
 ==5453== Copyright (C) 2002-2011, and GNU GPL'd, by Julian Seward et al. 
 ==5453== Using Valgrind-3.7.0 and LibVEX; rerun with -h for copyright info 
 ==5453== Command: corosync -f 
 ==5453== 
 notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to 
 provide service. 
 info [MAIN ] Corosync built-in features: debug pie relro bindnow 
 ==5453== Syscall param socketcall.sendmsg(msg) points to uninitialised 
 byte(s) 
 ==5453== at 0x54D233D: ??? (syscall-template.S:82) 
 ==5453== by 0x4E391E8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3BFC8: totemudp_token_send (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E38CF0: totemnet_token_send (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3F1AF: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E40FB5: totemrrp_token_send (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E47E84: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E45770: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E40AD2: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3C1A4: totemudp_token_target_set (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E38EBC: totemnet_token_target_set (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3F3A8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== Address 0x7feff7f58 is on thread 1's stack 
 ==5453== 
 ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to 
 uninitialised byte(s) 
 ==5453== at 0x54D233D: ??? (syscall-template.S:82) 
 ==5453== by 0x4E39427: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== Address 0x7feffb9da is on thread 1's stack 
 ==5453== 
 ==5453== Syscall param socketcall.sendmsg(msg.msg_iov[i]) points to 
 uninitialised byte(s) 
 ==5453== at 0x54D233D: ??? (syscall-template.S:82) 
 ==5453== by 0x4E39526: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3C042: totemudp_mcast_noflush_send (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E38D8A: totemnet_mcast_noflush_send (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3F03D: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E4104D: totemrrp_mcast_noflush_send (in 
 /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E46CB8: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E49A04: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E4C8E0: main_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E3F0A6: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E409A8: rrp_deliver_fn (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== by 0x4E39967: ??? (in /usr/lib/libtotem_pg.so.5.0.0) 
 ==5453== Address

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-07 Thread Jan Friesse
 that there is a newer version of libqb (0.14.3) out, but didn't see 
 a fix for this particular bug. Could this libqb problem be related to the 
 corosync to hang up? Here's the corresponding corosync log file (next time 
 I should have a core dump as well): 
 http://pastebin.com/5FLKg7We 

 Hi Andrew 

 I can't see much wrong with the log either. If you could run with the 
 latest 
 (libqb-0.14.3) and post a backtrace if it still happens, that would be 
 great. 

 Thanks 
 Angus 



 Thanks, 


 Andrew 

 - Original Message - 

 From: Jan Friesse jfrie...@redhat.com 
 To: Andrew Martin amar...@xes-inc.com 
 Cc: disc...@corosync.org, The Pacemaker cluster resource manager 
 pacemaker@oss.clusterlabs.org 
 Sent: Thursday, November 1, 2012 7:55:52 AM 
 Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster 

 Ansdrew, 
 I was not able to find anything interesting (from corosync point of 
 view) in configuration/logs (corosync related). 

 What would be helpful: 
 - if corosync died, there should be 
 /var/lib/corosync/fdata-DATETTIME-PID of dead corosync. Can you please 
 xz them and store somewhere (they are quiet large but well compressible). 
 - If you are able to reproduce problem (what seems like you are), can 
 you please allow generating of coredumps and store somewhere backtrace 
 of coredump? (coredumps are stored in /var/lib/corosync as core.PID, and 
 way to obtain coredump is gdb corosync /var/lib/corosync/core.pid, and 
 here thread apply all bt). If you are running distribution with ABRT 
 support, you can also use ABRT to generate report. 

 Regards, 
 Honza 

 Andrew Martin napsal(a): 
 Corosync died an additional 3 times during the night on storage1. I wrote 
 a daemon to attempt and start it as soon as it fails, so only one of 
 those times resulted in a STONITH of storage1. 

 I enabled debug in the corosync config, so I was able to capture a period 
 when corosync died with debug output: 
 http://pastebin.com/eAmJSmsQ 
 In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For 
 reference, here is my Pacemaker configuration: 
 http://pastebin.com/DFL3hNvz 

 It seems that an extra node, 16777343 localhost has been added to the 
 cluster after storage1 was STONTIHed (must be the localhost interface on 
 storage1). Is there anyway to prevent this? 

 Does this help to determine why corosync is dying, and what I can do to 
 fix it? 

 Thanks, 

 Andrew 

 - Original Message - 

 From: Andrew Martin amar...@xes-inc.com 
 To: disc...@corosync.org 
 Sent: Thursday, November 1, 2012 12:11:35 AM 
 Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster 


 Hello, 

 I recently configured a 3-node fileserver cluster by building Corosync 
 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running 
 Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are real 
 nodes where the resources run (a DRBD disk, filesystem mount, and 
 samba/nfs daemons), while the third node (storagequorum) is in standby 
 mode and acts as a quorum node for the cluster. Today I discovered that 
 corosync died on both storage0 and storage1 at the same time. Since 
 corosync died, pacemaker shut down as well on both nodes. Because the 
 cluster no longer had quorum (and the no-quorum-policy=freeze), 
 storagequorum was unable to STONITH either node and just left the 
 resources frozen where they were running, on storage0. I cannot find any 
 log information to determine why corosync crashed, and this is a 
 disturbing problem as the cluster and its messaging layer must be stable. 
 Below is my corosync configuration file as well as the corosync log file 
 from eac!
 h! 
 ! 
 n! 
 o! 
 de during 
 this period. 

 corosync.conf: 
 http://pastebin.com/vWQDVmg8 
 Note that I have two redundant rings. On one of them, I specify the IP 
 address (in this example 10.10.10.7) so that it binds to the correct 
 interface (since potentially in the future those machines may have two 
 interfaces on the same subnet). 

 corosync.log from storage0: 
 http://pastebin.com/HK8KYDDQ 

 corosync.log from storage1: 
 http://pastebin.com/sDWkcPUz 

 corosync.log from storagequorum (the DC during this period): 
 http://pastebin.com/uENQ5fnf 

 Issuing service corosync start  service pacemaker start on storage0 and 
 storage1 resolved the problem and allowed the nodes to successfully 
 reconnect to the cluster. What other information can I provide to help 
 diagnose this problem and prevent it from recurring? 

 Thanks, 

 Andrew Martin 

 ___ 
 discuss mailing list 
 disc...@corosync.org 
 http://lists.corosync.org/mailman/listinfo/discuss 





 ___ 
 discuss mailing list 
 disc...@corosync.org 
 http://lists.corosync.org/mailman/listinfo/discuss 



 ___ 
 Pacemaker mailing list: Pacemaker@oss.clusterlabs.org 
 http://oss.clusterlabs.org/mailman

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-07 Thread Jan Friesse
 it will generate it automatically. 
 (I see you are getting a bus error) - :(. 

 -A 



 Thanks, 

 Andrew 

 - Original Message - 

 From: Angus Salkeld asalk...@redhat.com 
 To: pacemaker@oss.clusterlabs.org, disc...@corosync.org 
 Sent: Thursday, November 1, 2012 5:11:23 PM 
 Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes in 
 cluster 

 On 01/11/12 14:32 -0500, Andrew Martin wrote: 
 Hi Honza, 


 Thanks for the help. I enabled core dumps in /etc/security/limits.conf but 
 didn't have a chance to reboot and apply the changes so I don't have a 
 core dump this time. Do core dumps need to be enabled for the 
 fdata-DATETIME-PID file to be generated? right now all that is in 
 /var/lib/corosync are the ringid_XXX files. Do I need to set something 
 explicitly in the corosync config to enable this logging? 


 I did find find something else interesting with libqb this time. I 
 compiled libqb 0.14.2 for use with the cluster. This time when corosync 
 died I noticed the following in dmesg: 
 Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap divide 
 error ip:7f657a52e517 sp:7fffd5068858 error:0 in 
 libqb.so.0.14.2[7f657a525000+1f000] 
 This error was only present for one of the many other times corosync has 
 died. 


 I see that there is a newer version of libqb (0.14.3) out, but didn't see 
 a fix for this particular bug. Could this libqb problem be related to the 
 corosync to hang up? Here's the corresponding corosync log file (next time 
 I should have a core dump as well): 
 http://pastebin.com/5FLKg7We 

 Hi Andrew 

 I can't see much wrong with the log either. If you could run with the 
 latest 
 (libqb-0.14.3) and post a backtrace if it still happens, that would be 
 great. 

 Thanks 
 Angus 



 Thanks, 


 Andrew 

 - Original Message - 

 From: Jan Friesse jfrie...@redhat.com 
 To: Andrew Martin amar...@xes-inc.com 
 Cc: disc...@corosync.org, The Pacemaker cluster resource manager 
 pacemaker@oss.clusterlabs.org 
 Sent: Thursday, November 1, 2012 7:55:52 AM 
 Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster 

 Ansdrew, 
 I was not able to find anything interesting (from corosync point of 
 view) in configuration/logs (corosync related). 

 What would be helpful: 
 - if corosync died, there should be 
 /var/lib/corosync/fdata-DATETIME-PID of the dead corosync. Can you please 
 xz them and store them somewhere? (They are quite large but compress well.) 
 - If you are able to reproduce the problem (which it seems like you are), can 
 you please allow generation of coredumps and store a backtrace of the 
 coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID; 
 the way to obtain a backtrace is to run gdb corosync /var/lib/corosync/core.PID 
 and then thread apply all bt.) If you are running a distribution with ABRT 
 support, you can also use ABRT to generate a report. 

 Regards, 
 Honza 

 Andrew Martin napsal(a): 
 Corosync died an additional 3 times during the night on storage1. I wrote 
 a daemon to attempt to start it as soon as it fails, so only one of 
 those times resulted in a STONITH of storage1. 
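
 Roughly, the daemon is just a polling loop along these lines (a simplified 
 sketch, not the exact script I'm running): 

   #!/bin/sh 
   # restart corosync (and pacemaker on top of it) whenever corosync disappears 
   while true; do 
       if ! pidof corosync >/dev/null 2>&1; then 
           logger -t corosync-watchdog "corosync not running, restarting" 
           service corosync start && service pacemaker start 
       fi 
       sleep 5 
   done 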

 I enabled debug in the corosync config, so I was able to capture a period 
 when corosync died with debug output: 
 http://pastebin.com/eAmJSmsQ 
 In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For 
 reference, here is my Pacemaker configuration: 
 http://pastebin.com/DFL3hNvz 

 It seems that an extra node, 16777343 localhost has been added to the 
 cluster after storage1 was STONITHed (must be the localhost interface on 
 storage1). Is there any way to prevent this? 

 Does this help to determine why corosync is dying, and what I can do to 
 fix it? 

 Thanks, 

 Andrew 

 - Original Message - 

 From: Andrew Martin amar...@xes-inc.com 
 To: disc...@corosync.org 
 Sent: Thursday, November 1, 2012 12:11:35 AM 
 Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster 


 Hello, 

 I recently configured a 3-node fileserver cluster by building Corosync 
 2.1.0 and Pacemaker 1.1.8 from source. All of the nodes are running 
 Ubuntu 12.04 amd64. Two of the nodes (storage0 and storage1) are real 
 nodes where the resources run (a DRBD disk, filesystem mount, and 
 samba/nfs daemons), while the third node (storagequorum) is in standby 
 mode and acts as a quorum node for the cluster. Today I discovered that 
 corosync died on both storage0 and storage1 at the same time. Since 
 corosync died, pacemaker shut down as well on both nodes. Because the 
 cluster no longer had quorum (and the no-quorum-policy=freeze), 
 storagequorum was unable to STONITH either node and just left the 
 resources frozen where they were running, on storage0. I cannot find any 
 log information to determine why corosync crashed, and this is a 
 disturbing problem as the cluster and its messaging layer must be stable. 
 Below is my corosync configuration file as well as the corosync log file 
 from each node during this period. 

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-05 Thread Jan Friesse
Angus Salkeld napsal(a):
 On 02/11/12 13:07 -0500, Andrew Martin wrote:
 Hi Angus,


 Corosync died again while using libqb 0.14.3. Here is the coredump
 from today:
 http://sources.xes-inc.com/downloads/corosync.nov2.coredump



 # corosync -f
 notice [MAIN ] Corosync Cluster Engine ('2.1.0'): started and ready to
 provide service.
 info [MAIN ] Corosync built-in features: pie relro bindnow
 Bus error (core dumped)


 Here's the log: http://pastebin.com/bUfiB3T3


 Did your analysis of the core dump reveal anything?

 
 I can't get any symbols out of these coredumps. Can you try to get a 
 backtrace? 
 

Andrew,
as I wrote in the original mail, a backtrace can be obtained by:

Coredumps are stored in /var/lib/corosync as core.PID; the way to obtain a
backtrace is to run gdb corosync /var/lib/corosync/core.PID and then
thread apply all bt. If you are running a distribution with ABRT
support, you can also use ABRT to generate a report.
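
In concrete terms, something along these lines (a sketch; the PID and date
parts of the file names are whatever your crash actually produced):

  # full backtrace from the core file left in /var/lib/corosync
  gdb -batch -ex 'thread apply all bt' corosync /var/lib/corosync/core.<PID> > bt.txt

  # compress the flight data before uploading it somewhere
  xz -9 /var/lib/corosync/fdata-<DATETIME>-<PID>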

It's also pretty weird that you are getting SIGBUS. SIGBUS is usually the
result of accessing unaligned memory on processors that don't support
unaligned access (for example SPARC). This doesn't seem to be your
case, since you are on AMD64.



 Is there a way for me to make it generate fdata with a bus error, or
 how else can I gather additional information to help debug this?

 
 if you look in exec/main.c and look for SIGSEGV you will see how the
 mechanism
 for fdata works. Just add a handler for SIGBUS and hook it up. Then you
 should
 be able to get the fdata for both.
 
 I'd rather be able to get a backtrace if possible.
 

Also if possible, please try to compile with --enable-debug (both libqb
and corosync) to get as much information as possible.
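
For example (a sketch, assuming the release tarballs are unpacked side by side
and installed to the same prefix you used before):

  (cd libqb-0.14.3   && ./configure --enable-debug && make && sudo make install)
  (cd corosync-2.1.0 && ./configure --enable-debug && make && sudo make install)
  sudo ldconfig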

 -Angus
 

Regards,
  Honza


 Thanks,


 Andrew

 - Original Message -

 From: Angus Salkeld asalk...@redhat.com
 To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
 Sent: Thursday, November 1, 2012 5:47:16 PM
 Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes
 in cluster

 On 01/11/12 17:27 -0500, Andrew Martin wrote:
 Hi Angus,


 I'll try upgrading to the latest libqb tomorrow and see if I can
 reproduce this behavior with it. I was able to get a coredump by
 running corosync manually in the foreground (corosync -f):
 http://sources.xes-inc.com/downloads/corosync.coredump

 Thanks, looking...



 There still isn't anything added to /var/lib/corosync however. What
 do I need to do to enable the fdata file to be created?

 Well if it crashes with SIGSEGV it will generate it automatically.
 (I see you are getting a bus error) - :(.

 -A



 Thanks,

 Andrew

 - Original Message -

 From: Angus Salkeld asalk...@redhat.com
 To: pacemaker@oss.clusterlabs.org, disc...@corosync.org
 Sent: Thursday, November 1, 2012 5:11:23 PM
 Subject: Re: [corosync] [Pacemaker] Corosync 2.1.0 dies on both nodes
 in cluster

 On 01/11/12 14:32 -0500, Andrew Martin wrote:
 Hi Honza,


 Thanks for the help. I enabled core dumps in
 /etc/security/limits.conf but didn't have a chance to reboot and
 apply the changes so I don't have a core dump this time. Do core
 dumps need to be enabled for the fdata-DATETIME-PID file to be
 generated? Right now all that is in /var/lib/corosync are the
 ringid_XXX files. Do I need to set something explicitly in the
 corosync config to enable this logging?


 I did find something else interesting with libqb this time. I
 compiled libqb 0.14.2 for use with the cluster. This time when
 corosync died I noticed the following in dmesg:
 Nov 1 13:21:01 storage1 kernel: [31036.617236] corosync[13305] trap
 divide error ip:7f657a52e517 sp:7fffd5068858 error:0 in
 libqb.so.0.14.2[7f657a525000+1f000]
 This error was only present for one of the many other times corosync
 has died.


 I see that there is a newer version of libqb (0.14.3) out, but
 didn't see a fix for this particular bug. Could this libqb problem
 be related to corosync hanging up? Here's the corresponding
 corosync log file (next time I should have a core dump as well):
 http://pastebin.com/5FLKg7We

 Hi Andrew

 I can't see much wrong with the log either. If you could run with the
 latest
 (libqb-0.14.3) and post a backtrace if it still happens, that would
 be great.

 Thanks
 Angus



 Thanks,


 Andrew

 - Original Message -

 From: Jan Friesse jfrie...@redhat.com
 To: Andrew Martin amar...@xes-inc.com
 Cc: disc...@corosync.org, The Pacemaker cluster resource manager
 pacemaker@oss.clusterlabs.org
 Sent: Thursday, November 1, 2012 7:55:52 AM
 Subject: Re: [corosync] Corosync 2.1.0 dies on both nodes in cluster

 Andrew,
 I was not able to find anything interesting (from corosync point of
 view) in configuration/logs (corosync related).

 What would be helpful:
 - if corosync died, there should be
 /var/lib/corosync/fdata-DATETIME-PID of the dead corosync. Can you please
 xz them and store them somewhere? (They are quite large but compress
 well.)
 - If you are able to reproduce the problem (which it seems like you are), can
 you please allow

Re: [Pacemaker] [corosync] Corosync 2.1.0 dies on both nodes in cluster

2012-11-01 Thread Jan Friesse
Andrew,
I was not able to find anything interesting (from corosync point of
view) in configuration/logs (corosync related).

What would be helpful:
- if corosync died, there should be
/var/lib/corosync/fdata-DATETIME-PID of the dead corosync. Can you please
xz them and store them somewhere? (They are quite large but compress well.)
- If you are able to reproduce the problem (which it seems like you are), can
you please allow generation of coredumps and store a backtrace of the
coredump somewhere? (Coredumps are stored in /var/lib/corosync as core.PID;
the way to obtain a backtrace is to run gdb corosync /var/lib/corosync/core.PID
and then thread apply all bt.) If you are running a distribution with ABRT
support, you can also use ABRT to generate a report.

Regards,
  Honza

Andrew Martin napsal(a):
 Corosync died an additional 3 times during the night on storage1. I wrote a 
 daemon to attempt to start it as soon as it fails, so only one of those 
 times resulted in a STONITH of storage1. 
 
 I enabled debug in the corosync config, so I was able to capture a period 
 when corosync died with debug output: 
 http://pastebin.com/eAmJSmsQ 
 In this example, Pacemaker finishes shutting down by Nov 01 05:53:02. For 
 reference, here is my Pacemaker configuration: 
 http://pastebin.com/DFL3hNvz 
 
 It seems that an extra node, 16777343 localhost has been added to the 
 cluster after storage1 was STONITHed (must be the localhost interface on 
 storage1). Is there any way to prevent this? 
 
 Does this help to determine why corosync is dying, and what I can do to fix 
 it? 
 
 Thanks, 
 
 Andrew 
 
 - Original Message -
 
 From: Andrew Martin amar...@xes-inc.com 
 To: disc...@corosync.org 
 Sent: Thursday, November 1, 2012 12:11:35 AM 
 Subject: [corosync] Corosync 2.1.0 dies on both nodes in cluster 
 
 
 Hello, 
 
 I recently configured a 3-node fileserver cluster by building Corosync 2.1.0 
 and Pacemaker 1.1.8 from source. All of the nodes are running Ubuntu 12.04 
 amd64. Two of the nodes (storage0 and storage1) are real nodes where the 
 resources run (a DRBD disk, filesystem mount, and samba/nfs daemons), while 
 the third node (storagequorum) is in standby mode and acts as a quorum node 
 for the cluster. Today I discovered that corosync died on both storage0 and 
 storage1 at the same time. Since corosync died, pacemaker shut down as well 
 on both nodes. Because the cluster no longer had quorum (and the 
 no-quorum-policy=freeze), storagequorum was unable to STONITH either node 
 and just left the resources frozen where they were running, on storage0. I 
 cannot find any log information to determine why corosync crashed, and this 
 is a disturbing problem as the cluster and its messaging layer must be 
 stable. Below is my corosync configuration file as well as the corosync log 
 file from each node during this period. 
 
 corosync.conf: 
 http://pastebin.com/vWQDVmg8 
 Note that I have two redundant rings. On one of them, I specify the IP 
 address (in this example 10.10.10.7) so that it binds to the correct 
 interface (since potentially in the future those machines may have two 
 interfaces on the same subnet). 
 
 corosync.log from storage0: 
 http://pastebin.com/HK8KYDDQ 
 
 corosync.log from storage1: 
 http://pastebin.com/sDWkcPUz 
 
 corosync.log from storagequorum (the DC during this period): 
 http://pastebin.com/uENQ5fnf 
 
 Issuing service corosync start && service pacemaker start on storage0 and 
 storage1 resolved the problem and allowed the nodes to successfully reconnect 
 to the cluster. What other information can I provide to help diagnose this 
 problem and prevent it from recurring? 
 
 Thanks, 
 
 Andrew Martin 
 
 ___ 
 discuss mailing list 
 disc...@corosync.org 
 http://lists.corosync.org/mailman/listinfo/discuss 
 
 
 
 
 


___
Pacemaker mailing list: Pacemaker@oss.clusterlabs.org
http://oss.clusterlabs.org/mailman/listinfo/pacemaker

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org