[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-11-09 Thread Stéphane Graber
** No longer affects: lxc (Ubuntu)

-- 
You received this bug notification because you are a member of Ubuntu
Touch seeded packages, which is subscribed to lxc in Ubuntu.
https://bugs.launchpad.net/bugs/1439649

Title:
  Pacemaker unable to communicate with corosync on restart under lxc

Status in pacemaker package in Ubuntu:
  Confirmed

Bug description:
  We've seen this a few times with three-node clusters, all running in
  LXC containers; pacemaker fails to restart correctly as it can't
  communicate with corosync, resulting in a down cluster.  Rebooting the
  containers resolves the issue, so we suspect some sort of bad state
  either in corosync or pacemaker.

  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: mcp_read_config: Configured corosync to accept connections from group 115: Library error (2)
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: main: Starting Pacemaker 1.1.10 (Build: 42f2063):  generated-manpages agent-manpages ncurses libqb-logging libqb-ipc lha-fencing upstart nagios  heartbeat corosync-native snmp libesmtp
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: cluster_connect_quorum: Quorum acquired
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1000
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1003
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-4-lxc-4[1001] - state is now member (was (null))
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: corosync_node_name: Unable to get node name for nodeid 1003
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1003] - state is now member (was (null))
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: main: CRM Git Version: 42f2063
  Apr  2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]:   notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr  2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]:   notice: get_node_name: Defaulting to uname -n for the local corosync node name
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [MAIN  ] Denied connection attempt from 109:115
  Apr  2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]:  [QB] Invalid IPC credentials (1033732-1033746).
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:    error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:    error: main: HA Signon failed
  Apr  2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]:    error: main: Aborting startup
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:    error: pcmk_child_exit: Child process attrd (1033746) exited: Network is down (100)
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:  warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: pcmk_shutdown_worker: Shuting down Pacemaker
  Apr  2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]:   notice: stop_child: Stopping crmd: Sent -15 to process 1033748
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:  warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: crm_shutdown: Requesting shutdown, upper limit is 120ms
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:  warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
  Apr  2 11:41:32 juju-machine-4-lxc-4 cib[1033743]:   notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr  2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]:   notice: terminate_cs_connection: Disconnecting from Corosync
  Apr  2 11:41:32 juju-machine-4-lx

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-09-04 Thread Billy Olsen
Serge,

I did double check that the pacemaker processes were running under the
hacluster/haclient uid/gid. I will check again for my own sanity (I may
have seen one running as root). However, according to the pacemaker docs
that I referenced above, the root and hacluster users should always have
full access (which is somewhat in conflict with the INSTALL file you
reference):

> Users are regular UNIX users, so the same user accounts must be present on
> all nodes in the cluster.
>
> All user accounts must be in the haclient group.
>
> Pacemaker 1.1.5 or newer must be installed on all cluster nodes.
>
> The CIB must be configured to use the pacemaker-1.1 or 1.2 schema. This can
> be set by running:
>
> cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.1"/>'
>
> The enable-acl option must be set. If ACLs are not explicitly enabled, the
> previous behaviour will be used (i.e. all users in the haclient group have
> full access):
>
> crm configure property enable-acl=true
>
> Once this is done, ACLs can be configured as described below.
>
> Note that the root and hacluster users will always have full access.
>
> If nonprivileged users will be using the crm shell and CLI tools (as opposed
> to only using Hawk or the Python GUI) they will need to have /usr/sbin added
> to their path.

If adding the ACL entry were a necessity, then I would have expected the
hacluster charm code to have always needed it and pacemaker to have always
been denied access. Additionally, since the charm has done no configuration
of the ACLs, I would expect all nodes to be denied or allowed in the same
way. Instead, what has been observed is that on *some* of the nodes in the
cluster the pacemaker processes successfully communicate with the corosync
process, while others get the invalid credentials error seen above.

I've already proposed a change (which has been merged into the /next
branches of the hacluster charm) incorporating JuanJo's comments (thank you
JuanJo!) by explicitly defining the ACL entry, but I would like to better
understand why the behavior is inconsistent.
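
For reference, a quick way to confirm what the daemons are actually running
as (a sketch only, not taken from this deployment; exact process names and
IDs will vary):

$ ps -eo pid,user,group,comm | egrep 'corosync|pacemakerd|cib|stonith|attrd|crmd'
$ getent passwd hacluster
$ getent group haclient

If attrd and friends show up as hacluster/haclient while corosync still logs
"Invalid IPC credentials", the denial would appear to come from corosync's
own uid/gid allow-list (the uidgid.d mechanism quoted in the next message)
rather than from Pacemaker's ACLs.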

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-09-04 Thread Serge Hallyn
Hi Billy,

So can you confirm that pacemaker *is* running under haclient/hacluster
uid/gid?

Note, the comments above don't seem correct to me.  The 'INSTALL' file
shipped with corosync says:

> Before running any of the test programs
> ---
> The corosync executive will ensure security by only allowing the UID 0(root) or
> GID 0(root) to connect to it.  To allow other users to access the corosync
> executive, create a directory called /etc/corosync/uidgid.d and place a file in
> it named in some way that is identifiable to you.  All files in this directory
> will be scanned and their contents added to the allowed uid gid database.  The
> contents of this file should be
> uidgid {
> uid: username
> gid: groupname
> }
> Please note that these users then have full ability to transmit and receive
> messages in the cluster and are not bound by the threat model described in
> SECURITY.

So the 'workaround' in comment #14 seems to be not a workaround but
required configuration (for the charm to do).
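
In concrete terms, the required configuration would be something like the
following (a sketch using the hacluster/haclient user and group created by
the packaging; it matches the comment #14 workaround quoted later in this
thread):

# cat > /etc/corosync/uidgid.d/hacluster <<'EOF'
uidgid {
  uid: hacluster
  gid: haclient
}
EOF
# service corosync restart
# service pacemaker restart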

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-09-03 Thread Billy Olsen
** Attachment added: "syslog"
   https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1439649/+attachment/4457039/+files/syslog

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-09-03 Thread Billy Olsen
Looking at the logs from bug 1491228, it would appear that the first time
pacemaker goes to talk to the corosync daemon it gets denied. Per the
upstream docs [0], if the enable-acl property isn't explicitly enabled,
then any user in the haclient group should have access. Since the
hacluster charm doesn't explicitly enable the ACL, I'd expect pacemaker
to be running under the haclient/hacluster uid/gid.

The charms don't enable the ACL and the package creates the hacluster
user in the haclient group, so I suspect there's something additional
going on here. For completeness, I'll attach the syslog here. The
pacemaker node did start after following JuanJo's workaround in comment
#15.

[0] - http://clusterlabs.org/doc/acls.html
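
As a side note, whether enable-acl has ever been set on a given cluster can
be checked with the crmsh tooling already used in this thread (a sketch;
output will differ per deployment):

$ sudo crm configure show | grep -i enable-acl
$ id hacluster

If the property does not appear in the output, ACLs are not enabled and,
per the docs above, any member of haclient should be granted access.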

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-09-03 Thread David Britton
** Tags added: landscpae

** Tags removed: landscpae
** Tags added: landscape

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-08-31 Thread JuanJo Ciarlante
FYI: to re-check the workaround (and then a possible actual fix), I kicked
corosync+pacemaker on the cinder and glance services deployed with juju:

$ juju run --service=cinder,glance "service corosync restart; service pacemaker restart"

which broke the pacemaker start on all of them, with the same "Invalid IPC
credentials" error (http://paste.ubuntu.com/12240477/), and then 'crm status'
etc. obviously failing.

Fixed using the comment #14 workaround:

$ juju run --service=glance,cinder "echo -e 'uidgid {\n uid: hacluster\n gid: haclient\n}' > /etc/corosync/uidgid.d/hacluster; service corosync restart; service pacemaker restart"

$ juju run --service=glance,cinder "crm status"
=> Ok

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-08-31 Thread JuanJo Ciarlante
After trying several corosync/pacemaker restarts without luck,
I was able to work around this by adding a 'uidgid'
entry for hacluster:haclient:

* from /var/log/syslog:
Aug 31 18:33:18 juju-machine-3-lxc-3 corosync[901082]:  [MAIN  ] Denied connection attempt from 108:113
$ getent passwd 108
hacluster:x:108:113::/var/lib/heartbeat:/bin/false
$ getent group 113
haclient:x:113:

* add uidgid config:
# echo $'uidgid {\n  uid: hacluster\n  gid: haclient\n}' > /etc/corosync/uidgid.d/hacluster

* restart => Ok (crm status, etc.)

I can't explain why other units are working OK without
this ACL addition (a race at service setup/start?).

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-06-04 Thread Jill Rouleau
I can reproduce with corosync 2.3.3.  Using corosync 2.3.4 from
ppa:mariosplivalo/corosync on trusty, I've not been able to reproduce in
2 tries.
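
In case anyone else wants to repeat this test, the upgrade on trusty should
be roughly as follows (a sketch; the PPA name is taken from the comment
above, and the exact set of packages to upgrade may differ):

$ sudo add-apt-repository ppa:mariosplivalo/corosync
$ sudo apt-get update
$ sudo apt-get install --only-upgrade corosync
$ corosync -v    # should now report 2.3.4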

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-05-27 Thread Mario Splivalo
A restart of pacemaker and corosync (down pacemaker on all units, down
corosync on all units, start corosync, verify all is good, start
pacemaker) should resolve the issue.
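
Expressed with the juju run convention used earlier in this thread, that
ordered restart might look roughly like the following (a sketch; substitute
your own service names, and make sure corosync is healthy everywhere before
starting pacemaker anywhere):

$ juju run --service=cinder,glance "service pacemaker stop"
$ juju run --service=cinder,glance "service corosync stop"
$ juju run --service=cinder,glance "service corosync start"
$ juju run --service=cinder,glance "corosync-quorumtool -s"    # verify membership/quorum first
$ juju run --service=cinder,glance "service pacemaker start"
$ juju run --service=cinder,glance "crm status"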

@Jill: are you able to reproduce the issue? I'm assuming you're running
trusty - can you try reproducing using corosync and related packages
from vivid?

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-05-27 Thread Jill Rouleau
We've run into this problem after an extended MAAS/DHCP outage with
expiring leases on metals and units.  All hacluster-deployed LXCs
(openstack-ha services) lost corosync-pacemaker connectivity with
"corosync Invalid IPC credentials", which was resolved with LXC reboots.
This is a staging cloud, so we could take down maas-dhcp to replicate/test.

** Tags added: canonical-bootstack

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-05-07 Thread Serge Hallyn
@jamespage,

I assume the answer to comment #5 was no, the test package didn't fix it?
Was this in the end due to the MTU issue?

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-05-06 Thread Felipe Reyes
** Tags added: sts

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-05-06 Thread Launchpad Bug Tracker
Status changed to 'Confirmed' because the bug affects multiple users.

** Changed in: lxc (Ubuntu)
   Status: New => Confirmed

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-05-06 Thread Felipe Reyes
I'm seeing this problem in another environment with a similar
deployment (3 LXC containers).

Apr 20 16:39:26 juju-machine-3-lxc-4 crm_verify[31774]:   notice: crm_log_args: 
Invoked: crm_verify -V -p 
Apr 20 16:39:27 juju-machine-3-lxc-4 cibadmin[31786]:   notice: crm_log_args: 
Invoked: cibadmin -p -P 
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]:error: pcmk_cpg_dispatch: 
Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]:error: cib_cs_destroy: 
Corosync connection lost!  Exiting.
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]:error: crmd_quorum_destroy: 
connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]:error: pcmk_cpg_dispatch: 
Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]:error: 
pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]:   notice: crmd_exit: Forcing 
immediate exit: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]:  warning: qb_ipcs_event_sendv: 
new_event_notification (782-785-6): Bad file descriptor (9)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]:  warning: send_client_notify: 
Notification of client crmd/8ad990ba-cf09-4ba3-b74b-a7d05d377a1b failed
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]:error: crm_abort: 
crm_glib_handler: Forked child 760 to record non-fatal assert at logging.c:63 : 
Source ID 4601370 was not found when attempting to remove it
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:error: 
pcmk_child_exit: Child process cib (780) exited: Invalid argument (22)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:   notice: 
pcmk_process_exit: Respawning failed child process: cib
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:error: 
pcmk_child_exit: Child process crmd (785) exited: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:   notice: 
pcmk_process_exit: Respawning failed child process: crmd
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: crit: attrd_cs_destroy: 
Lost connection to Corosync service!
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]:   notice: main: Exiting...
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]:   notice: main: Disconnecting 
client 0x7ff985e478e0, pid=785...
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:error: 
pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]:error: 
mcp_cpg_destroy: Connection destroyed
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]:error: 
attrd_cib_connection_destroy: Connection to the CIB terminated...
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]:debug: crm_update_callsites: 
Enabling callsites based on priority=7, files=(null), functions=(null), 
formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]:debug: crm_update_callsites: 
Enabling callsites based on priority=7, files=(null), functions=(null), 
formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]:   notice: main: CRM Git 
Version: 42f2063
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]:error: 
stonith_peer_cs_destroy: Corosync connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]:   notice: crm_cluster_connect: 
Connecting to cluster infrastructure: corosync
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]:error: cluster_connect_cpg: 
Could not connect to the Cluster Process Group API: 2
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: crit: cib_init: Cannot sign 
in to the cluster... terminating
Apr 20 16:50:02 juju-machine-3-lxc-4 crmd[767]:  warning: do_cib_control: 
Couldn't complete CIB registration 1 times... pause and retry
Apr 20 16:50:05 juju-machine-3-lxc-4 crmd[767]:  warning: do_cib_control: 
Couldn't complete CIB registration 2 times... pause and retry

These are the only processes running on one of the nodes:

root       782  0.0  0.0  81464  1828 ?  Ss   Feb12  25:13  /usr/lib/pacemaker/lrmd
haclust+   784  0.0  0.0  73920   776 ?  Ss   Feb12   8:25  /usr/lib/pacemaker/pengine
root       780  0.8  0.0 130256  4152 ?  Ssl  16:50   0:00  /usr/sbin/corosync


A possible explanation could be: 
http://thread.gmane.org/gmane.linux.highavailability.corosync/592/focus=639

I only have logs for one of the nodes; I'm trying to get the logs from
the other two nodes to better understand what was happening with the
communication.
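
Once the logs from the other two nodes arrive, a per-node snapshot
along these lines should make the comparison easier (standard
corosync/pacemaker tools only; the membership output will of course
differ per node):

  corosync-cfgtool -s        # ring status as seen by the local corosync
  corosync-quorumtool -s     # quorum view and member list
  crm_mon -1                 # one-shot Pacemaker status, if crmd is up
  # which cluster daemons are still running, and since when
  ps -eo pid,lstart,cmd | grep -E '[c]orosync|[p]acemaker'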

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-04-16 Thread Mario Splivalo
Hi, guys.

I am also actively trying to reproduce a similar issue (one that our
customer complained about). I'm deploying a three-unit percona-cluster
(each unit in a separate LXC container on a separate physical machine)
together with a three-unit keystone, both services with hacluster
subordinated, but I have failed to reproduce it so far. We made sure
everything was set up correctly, ran 'keystone service-list' and
similar commands, then did a hard reset via IPMI of the node where the
percona-cluster VIP resides: corosync/pacemaker did their job, the VIP
moved, and 'keystone service-list' is happy.
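
Roughly, the reproduction looks like this (juju 1.x syntax; the
placement targets, VIP and IPMI details are placeholders rather than
the exact values used in this deployment):

  juju deploy percona-cluster --to lxc:1
  juju add-unit percona-cluster --to lxc:2
  juju add-unit percona-cluster --to lxc:3
  juju deploy hacluster percona-hacluster
  juju add-relation percona-cluster percona-hacluster
  juju set percona-cluster vip=10.0.0.100
  # hard-reset the machine currently holding the VIP
  ipmitool -I lanplus -H <bmc-address> -U <user> -P <password> chassis power reset
  # then confirm the VIP moved and the API still answers
  keystone service-list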

I'll update here with more info as I go along.

[Touch-packages] [Bug 1439649] Re: Pacemaker unable to communicate with corosync on restart under lxc

2015-04-16 Thread James Page
I've been trying to reproduce this under KVM and on hardware (no LXC),
but I'm unable to reproduce the problem, so this appears to be
isolated to LXC.

The pacemaker <-> corosync communication occurs over IPC implemented
using shared memory.  I'm wondering whether this is managing to get into
an inconsistent state.
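
One way to look for that kind of stale state: libqb keeps its IPC
segments as files under /dev/shm, so anything left over from a
previous corosync/pacemaker incarnation should be visible there. A
rough check (the file name shown is only an example of the usual
naming pattern, which embeds the PIDs of the two ends):

  # inside the affected container
  ls -l /dev/shm | grep qb-
  # entries such as qb-cib_rw-response-780-785-27 whose PIDs no longer
  # exist would be leftovers; with corosync and pacemaker fully
  # stopped, clearing them before the next restart would be an
  # interesting test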

The other thing worth noting is that in the impacted deployment there
are multiple corosync/pacemaker clusters running on the same physical
host, each under a different LXC container.
