I'm seeing this problem in another environment with a similar deployment (3 LXC containers):

Apr 20 16:39:26 juju-machine-3-lxc-4 crm_verify[31774]: notice: crm_log_args: Invoked: crm_verify -V -p
Apr 20 16:39:27 juju-machine-3-lxc-4 cibadmin[31786]: notice: crm_log_args: Invoked: cibadmin -p -P
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[780]: error: cib_cs_destroy: Corosync connection lost! Exiting.
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]: error: crmd_quorum_destroy: connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[785]: notice: crmd_exit: Forcing immediate exit: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: warning: qb_ipcs_event_sendv: new_event_notification (782-785-6): Bad file descriptor (9)
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: warning: send_client_notify: Notification of client crmd/8ad990ba-cf09-4ba3-b74b-a7d05d377a1b failed
Apr 20 16:50:01 juju-machine-3-lxc-4 lrmd[782]: error: crm_abort: crm_glib_handler: Forked child 760 to record non-fatal assert at logging.c:63 : Source ID 4601370 was not found when attempting to remove it
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_child_exit: Child process cib (780) exited: Invalid argument (22)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: notice: pcmk_process_exit: Respawning failed child process: cib
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_child_exit: Child process crmd (785) exited: Link has been severed (67)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: notice: pcmk_process_exit: Respawning failed child process: crmd
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: crit: attrd_cs_destroy: Lost connection to Corosync service!
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: notice: main: Exiting...
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: notice: main: Disconnecting client 0x7ff985e478e0, pid=785...
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: pcmk_cpg_dispatch: Connection to the CPG API failed: Library error (2)
Apr 20 16:50:01 juju-machine-3-lxc-4 pacemakerd[773]: error: mcp_cpg_destroy: Connection destroyed
Apr 20 16:50:01 juju-machine-3-lxc-4 attrd[783]: error: attrd_cib_connection_destroy: Connection to the CIB terminated...
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]: debug: crm_update_callsites: Enabling callsites based on priority=7, files=(null), functions=(null), formats=(null), tags=(null)
Apr 20 16:50:01 juju-machine-3-lxc-4 crmd[767]: notice: main: CRM Git Version: 42f2063
Apr 20 16:50:01 juju-machine-3-lxc-4 stonith-ng[781]: error: stonith_peer_cs_destroy: Corosync connection terminated
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 2
Apr 20 16:50:01 juju-machine-3-lxc-4 cib[761]: crit: cib_init: Cannot sign in to the cluster... terminating
Apr 20 16:50:02 juju-machine-3-lxc-4 crmd[767]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Apr 20 16:50:05 juju-machine-3-lxc-4 crmd[767]: warning: do_cib_control: Couldn't complete CIB registration 2 times... pause and retry

These are the only processes running in one of the nodes:

root 782 0.0 0.0 81464 1828 ? Ss Feb12 25:13 /usr/lib/pacemaker/lrmd
haclust+ 784 0.0 0.0 73920 776 ? Ss Feb12 8:25 /usr/lib/pacemaker/pengine
root 780 0.8 0.0 130256 4152 ? Ssl 16:50 0:00 /usr/sbin/corosync

A possible explanation could be:
http://thread.gmane.org/gmane.linux.highavailability.corosync/592/focus=639

I only have logs for one of the nodes, I'm trying to get logs of the other 2 nodes to get a better understanding of what was happening with the communication.

-- 
You received this bug notification because you are a member of Ubuntu High Availability Team, which is subscribed to pacemaker in Ubuntu.
https://bugs.launchpad.net/bugs/1439649

Title: Pacemaker unable to communicate with corosync on restart under lxc

Status in lxc package in Ubuntu: Confirmed
Status in pacemaker package in Ubuntu: Confirmed

Bug description:
We've seen this a few times with three node clusters, all running in LXC containers; pacemaker fails to restart correctly as it can't communicate with corosync, resulting in a down cluster. Rebooting the containers resolves the issue, so suspect some sort of bad state either in corosync or pacemaker.

Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: mcp_read_config: Configured corosync to accept connections from group 115: Library error (2)
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: main: Starting Pacemaker 1.1.10 (Build: 42f2063): generated-manpages agent-manpages ncurses libqb-logging libqb-ipc lha-fencing upstart nagios heartbeat corosync-native snmp libesmtp
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: cluster_connect_quorum: Quorum acquired
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1000
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1003
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-4-lxc-4[1001] - state is now member (was (null))
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1003
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1003] - state is now member (was (null))
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: main: CRM Git Version: 42f2063
Apr 2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]: notice: corosync_node_name: Unable to get node name for nodeid 1001
Apr 2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [MAIN ] Denied connection attempt from 109:115
Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [QB ] Invalid IPC credentials (1033732-1033746).
Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: error: main: HA Signon failed
Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: error: main: Aborting startup
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: error: pcmk_child_exit: Child process attrd (1033746) exited: Network is down (100)
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping crmd: Sent -15 to process 1033748
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
Apr 2 11:41:32 juju-machine-4-lxc-4 cib[1033743]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: terminate_cs_connection: Disconnecting from Corosync
Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [MAIN ] Denied connection attempt from 109:115
Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [QB ] Invalid IPC credentials (1033732-1033743).
Apr 2 11:41:32 juju-machine-4-lxc-4 cib[1033743]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
Apr 2 11:41:32 juju-machine-4-lxc-4 cib[1033743]: crit: cib_init: Cannot sign in to the cluster... terminating
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping pengine: Sent -15 to process 1033747
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: error: pcmk_child_exit: Child process cib (1033743) exited: Network is down (100)
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping lrmd: Sent -15 to process 1033745
Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 1033744
Apr 2 11:41:34 juju-machine-4-lxc-4 corosync[1033732]: [TOTEM ] A new membership (10.245.160.62:284) was formed. Members joined: 1000
Apr 2 11:41:41 juju-machine-4-lxc-4 stonith-ng[1033744]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
Apr 2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: pcmk_shutdown_worker: Shutdown complete
Apr 2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error

ProblemType: Bug
DistroRelease: Ubuntu 14.04
Package: pacemaker 1.1.10+git20130802-1ubuntu2.3
ProcVersionSignature: User Name 3.16.0-33.44~14.04.1-generic 3.16.7-ckt7
Uname: Linux 3.16.0-33-generic x86_64
NonfreeKernelModules: vhost_net vhost macvtap macvlan xt_conntrack ipt_REJECT ip6table_filter ip6_tables ebtable_nat ebtables veth 8021q garp xt_CHECKSUM mrp iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables nbd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi openvswitch gre vxlan dm_crypt bridge dm_multipath intel_rapl stp scsi_dh x86_pkg_temp_thermal llc intel_powerclamp coretemp ioatdma kvm_intel ipmi_si joydev sb_edac kvm hpwdt hpilo dca ipmi_msghandler acpi_power_meter edac_core lpc_ich shpchp serio_raw mac_hid xfs libcrc32c btrfs xor raid6_pq hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse tg3 ptp pata_acpi hpsa pps_core
ApportVersion: 2.14.1-0ubuntu3.7
Architecture: amd64
Date: Thu Apr 2 11:42:18 2015
SourcePackage: pacemaker
UpgradeStatus: No upgrade log present (probably fresh install)

