Serge, I did double-check that the pacemaker processes were running under the hacluster/haclient uid/gid. I will re-check for my own sanity (I may have seen one running as root). However, according to the pacemaker docs that I referenced above, the root and hacluster users should always have full access (which is somewhat in conflict with the INSTALL file you reference):
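As a quick sanity check, the numeric pair in corosync's denial message can be pulled apart and mapped back to user/group names on the affected node. This is only a sketch using the log line from this bug; the `getent`/`ps` follow-ups are the host-specific part and the expected names are my assumption based on the discussion above:

```shell
# Extract the uid:gid pair that corosync refused (log line copied from this bug).
line='Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [MAIN ] Denied connection attempt from 109:115'
ids=$(printf '%s\n' "$line" | sed -n 's/.*Denied connection attempt from \([0-9]*:[0-9]*\).*/\1/p')
uid=${ids%%:*}
gid=${ids##*:}
echo "denied uid=$uid gid=$gid"

# On the affected container itself, map these back to names and re-check the
# running daemons (commented out here since the results are host-specific):
#   getent passwd "$uid"    # expect hacluster, per the discussion above
#   getent group "$gid"     # expect haclient
#   ps -o user,group,comm -C pacemakerd,cib,attrd,stonith-ng
```

If `getent` resolves 109:115 to hacluster:haclient on only some nodes, that would be a concrete lead on the inconsistency.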
> Users are regular UNIX users, so the same user accounts must be present on
> all nodes in the cluster.
>
> All user accounts must be in the haclient group.
>
> Pacemaker 1.1.5 or newer must be installed on all cluster nodes.
>
> The CIB must be configured to use the pacemaker-1.1 or 1.2 schema. This can
> be set by running:
>
>     cibadmin --modify --xml-text '<cib validate-with="pacemaker-1.1"/>'
>
> The enable-acl option must be set. If ACLs are not explicitly enabled, the
> previous behaviour will be used (i.e. all users in the haclient group have
> full access):
>
>     crm configure property enable-acl=true
>
> Once this is done, ACLs can be configured as described below.
>
> Note that the root and hacluster users will always have full access.
>
> If nonprivileged users will be using the crm shell and CLI tools (as opposed
> to only using Hawk or the Python GUI) they will need to have /usr/sbin added
> to their path.

If adding the ACL entry were a hard requirement, then I would have expected the hacluster charm code to have always needed it, and pacemaker to have always been denied access. Additionally, since the charm does no configuration of the ACLs, I would expect all nodes to be denied or allowed uniformly. Instead, what has been observed is that on *some* of the nodes in the cluster the pacemaker process successfully communicates with the corosync process, while others get the invalid-credentials error shown below. I've already proposed a change (which has been merged into the /next branches of the hacluster charm) that incorporates JuanJo's comments (thank you JuanJo!) by explicitly defining the ACL entry, but I would like to better understand why the behavior is inconsistent.

--
You received this bug notification because you are a member of Ubuntu High
Availability Team, which is subscribed to pacemaker in Ubuntu.
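For reference, the "ACL entry" being discussed for corosync IPC access is a uidgid directive. A minimal sketch follows; the exact location (a uidgid.d drop-in versus a block in corosync.conf itself) is my assumption, not taken from the merged charm change:

```
# /etc/corosync/uidgid.d/hacluster  (assumed path; corosync also accepts
# uidgid {} blocks directly in /etc/corosync/corosync.conf)
uidgid {
    uid: hacluster
    gid: haclient
}
```

With this in place, corosync should accept IPC connections from processes running as hacluster:haclient regardless of any compiled-in defaults.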
https://bugs.launchpad.net/bugs/1439649

Title:
  Pacemaker unable to communicate with corosync on restart under lxc

Status in lxc package in Ubuntu:
  Confirmed
Status in pacemaker package in Ubuntu:
  Confirmed

Bug description:
  We've seen this a few times with three node clusters, all running in
  LXC containers; pacemaker fails to restart correctly as it can't
  communicate with corosync, resulting in a down cluster. Rebooting the
  containers resolves the issue, so suspect some sort of bad state
  either in corosync or pacemaker.

  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: mcp_read_config: Configured corosync to accept connections from group 115: Library error (2)
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: main: Starting Pacemaker 1.1.10 (Build: 42f2063): generated-manpages agent-manpages ncurses libqb-logging libqb-ipc lha-fencing upstart nagios heartbeat corosync-native snmp libesmtp
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: cluster_connect_quorum: Quorum acquired
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1000
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1003
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: crm_update_peer_state: pcmk_quorum_notification: Node juju-machine-4-lxc-4[1001] - state is now member (was (null))
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: corosync_node_name: Unable to get node name for nodeid 1003
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: crm_update_peer_state: pcmk_quorum_notification: Node (null)[1003] - state is now member (was (null))
  Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: main: CRM Git Version: 42f2063
  Apr 2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr 2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]: notice: corosync_node_name: Unable to get node name for nodeid 1001
  Apr 2 11:41:32 juju-machine-4-lxc-4 stonith-ng[1033744]: notice: get_node_name: Defaulting to uname -n for the local corosync node name
  Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [MAIN ] Denied connection attempt from 109:115
  Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [QB ] Invalid IPC credentials (1033732-1033746).
  Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
  Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: error: main: HA Signon failed
  Apr 2 11:41:32 juju-machine-4-lxc-4 attrd[1033746]: error: main: Aborting startup
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: error: pcmk_child_exit: Child process attrd (1033746) exited: Network is down (100)
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: warning: pcmk_child_exit: Pacemaker child process attrd no longer wishes to be respawned. Shutting ourselves down.
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: pcmk_shutdown_worker: Shuting down Pacemaker
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping crmd: Sent -15 to process 1033748
  Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: warning: do_cib_control: Couldn't complete CIB registration 1 times... pause and retry
  Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: crm_shutdown: Requesting shutdown, upper limit is 1200000ms
  Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: warning: do_log: FSA: Input I_SHUTDOWN from crm_shutdown() received in state S_STARTING
  Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: do_state_transition: State transition S_STARTING -> S_STOPPING [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
  Apr 2 11:41:32 juju-machine-4-lxc-4 cib[1033743]: notice: crm_cluster_connect: Connecting to cluster infrastructure: corosync
  Apr 2 11:41:32 juju-machine-4-lxc-4 crmd[1033748]: notice: terminate_cs_connection: Disconnecting from Corosync
  Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [MAIN ] Denied connection attempt from 109:115
  Apr 2 11:41:32 juju-machine-4-lxc-4 corosync[1033732]: [QB ] Invalid IPC credentials (1033732-1033743).
  Apr 2 11:41:32 juju-machine-4-lxc-4 cib[1033743]: error: cluster_connect_cpg: Could not connect to the Cluster Process Group API: 11
  Apr 2 11:41:32 juju-machine-4-lxc-4 cib[1033743]: crit: cib_init: Cannot sign in to the cluster... terminating
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping pengine: Sent -15 to process 1033747
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: error: pcmk_child_exit: Child process cib (1033743) exited: Network is down (100)
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: warning: pcmk_child_exit: Pacemaker child process cib no longer wishes to be respawned. Shutting ourselves down.
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping lrmd: Sent -15 to process 1033745
  Apr 2 11:41:32 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: stop_child: Stopping stonith-ng: Sent -15 to process 1033744
  Apr 2 11:41:34 juju-machine-4-lxc-4 corosync[1033732]: [TOTEM ] A new membership (10.245.160.62:284) was formed. Members joined: 1000
  Apr 2 11:41:41 juju-machine-4-lxc-4 stonith-ng[1033744]: error: setup_cib: Could not connect to the CIB service: Transport endpoint is not connected (-107)
  Apr 2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: pcmk_shutdown_worker: Shutdown complete
  Apr 2 11:41:41 juju-machine-4-lxc-4 pacemakerd[1033741]: notice: pcmk_shutdown_worker: Attempting to inhibit respawning after fatal error

  ProblemType: Bug
  DistroRelease: Ubuntu 14.04
  Package: pacemaker 1.1.10+git20130802-1ubuntu2.3
  ProcVersionSignature: User Name 3.16.0-33.44~14.04.1-generic 3.16.7-ckt7
  Uname: Linux 3.16.0-33-generic x86_64
  NonfreeKernelModules: vhost_net vhost macvtap macvlan xt_conntrack ipt_REJECT ip6table_filter ip6_tables ebtable_nat ebtables veth 8021q garp xt_CHECKSUM mrp iptable_mangle ipt_MASQUERADE iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables nbd ib_iser rdma_cm iw_cm ib_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi openvswitch gre vxlan dm_crypt bridge dm_multipath intel_rapl stp scsi_dh x86_pkg_temp_thermal llc intel_powerclamp coretemp ioatdma kvm_intel ipmi_si joydev sb_edac kvm hpwdt hpilo dca ipmi_msghandler acpi_power_meter edac_core lpc_ich shpchp serio_raw mac_hid xfs libcrc32c btrfs xor raid6_pq hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd psmouse tg3 ptp pata_acpi hpsa pps_core
  ApportVersion: 2.14.1-0ubuntu3.7
  Architecture: amd64
  Date: Thu Apr 2 11:42:18 2015
  SourcePackage: pacemaker
  UpgradeStatus: No upgrade log present (probably fresh install)

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/lxc/+bug/1439649/+subscriptions

_______________________________________________
Mailing list: https://launchpad.net/~ubuntu-ha
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~ubuntu-ha
More help   : https://help.launchpad.net/ListHelp

