Hi all,

I am facing a strange issue with attrd while doing some testing on a
three-node cluster with the pgsqlms RA [1].

pgsqld is my pgsqlms resource in the cluster. pgsql-ha is the master/slave
setup on top of pgsqld.

Before triggering a failure, here was the situation:

  * centos1: pgsql-ha slave
  * centos2: pgsql-ha slave
  * centos3: pgsql-ha master

Then we triggered a failure: the node centos3 was killed using

  echo c > /proc/sysrq-trigger

In this situation, PEngine provides a transition where:

  * centos3 is fenced
  * pgsql-ha on centos2 is promoted

During the pre-promote notify action in the pgsqlms RA, each remaining slave
sets a node attribute called lsn_location, see:

  https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1504

  crm_attribute -l reboot -t status --node "$nodename" \
                --name lsn_location --update "$node_lsn"

During the promote action in the pgsqlms RA, the RA checks the lsn_location of
all the nodes to make sure the local one is higher than or equal to all the
others. See:

  https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1292

This is where we face an attrd behavior we don't understand.

Although the log shows that the RA was able to set its local "lsn_location"
during the pre-promote notification, the RA was unable to read its local
"lsn_location" during the promote action:

  pgsqlms(pgsqld)[9003]: 2016/04/22_14:46:16 INFO: pgsql_notify: promoting instance on node "centos2"
  pgsqlms(pgsqld)[9003]: 2016/04/22_14:46:16 INFO: pgsql_notify: current node LSN: 0/1EE24000
  [...]
  pgsqlms(pgsqld)[9023]: 2016/04/22_14:46:16 CRIT: pgsql_promote: can not get current node LSN location
  Apr 22 14:46:16 [5864] centos2 lrmd: notice: operation_finished: pgsqld_promote_0:9023:stderr [ Error performing operation: No such device or address ]
  Apr 22 14:46:16 [5864] centos2 lrmd: info: log_finished: finished - rsc:pgsqld action:promote call_id:211 pid:9023 exit-code:1 exec-time:107ms queue-time:0ms

The error comes from:

  https://github.com/dalibo/PAF/blob/master/script/pgsqlms#L1320

**After** this error, we can see in the log file that attrd sets the
"lsn_location" of centos2:

  Apr 22 14:46:16 [5865] centos2 attrd: info: attrd_peer_update: Setting lsn_location[centos2]: (null) -> 0/1EE24000 from centos2
  Apr 22 14:46:16 [5865] centos2 attrd: info: write_attribute: Write out of 'lsn_location' delayed: update 189 in progress

As I understand it, the crm_attribute call made during the pre-promote
notification was only taken into account AFTER the "promote" action, leading
to this error. Am I right?

Why and how could this happen? Could it come from the dampen parameter? We did
not set any dampen anywhere; is there a default value in the cluster setup?
Could we avoid this behavior?

Please find attached a tarball with:

  * all cluster logfiles from the three nodes
  * the content of /var/lib/pacemaker from the three nodes:
    * CIBs
    * PEngine transitions

Regards,

[1] https://github.com/dalibo/PAF
-- 
Jehan-Guillaume de Rorthais
Dalibo