23.03.2016 19:52, Vladislav Bogdanov wrote:
> 23.03.2016 19:39, Ken Gaillot wrote:
>> On 03/23/2016 07:35 AM, Vladislav Bogdanov wrote:
>>> Hi!
>>>
>>> It seems like atomic attrd in post-1.1.14 (eb89393) does not fully
>>> clean the node cache after a node is removed.
>>
>> Is this a regression? Or have you only tried it with this version?
>
> Only with this one.
>
>>> After our QA guys remove node wa-test-server-ha-03 from a two-node
>>> cluster:
>>> * stop pacemaker and corosync on wa-test-server-ha-03
>>> * remove node wa-test-server-ha-03 from the corosync nodelist on
>>>   wa-test-server-ha-04
>>> * tune the votequorum settings
>>> * reload corosync on wa-test-server-ha-04
>>> * remove the node from pacemaker on wa-test-server-ha-04
>>> * delete everything from /var/lib/pacemaker/cib on wa-test-server-ha-03
>>> and then join it with a different corosync ID (but the same node name),
>>> we see the following in the logs (a rough shell sketch of these steps
>>> appears at the end of this message):
>>>
>>> Leave node 1 (wa-test-server-ha-03):
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: crm_update_peer_proc: Node wa-test-server-ha-03[1] - state is now lost (was member)
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Removing all wa-test-server-ha-03 (1) attributes for attrd_peer_change_cb
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Lost attribute writer wa-test-server-ha-03
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Removing wa-test-server-ha-03/1 from the membership list
>>> Mar 23 04:19:53 wa-test-server-ha-04 attrd[25962]: notice: Purged 1 peers with id=1 and/or uname=wa-test-server-ha-03 from the membership cache
>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Processing peer-remove from wa-test-server-ha-04: wa-test-server-ha-03 0
>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Removing all wa-test-server-ha-03 (0) attributes for wa-test-server-ha-04
>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Removing wa-test-server-ha-03/1 from the membership list
>>> Mar 23 04:19:56 wa-test-server-ha-04 attrd[25962]: notice: Purged 1 peers with id=0 and/or uname=wa-test-server-ha-03 from the membership cache
>>>
>>> Join node 3 (the same one, wa-test-server-ha-03, but the ID differs):
>>> Mar 23 04:21:23 wa-test-server-ha-04 attrd[25962]: notice: crm_update_peer_proc: Node wa-test-server-ha-03[3] - state is now member (was (null))
>>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: warning: crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 - a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: warning: crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 - 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>>> Mar 23 04:21:26 wa-test-server-ha-04 attrd[25962]: crit: Node 1 and 3 share the same name 'wa-test-server-ha-03'
>>
>> It took me a while to understand the above combination of messages. This
>> is not node 3 joining. This is node 1 joining after node 3 has already
>> been seen.
>
> Hmmm...
> corosync.conf and corosync-cmapctl both say it is 3.
> Also, the CIB lists it as 3, and lrmd puts its status records under 3.
I mean:

<node_state id="3" uname="wa-test-server-ha-03" crmd="online"
    crm-debug-origin="do_update_resource" in_ccm="true" join="member"
    expected="member">
  <lrm id="3">
    <lrm_resources>
    ...
    </lrm_resources>
  </lrm>
</node_state>
<node_state id="1">
  <transient_attributes id="1">
    <instance_attributes id="status-1">
      <nvpair id="status-1-shutdown" name="shutdown" value="0"/>
      <nvpair id="status-1-master-rabbitmq-local" name="master-rabbitmq-local" value="1"/>
      <nvpair id="status-1-master-meta-0-0-drbd" name="master-meta-0-0-drbd" value="10000"/>
      <nvpair id="status-1-master-staging-0-0-drbd" name="master-staging-0-0-drbd" value="10000"/>
      <nvpair id="status-1-rabbit-start-time" name="rabbit-start-time" value="1458732136"/>
    </instance_attributes>
  </transient_attributes>
</node_state>

> Actually, the issue is that the drbd resources are not promoted,
> because their master attributes go to the section with node-id 1. That
> is the only reason we noticed this; everything not related to volatile
> attributes works well.
>
>> The warnings are a complete dump of the peer cache. So you can see that
>> wa-test-server-ha-03 is listed only once, with id 3.
>>
>> The critical message ("Node 1 and 3") lists the new id first and the
>> found ID second. So id 1 is what it's trying to add to the cache.
>
> But there is also "Node 'wa-test-server-ha-03' has changed its ID from
> 1 to 3" - it comes first. Does that matter?
>
>> Did you update the node ID in corosync.conf on *both* nodes?
>
> Sure.
> It is automatically copied to the node being joined.
>
>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: notice: Node 'wa-test-server-ha-03' has changed its ID from 1 to 3
>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: warning: crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 - a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: warning: crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 - 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>>> Mar 23 04:21:29 wa-test-server-ha-04 attrd[25962]: crit: Node 1 and 3 share the same name 'wa-test-server-ha-03'
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: notice: Node 'wa-test-server-ha-03' has changed its ID from 1 to 3
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: warning: crm_find_peer: Node 3/wa-test-server-ha-03 = 0x201bf30 - a4cbcdeb-c36a-4a0e-8ed6-c45b3db89296
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: warning: crm_find_peer: Node 2/wa-test-server-ha-04 = 0x1f90e20 - 6c18faa1-f8c2-4b0c-907c-20db450e2e79
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: crit: Node 1 and 3 share the same name 'wa-test-server-ha-03'
>>> Mar 23 04:21:31 wa-test-server-ha-04 attrd[25962]: notice: Node 'wa-test-server-ha-03' has changed its ID from 3 to 1
>>> ...
>>>
>>> On the node being joined:
>>> Mar 23 04:21:23 wa-test-server-ha-03 attrd[15260]: notice: Connecting to cluster infrastructure: corosync
>>> Mar 23 04:21:23 wa-test-server-ha-03 attrd[15260]: notice: crm_update_peer_proc: Node wa-test-server-ha-03[3] - state is now member (was (null))
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: notice: crm_update_peer_proc: Node wa-test-server-ha-04[2] - state is now member (was (null))
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: notice: Recorded attribute writer: wa-test-server-ha-04
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: notice: Processing sync-response from wa-test-server-ha-04
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: warning: crm_find_peer: Node 2/wa-test-server-ha-04 = 0xdfe620 - ad08ca96-295a-4fa4-99f9-8c8a2d0b6ac0
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: warning: crm_find_peer: Node 3/wa-test-server-ha-03 = 0xd7ae20 - f85bdc4b-a3ee-47ff-bdd5-7c1dcf9fe97c
>>> Mar 23 04:21:24 wa-test-server-ha-03 attrd[15260]: crit: Node 1 and 3 share the same name 'wa-test-server-ha-03'
>>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]: notice: Node 'wa-test-server-ha-03' has changed its ID from 1 to 3
>>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]: notice: Updating all attributes after cib_refresh_notify event
>>> Mar 23 04:21:26 wa-test-server-ha-03 attrd[15260]: notice: Updating all attributes after cib_refresh_notify event
>>>
>>> After that, the CIB status section contains entries for three nodes,
>>> with IDs 1, 2 and 3:
>>> For node 2 (the one that remained) there are both transient attributes
>>> and lrm statuses.
>>> For node 1 (the one that was removed) - only transient attributes.
>>> For node 3 (the newly joined one) - only lrm statuses.
>>>
>>> That makes me think that not everything is removed from attrd (stale
>>> caches?) when a node leaves.
>>>
>>> Is there some other information I can provide to help solve this issue?
>>>
>>> Best,
>>> Vladislav

_______________________________________________
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
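
For reference, the removal-and-rejoin procedure described in the report
could look roughly like the following shell sketch. This is an
illustration under stated assumptions (corosync 2.x managed by hand,
the node names from this thread, manual edits to corosync.conf), not
the exact commands the QA team ran:

    # On wa-test-server-ha-03 (the node being removed):
    systemctl stop pacemaker corosync

    # On wa-test-server-ha-04 (the surviving node):
    # remove wa-test-server-ha-03 from the nodelist in
    # /etc/corosync/corosync.conf and adjust the votequorum settings
    # for the new cluster size, then tell corosync to reload its config:
    corosync-cfgtool -R

    # remove the (stopped) node from pacemaker's configuration and caches:
    crm_node --remove wa-test-server-ha-03 --force

    # Back on wa-test-server-ha-03: wipe the local CIB copy,
    rm -f /var/lib/pacemaker/cib/*
    # re-add the node to corosync.conf on both nodes with its new
    # nodeid (3 instead of 1), then start the stack again:
    systemctl start corosync pacemaker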
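Until the underlying attrd cache problem is fixed, one possible way to
inspect the damage and clear the stale status entry by hand might be
the following (a sketch only, untested against this exact scenario;
the attribute and node names are taken from the thread):

    # Ask attrd where it thinks a promotion score lives on each node:
    attrd_updater --query --all --name master-meta-0-0-drbd

    # Delete the leftover node_state entry for the old node ID 1 from
    # the CIB status section:
    cibadmin --delete --xml-text \
        '<node_state id="1" uname="wa-test-server-ha-03"/>'

    # Confirm that only node IDs 2 and 3 remain:
    cibadmin --query --scope status | grep '<node_state'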