"Users" <[email protected]> schrieb am 18.07.2018 15:46:09:
> From: Jan Pokorný <[email protected]>
> To: [email protected]
> Date: 18.07.2018 15:46
> Subject: Re: [ClusterLabs] corosync/dlm fencing?
> Sent by: "Users" <[email protected]>
>
> Just minor clarifications (without changing the validity) below:
>
> On 17/07/18 21:28 +0200, Jan Pokorný wrote:
> > On 16/07/18 11:44 +0200, Philipp Achmüller wrote:
> >> Unfortunately it is not obvious for me - the "grep fence" is attached
> >> in my original message.
> >
> > Sifting your logs a bit:
> >
> >> -------------------
> >> Node: siteb-2 (DC):
> >> 2018-06-28T09:02:23.282153+02:00 siteb-2 pengine[189259]: notice: Move stonith-sbd#011(Started sitea-1 -> siteb-1)
> >> [...]
> >> 2018-06-28T09:02:23.284575+02:00 siteb-2 crmd[189260]: notice: Initiating stop operation stonith-sbd_stop_0 on sitea-1
> >> [...]
> >> 2018-06-28T09:02:23.288254+02:00 siteb-2 crmd[189260]: notice: Initiating start operation stonith-sbd_start_0 on siteb-1
> >> [...]
> >> 2018-06-28T09:02:38.414440+02:00 siteb-2 corosync[189245]: [TOTEM ] A processor failed, forming new configuration.
> >> 2018-06-28T09:02:52.080141+02:00 siteb-2 corosync[189245]: [TOTEM ] A new membership (192.168.121.55:2012) was formed. Members left: 2
> >> 2018-06-28T09:02:52.080537+02:00 siteb-2 corosync[189245]: [TOTEM ] Failed to receive the leave message. failed: 2
> >> 2018-06-28T09:02:52.083415+02:00 siteb-2 attrd[189258]: notice: Node siteb-1 state is now lost
> >> [...]
> >> 2018-06-28T09:02:52.084054+02:00 siteb-2 crmd[189260]: warning: No reason to expect node 2 to be down
> >> [...]
> >> 2018-06-28T09:02:52.084409+02:00 siteb-2 corosync[189245]: [QUORUM] Members[3]: 1 3 4
> >> 2018-06-28T09:02:52.084492+02:00 siteb-2 corosync[189245]: [MAIN  ] Completed service synchronization, ready to provide service.
> >> [...]
> >> 2018-06-28T09:02:52.085210+02:00 siteb-2 kernel: [80872.012486] dlm: closing connection to node 2
> >> [...]
> >> 2018-06-28T09:02:53.098683+02:00 siteb-2 pengine[189259]: warning: Scheduling Node siteb-1 for STONITH
> >
> >> -------------------
> >> Node sitea-1:
> >> 2018-06-28T09:02:38.413748+02:00 sitea-1 corosync[6661]: [TOTEM ] A processor failed, forming new configuration.
> >> 2018-06-28T09:02:52.079905+02:00 sitea-1 corosync[6661]: [TOTEM ] A new membership (192.168.121.55:2012) was formed. Members left: 2
> >> 2018-06-28T09:02:52.080306+02:00 sitea-1 corosync[6661]: [TOTEM ] Failed to receive the leave message. failed: 2
> >> 2018-06-28T09:02:52.082619+02:00 sitea-1 cib[9021]: notice: Node siteb-1 state is now lost
> >> [...]
> >> 2018-06-28T09:02:52.083429+02:00 sitea-1 corosync[6661]: [QUORUM] Members[3]: 1 3 4
> >> 2018-06-28T09:02:52.083521+02:00 sitea-1 corosync[6661]: [MAIN  ] Completed service synchronization, ready to provide service.
> >> 2018-06-28T09:02:52.083606+02:00 sitea-1 crmd[9031]: notice: Node siteb-1 state is now lost
> >> 2018-06-28T09:02:52.084290+02:00 sitea-1 dlm_controld[73416]: 59514 fence request 2 pid 171087 nodedown time 1530169372 fence_all dlm_stonith
> >> 2018-06-28T09:02:52.085446+02:00 sitea-1 kernel: [59508.568940] dlm: closing connection to node 2
> >> 2018-06-28T09:02:52.109393+02:00 sitea-1 dlm_stonith: stonith_api_time: Found 0 entries for 2/(null): 0 in progress, 0 completed
> >> 2018-06-28T09:02:52.110167+02:00 sitea-1 stonith-ng[9022]: notice: Client stonith-api.171087.d3c59fc2 wants to fence (reboot) '2' with device '(any)'
> >> 2018-06-28T09:02:52.113257+02:00 sitea-1 stonith-ng[9022]: notice: Requesting peer fencing (reboot) of siteb-1
> >> 2018-06-28T09:03:29.096714+02:00 sitea-1 stonith-ng[9022]: notice: Operation reboot of siteb-1 by sitea-2 for [email protected]: OK
> >> 2018-06-28T09:03:29.097152+02:00 sitea-1 stonith-api[171087]: stonith_api_kick: Node 2/(null) kicked: reboot
> >> 2018-06-28T09:03:29.097426+02:00 sitea-1 crmd[9031]: notice: Peer lnx0361b was terminated (reboot) by sitea-2 on behalf of stonith-api.171087: OK
> >> 2018-06-28T09:03:30.098657+02:00 sitea-1 dlm_controld[73416]: 59552 fence result 2 pid 171087 result 0 exit status
> >> 2018-06-28T09:03:30.099730+02:00 sitea-1 dlm_controld[73416]: 59552 fence status 2 receive 0 from 1 walltime 1530169410 local 59552
> >
> >> -------------------
> >> Node sitea-2:
> >> 2018-06-28T09:02:38.412808+02:00 sitea-2 corosync[6570]: [TOTEM ] A processor failed, forming new configuration.
> >> 2018-06-28T09:02:52.078249+02:00 sitea-2 corosync[6570]: [TOTEM ] A new membership (192.168.121.55:2012) was formed. Members left: 2
> >> 2018-06-28T09:02:52.078359+02:00 sitea-2 corosync[6570]: [TOTEM ] Failed to receive the leave message. failed: 2
> >> 2018-06-28T09:02:52.081949+02:00 sitea-2 cib[9655]: notice: Node siteb-1 state is now lost
> >> [...]
> >> 2018-06-28T09:02:52.082653+02:00 sitea-2 corosync[6570]: [QUORUM] Members[3]: 1 3 4
> >> 2018-06-28T09:02:52.082739+02:00 sitea-2 corosync[6570]: [MAIN  ] Completed service synchronization, ready to provide service.
> >> [...]
> >> 2018-06-28T09:02:52.495697+02:00 sitea-2 stonith-ng[9656]: notice: stonith-sbd can fence (reboot) siteb-1: dynamic-list
> >> 2018-06-28T09:02:52.495902+02:00 sitea-2 stonith-ng[9656]: notice: Delaying reboot on stonith-sbd for 25358ms (timeout=300s)
> >> 2018-06-28T09:03:29.093957+02:00 sitea-2 stonith-ng[9656]: notice: Operation 'reboot' [231293] (call 2 from stonith-api.171087) for host 'siteb-1' with device 'stonith-sbd' returned: 0 (OK)
> >> 2018-06-28T09:03:29.096254+02:00 sitea-2 stonith-ng[9656]: notice: Operation reboot of siteb-1 by sitea-2 for [email protected]: OK
> >> 2018-06-28T09:03:29.096769+02:00 sitea-2 crmd[9660]: notice: Peer siteb-1 was terminated (reboot) by sitea-2 on behalf of stonith-api.171087: OK
> >
> >> -------------------
> >> Node siteb-1 has no entries during this timeframe
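A side note on the sitea-1 excerpt: the dlm_stonith helper that dlm_controld spawns is essentially a thin bridge to pacemaker's fencing API - the stonith_api_time and stonith_api_kick calls are visible verbatim in the log above. Below is a minimal sketch of that flow, assuming pacemaker's public <crm/stonith-ng.h> header is available (on many distros via something like pkg-config pacemaker-fencing); the node id and nodedown timestamp are taken from the log, while the 120s timeout is only an illustrative value:

    /* Sketch of the dlm_stonith flow seen in the sitea-1 log.
     * Assumes pacemaker's fencing client library (<crm/stonith-ng.h>). */
    #include <stdio.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include <time.h>
    #include <crm/stonith-ng.h>   /* stonith_api_time(), stonith_api_kick() */

    int main(void)
    {
        uint32_t nodeid = 2;            /* corosync node id of siteb-1 */
        time_t nodedown = 1530169372;   /* "nodedown time" from dlm_controld */

        /* Has this node already been fenced since it went down?
         * "Found 0 entries for 2/(null)" in the log is this lookup. */
        time_t fenced = stonith_api_time(nodeid, NULL, false);
        if (fenced > nodedown) {
            printf("node %u already fenced at %ld, nothing to do\n",
                   nodeid, (long) fenced);
            return 0;
        }

        /* Not fenced yet: request a reboot; in the log this request
         * shows up as "wants to fence (reboot) '2' with device '(any)'". */
        int rc = stonith_api_kick(nodeid, NULL, 120 /* timeout, s; illustrative */,
                                  false /* reboot, not off */);
        printf("fence request for node %u returned %d\n", nodeid, rc);
        return (rc == 0) ? 0 : 1;
    }

(dlm's actual helper does more bookkeeping; this only mirrors the two calls that appear in the log.)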
> >> during standby corosync should be up/running - so maybe the "Failed
> >> to receive the leave message" is the problem?
> >
> > True, it should normally stay running, and that's actually the problem
> > here; you just seem to be confusing the consequence with the cause:
> >
> > - first: the corosync membership got broken on the siteb-1 node
> >   (hard to tell why; the only thing we can observe is that
> >   sbd monitoring was moved there from sitea-1)
> >
> > - then: dlm_controld on sitea-1, which had already observed that
> >   membership issue of siteb-1, proceeded to ask pacemaker to fence
> >   siteb-1, which was cheerfully executed by the sitea-2 node
> >
> > I cannot really tell what the root cause was here, but observing a
> > PID as high as 171087, I wonder if you have a recent enough libqb
> > (0.17.2):
>
> (if it's unclear: 0.17.2 is the lowest version that has the fix)

The following version is currently installed with SP3: libqb0-1.0.1-2.15.x86_64

> > https://bugzilla.redhat.com/show_bug.cgi?id=1114852
> >
> > (and would perhaps fit: new sbd process/es with high enough PIDs
>
> actually, crm_attribute processes arising from stonith:external/sbd
>
> (fun fact: the fence_sbd agent from the fence-agents project, using a
> slightly different API for fencing, doesn't need to resort to relying
> on the stonith-timeout cluster property, so it doesn't involve such
> "calling home")
>
> > exercising libqb-backed IPC -> something bad happening like
> > corosync getting stuck or crashing -> node kicked out...)
>
> --
> Nazdar,
> Jan (Poki)

> [Attachment "attlmf24.dat" deleted by Philipp Achmüller/ARZ/AT]
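On the PID observation: a PID like 171087 already implies kernel.pid_max was raised above the usual 32768 default, which appears to be the regime Jan's libqb reference is about; and since libqb0-1.0.1-2.15 is well past 0.17.2, that particular bug should already be fixed here. For completeness, a small standalone check of whether a node can even hand out such PIDs (the 65536 threshold below is an assumption for illustration, not a value taken from the bug report):

    /* Check whether this kernel allows PIDs in the high range that
     * Jan's libqb (< 0.17.2) hypothesis is concerned with. */
    #include <stdio.h>

    int main(void)
    {
        FILE *f = fopen("/proc/sys/kernel/pid_max", "r");
        long pid_max = 0;

        if (f == NULL || fscanf(f, "%ld", &pid_max) != 1) {
            perror("reading /proc/sys/kernel/pid_max");
            return 1;
        }
        fclose(f);

        printf("kernel.pid_max = %ld\n", pid_max);
        if (pid_max > 65536L) {   /* illustrative threshold, see lead-in */
            printf("PIDs like 171087 are reachable; verify libqb >= 0.17.2\n");
        }
        return 0;
    }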
_______________________________________________
Users mailing list: [email protected]
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
