First thanks for your answer! On 3/7/18 11:16 AM, Jan Friesse wrote: > Thomas, > > >> Hi, >> >> first some background info for my questions I'm going to ask: >> We use corosync as a basis for our distributed realtime configuration >> file system (pmxcfs)[1]. > > nice > >> >> We got some reports of a completely hanging FS with the only >> correlations being high load, often IO, and most times a message that >> corosync did not got scheduled for longer than the token timeout. >> >> See this example from a three node cluster, first: >> >>> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [MAIN ] Corosync >>> main process was not scheduled for 3767.3159 ms (threshold is 1320.0000 >>> ms). Consider token timeout increase. >> >> then we get a few logs that JOIN or LEAVE messages were thrown away >> (understandable for this event): >> >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [TOTEM ] JOIN or >> LEAVE message was thrown away during flush operation. >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [MAIN ] Corosync main >> process was not scheduled for 3767.3159 ms (threshold is 1320.0000 ms). >> Consider token timeout increase. >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE >> message was thrown away during flush operation. >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE >> message was thrown away during flush operation. >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE >> message was thrown away during flush operation. >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE >> message was thrown away during flush operation. >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE >> message was thrown away during flush operation. >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice [TOTEM ] A new >> membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3 >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice [TOTEM ] Failed to >> receive the leave message. failed: 2 3 >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] A new membership >> (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3 >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] Failed to receive >> the leave message. failed: 2 3 >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice [QUORUM] >> Members[3]: 1 2 3 >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice [MAIN ] Completed >> service synchronization, ready to provide service. >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [QUORUM] Members[3]: 1 2 3 >> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [MAIN ] Completed service >> synchronization, ready to provide service. >> >> Until recently we stepped really in the dark and had everything from >> Kernel bugs to our filesystem logic as possible cause in mind... But >> then we had the luck to trigger this in our test systems and went to >> town with gdb on the core dump, finding that we can trigger this by >> pausing the leader (from our FS POV) for a short moment (may be shorter >> than the token timeout), so that a new leader get elected, and then >> resuming our leader node VM again. >> >> The problem I saw was that while the leader had a log entry which >> proved that he noticed his blackout: >>> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: >>> 2 3 left: 2 3 > > I know it looks weird but it's perfectly fine and expected. >
It seemed OK, from this nodes POV, just the missing config change CB was a bit odd to us. >> >> our FS cpg_confchg_fn callback[2] was never called, thus it thought it > > That shouldn't happen > So we really should get a config change CB on the paused node after unpausing, with all other (online) nodes in both leave and join member list? Just asking again to confirm my thinking and that I did not misunderstood you. :) >> was still in sync and nothing ever happened, until another member >> triggered this callback, by either leaving or (re-)joining. >> >> Looking in the cpg.c code I saw that there's another callback, namely >> cpg_totem_confchg_fn, which seemed a bit odd as wew did not set that > > This callback is not necessary to have as long as information about cpg group > is enough. cpg_totem_confchg_fn contains information about all processors > (nodes). > OK, make sense. >> one... (I ain't the original author of the FS and it predates at least >> to 2010, so maybe cpg_initialize was not yet deprecated there, and >> thus model_initialize was not used then) > >> >> I switched over to using cpg_model_initialize and set the totem_confchg >> callback, but for the "blacked out node" it gets called twice after the >> event, but always shows all members... >> >> So to finally get to my questions: >> >> * why doesn't get the cpg_confchg_fn CB called when a node has a short >> blackout (i.e., corosync not being scheduled for a bit of time)? >> having all other nodes in it's leave and join list, as the log >> would suggests ("Members joined: 2 3 left: 2 3") > > I believe it was called but not when corosync was paused. > But I never see my registered callback called after unpausing and also when printing out the dispatch_data->id inside cpg_dispatch I do not get one with "MESSAGE_RES_CPG_CONFCHG_CALLBACK" after unpausing only a "MESSAGE_RES_CPG_TOTEM_CONFCHG_CALLBACK" >> >> * If that doesn't seems like a good idea, what can we use to really >> detect such a node blackout? > > It's not possible to detect from the affected node, but it must be detected > from other nodes. > You mean not directly, as in: we cannot differ between a legitimate fall out and such a pause, caused artificially or by real high IO. That makes sense. >> >> As a work around I added logic for when through a config change a node >> with a lower ID joined. The node which was leader until then triggers >> a CPG leave enforcing a cluster wide config change event to happen, >> which this time also the blacked out node gets and syncs then again >> This works, but does not feels really nice, IMO... > > Ok let me explain what exactly happened in your test and simplify it to two > nodes: > > - Node A and B are running corosync and same cpg application > - Node A is paused for time > token_timeout > - Node B detects that node A is not responding > - Creates new membership (with only node B) > - Sends notification to cpg app about Node A leave > - Node A is still paused so it cannot send notification into cpg app > - Node A is unpaused > - Pause is detected and start forming new membership > - From node A point of view, node B left - simply because node never > considers itself as left > - Node A is able to contact node B so final membership is A, B. > - Node A can finally deliver two cpg conf changes into cpg application. One > about left of node B and second one about join of node B. > Exactly the last point is missing here, and I'd also expect it to happen... I only get the config change on Node B for the Node A join, but Node A logs (as stated)... :/ > Now it's really cpg application problem to synchronize its data. Many > applications (usually FS) are using quorum together with fencing to find out, > which cluster partition is quorate and clean inquorate one. > > Hopefully my explanation help you and feel free to ask more questions! > They help, but I'm still a bit unsure about why the CB could not happen here, may need to dive a bit deeper into corosync :) > Regards, > Honza > >> >> help would be appreciated, much thanks! >> >> cheers, >> Thomas >> >> [1]: >> https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD >> [2]: >> https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096 >> _______________________________________________ Users mailing list: Users@clusterlabs.org https://lists.clusterlabs.org/mailman/listinfo/users Project Home: http://www.clusterlabs.org Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf Bugs: http://bugs.clusterlabs.org