Re: [ClusterLabs] corosync 2.4 CPG config change callback
Hi Thomas,

> Hi,
>
> Am 04/25/2018 um 09:57 AM schrieb Jan Friesse:
>> [...]
>
> Just wanted to give some quick feedback.
>
> We deployed this to your community repository about a week ago (after
> another week of successful testing); we have had no negative feedback
> or issues reported or seen yet, with (as a strong lower bound) more
> than 10k systems running the fix by now.

Thanks, that's exciting news.

> I saw just now that you merged it into needle and master, so, while a
> bit late, this just backs up the confidence in the fix.

Definitely not late until it's released :)

> Much thanks for your, and the reviewers', work!

Yep, you are welcome.

Honza

___
Users mailing list: Users@clusterlabs.org
https://lists.clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Hi,

Am 04/25/2018 um 09:57 AM schrieb Jan Friesse:
>> I did not look deeply into how your revert plays out with the
>> mentioned commits of the heuristics approach, but this fix would mean
>> bringing corosync back to a state it already had, and which was thus
>> already battle-tested?
>
> Yep, but not fully. The important change was to use joinlists as the
> authoritative source of information about other nodes' clients, so I
> believe that solved the problems which should have been "solved" by
> the downlist heuristics.
>
>> Patch and approach seem good to me, with my limited knowledge, when
>> looking at the various "bandaid" fix commits you mentioned.
>>
>>> Patch is in PR (needle): https://github.com/corosync/corosync/pull/347
>>
>> Much thanks! First tests work well here. I could not yet reproduce the
>> problem with the patch applied in both testcpg and our cluster
>> configuration file system.
>
> That's good to hear :)
>
>> I'll let it run
>
> Perfect.

Just wanted to give some quick feedback.

We deployed this to your community repository about a week ago (after
another week of successful testing); we have had no negative feedback or
issues reported or seen yet, with (as a strong lower bound) more than
10k systems running the fix by now.

I saw just now that you merged it into needle and master, so, while a
bit late, this just backs up the confidence in the fix.

Much thanks for your, and the reviewers', work!

cheers,
Thomas
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Thomas Lamprecht napsal(a):
> Honza,
>
> On 4/24/18 6:38 PM, Jan Friesse wrote:
>> [...]
>
> I did not look deeply into how your revert plays out with the mentioned
> commits of the heuristics approach, but this fix would mean bringing
> corosync back to a state it already had, and which was thus already
> battle-tested?

Yep, but not fully. The important change was to use joinlists as the
authoritative source of information about other nodes' clients, so I
believe that solved the problems which should have been "solved" by the
downlist heuristics.

> Patch and approach seem good to me, with my limited knowledge, when
> looking at the various "bandaid" fix commits you mentioned.
>
>> Patch is in PR (needle): https://github.com/corosync/corosync/pull/347
>
> Much thanks! First tests work well here. I could not yet reproduce the
> problem with the patch applied in both testcpg and our cluster
> configuration file system.

That's good to hear :)

> I'll let it run

Perfect.

Regards,
Honza
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Honza,

On 4/24/18 6:38 PM, Jan Friesse wrote:
>> On 4/6/18 10:59 AM, Jan Friesse wrote:
>>> [...]
>>
>> I guess the different problems were the ones related to the issued
>> CVEs :)
>
> Yep.
>
> Also I've spent quite a lot of time thinking about the best possible
> solution. CPG is quite old, it was full of weird bugs, and the risk of
> breakage is very high.
>
> Anyway, I've decided not to try to hack what is apparently broken and
> just go for the risky but proper solution (= needs a LOT more testing,
> but so far looks good).

I did not look deeply into how your revert plays out with the mentioned
commits of the heuristics approach, but this fix would mean bringing
corosync back to a state it already had, and which was thus already
battle-tested?

Patch and approach seem good to me, with my limited knowledge, when
looking at the various "bandaid" fix commits you mentioned.

> Patch is in PR (needle): https://github.com/corosync/corosync/pull/347

Much thanks! First tests work well here. I could not yet reproduce the
problem with the patch applied in both testcpg and our cluster
configuration file system.

I'll let it run.

cheers,
Thomas
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Thomas,

> Hi Honza
>
> On 4/6/18 10:59 AM, Jan Friesse wrote:
>> [...]
>>
>> Ya, kind of. Sadly I had to work on a different problem, but I'm
>> expecting to send a patch next week.
>
> I guess the different problems were the ones related to the issued
> CVEs :)

Yep.

Also I've spent quite a lot of time thinking about the best possible
solution. CPG is quite old, it was full of weird bugs, and the risk of
breakage is very high.

Anyway, I've decided not to try to hack what is apparently broken and
just go for the risky but proper solution (= needs a LOT more testing,
but so far looks good).

Patch is in PR (needle): https://github.com/corosync/corosync/pull/347

>> Testing would be welcomed.
>
> Have you anything we could test already? Our freeze for our next
> release is in sight, so it would be really great if we had an
> upstream-accepted resolution for this issue by then.

Regards,
Honza
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Hi Honza

On 4/6/18 10:59 AM, Jan Friesse wrote:
> Thomas Lamprecht napsal(a):
>> Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:
>>> I've tested it too and yes, you are 100% right. The bug is there and
>>> it's pretty easy to reproduce when the node with the lowest nodeid is
>>> paused. It's slightly harder when a node with a higher nodeid is
>>> paused.
>>
>> Were you able to make some progress on this issue?
>
> Ya, kind of. Sadly I had to work on a different problem, but I'm
> expecting to send a patch next week.

I guess the different problems were the ones related to the issued CVEs :)

>> We'd really like a fix for this, so if there's anything I can do to
>> help just hit me up. :)
>
> Testing would be welcomed.

Have you anything we could test already? Our freeze for our next release
is in sight, so it would be really great if we had an upstream-accepted
resolution for this issue by then.

cheers,
Thomas
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Hi Thomas,

Thomas Lamprecht napsal(a):
> Hi Honza,
>
> Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:
>> Thomas,
>>
>>> TotemConfchgCallback: ringid (1.1436)
>>> active processors 3: 1 2 3
>>> EXIT
>>> Finalize result is 1 (should be 1)
>>>
>>> Hope I did both tests right, but as it reproduces multiple times with
>>> testcpg and with our cpg usage in our filesystem, this seems validly
>>> tested, not just a single occurrence.
>>
>> I've tested it too and yes, you are 100% right. The bug is there and
>> it's pretty easy to reproduce when the node with the lowest nodeid is
>> paused. It's slightly harder when a node with a higher nodeid is
>> paused.
>
> Were you able to make some progress on this issue?

Ya, kind of. Sadly I had to work on a different problem, but I'm
expecting to send a patch next week.

> We'd really like a fix for this, so if there's anything I can do to
> help just hit me up. :)

Testing would be welcomed.

Honza

> Else, I have a (little hacky) workaround here (cpg client side); if you
> think the issue isn't easy to address anytime soon, I'd polish that
> patch up and we could use it while waiting for the real fix.
>
> cheers,
> Thomas
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Hi Honza,

Am 03/09/2018 um 05:26 PM schrieb Jan Friesse:
> Thomas,
>
>> TotemConfchgCallback: ringid (1.1436)
>> active processors 3: 1 2 3
>> EXIT
>> Finalize result is 1 (should be 1)
>>
>> Hope I did both tests right, but as it reproduces multiple times with
>> testcpg and with our cpg usage in our filesystem, this seems validly
>> tested, not just a single occurrence.
>
> I've tested it too and yes, you are 100% right. The bug is there and
> it's pretty easy to reproduce when the node with the lowest nodeid is
> paused. It's slightly harder when a node with a higher nodeid is
> paused.

Were you able to make some progress on this issue?

We'd really like a fix for this, so if there's anything I can do to help
just hit me up. :)

Else, I have a (little hacky) workaround here (cpg client side); if you
think the issue isn't easy to address anytime soon, I'd polish that
patch up and we could use it while waiting for the real fix.

cheers,
Thomas
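The client-side workaround itself is not shown in the thread, so as an illustration only, here is one hypothetical shape such a "hacky" detection could take (plain C with a stand-in `ring_id` type instead of corosync's real `struct cpg_ring_id`): watch the totem ring id, and flag any ring change for which no cpg membership callback was delivered in between.

```c
/* Hypothetical sketch of a client-side workaround, NOT the actual pmxcfs
 * patch: remember the last totem ring id seen; if it changes while the
 * cpg membership callback stayed silent, assume we may have missed a
 * leave/join cycle (the "paused node" case) and force a full resync. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct ring_id { uint32_t nodeid; uint64_t seq; }; /* stand-in type */

struct confchg_tracker {
    struct ring_id last_ring;
    bool have_ring;
    unsigned confchg_seen; /* cpg confchg callbacks since last ring change */
};

/* Call from the totem confchg callback. Returns true when the ring
 * changed without any cpg confchg in between, i.e. the application
 * should suspect a missed membership change and resynchronize. */
bool on_totem_confchg(struct confchg_tracker *t, struct ring_id ring)
{
    bool suspicious = t->have_ring &&
        (ring.seq != t->last_ring.seq || ring.nodeid != t->last_ring.nodeid) &&
        t->confchg_seen == 0;

    t->last_ring = ring;
    t->have_ring = true;
    t->confchg_seen = 0;
    return suspicious;
}

/* Call from the cpg confchg callback. */
void on_cpg_confchg(struct confchg_tracker *t)
{
    t->confchg_seen++;
}
```

Note this is only a heuristic: a ring change that legitimately involves no change to this particular cpg group would also trip it, which is presumably part of why the workaround is called "a little hacky".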
Re: [ClusterLabs] corosync 2.4 CPG config change callback
On Fri, 2018-03-09 at 17:26 +0100, Jan Friesse wrote:
> Thomas,
>
>> [...]
>
> I've tested it too and yes, you are 100% right. The bug is there and
> it's pretty easy to reproduce when the node with the lowest nodeid is
> paused. It's slightly harder when a node with a higher nodeid is
> paused.
>
> [...]
>
> Let's start with a very high level view of possible solutions:
> - "Ignore the problem". CPG behaves more or less correctly. The
>   "current" membership really didn't change, so it doesn't make too
>   much sense to inform about a change. It's possible to use
>   cpg_totem_confchg_fn_t to find out when the ringid changes. I'm
>   adding this solution just for completeness, because I don't prefer
>   it at all.
> - cpg_confchg_fn_t adds all left and re-joined nodes into the
>   left/join lists.
> - cpg will send extra cpg_confchg_fn_t calls about the left and joined
>   nodes. I would prefer this solution simply because it makes cpg
>   behavior equal in all situations.
>
> Which of the options would you prefer? Same question also for @Ken
> (-> what would you prefer for PCMK) and @Chrissie.

Pacemaker should react essentially the same whichever of the last two
options is used. There could be differences due to timing (the second
solution might allow some work to be done between when the left and
join messages are received), but I think it should behave reasonably
with either approach.

Interestingly, there is some old code in Pacemaker for handling a node
that left and rejoined while "the cluster layer didn't notice"; that may
have been a workaround for this case.

--
Ken Gaillot
Re: [ClusterLabs] corosync 2.4 CPG config change callback
On 09/03/18 16:26, Jan Friesse wrote:
> [...]
>
> - cpg will send extra cpg_confchg_fn_t calls about the left and joined
>   nodes. I would prefer this solution simply because it makes cpg
>   behavior equal in all situations.
>
> Which of the options would you prefer? Same question also for @Ken
> (-> what would you prefer for PCMK) and @Chrissie.

The last option makes the most sense to me too - it's more consistent
and 'what you would expect', I think.

Chrissie
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Hi,

On 3/9/18 5:26 PM, Jan Friesse wrote:
> [...]
>
> I've tested it too and yes, you are 100% right. The bug is there and
> it's pretty easy to reproduce when the node with the lowest nodeid is
> paused. It's slightly harder when a node with a higher nodeid is
> paused.

Good, so we're not crazy :)

> Most of the clusters are using power fencing, so they simply never see
> this problem. That may also be the reason why it wasn't reported a
> long time ago (this bug has existed virtually at least since OpenAIS
> Whitetank). So really nice work finding this bug.

Hmm, but even short pauses (1 to 2 seconds) cause this, so fencing would
not get active there yet. We had a theory that environment changes let
this bug trigger more often, i.e. scheduler or IO subsystem changes in
the kernel, for example, as we saw a significant rise in reports in
recent years. (We have grown in users too, but the increase does not
feel like mere correlation.)

> What I'm not entirely sure about is the best way to solve this
> problem. What I'm sure of is that it's going to be "fun" :(
>
> Let's start with a very high level view of possible solutions:
> - "Ignore the problem". CPG behaves more or less correctly. The
>   "current" membership really didn't change, so it doesn't make too
>   much sense to inform about a change. It's possible to use
>   cpg_totem_confchg_fn_t to find out when the ringid changes. I'm
>   adding this solution just for completeness, because I don't prefer
>   it at all.

Same here. I mean, we could work around this, but it does not really
feel right. And our code is designed with the assumption that we get a
membership callback; changing that assumption seems like a bit of a
headache, as we would need to verify that no side effects get
introduced by the workaround and that everything can cope with it.
Doable, but also not too much fun :)

> - cpg_confchg_fn_t adds all left and re-joined nodes into the
>   left/join lists.

Would work for us.

> - cpg will send extra cpg_confchg_fn_t calls about the left and joined
>   nodes. I would prefer this solution simply because it makes cpg
>   behavior equal in all situations.

So the behaviour you assumed it should have? Getting two callbacks: one
saying that all the others left, and then one where all the others
joined in the new membership? This sounds like the best approach to me,
as it really tells the CPG application what happened in the way all
other members see it. But I'm not a corosync guru :)

> Which of the options would you prefer? Same question also for @Ken
> (-> what would you prefer for PCMK) and @Chrissie.

The last approach.

cheers and much thanks for your help!
Thomas
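Whichever of the last two options corosync ends up implementing, an application-side handler can be written so both look the same (a hypothetical sketch in plain C, not pmxcfs code): apply the left list before the joined list, and mark every joiner for a state resync even when the same node also appears in the left list of the very same callback.

```c
/* Hypothetical confchg-handler core that is robust to both proposed
 * semantics: (a) one combined callback where a rejoining node appears
 * in both the left and joined lists, and (b) two callbacks (all-left,
 * then all-joined). Leaves are applied first, joins second. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_NODES 32 /* no bounds checking in this sketch */

struct member_state {
    uint32_t nodeid;
    int online;
    int needs_resync; /* set whenever a node (re)joins */
};

struct group_state {
    struct member_state members[MAX_NODES];
    size_t count;
};

static struct member_state *find_or_add(struct group_state *g, uint32_t nodeid)
{
    for (size_t i = 0; i < g->count; i++)
        if (g->members[i].nodeid == nodeid)
            return &g->members[i];
    g->members[g->count] = (struct member_state){ .nodeid = nodeid };
    return &g->members[g->count++];
}

void apply_confchg(struct group_state *g,
                   const uint32_t *left, size_t n_left,
                   const uint32_t *joined, size_t n_joined)
{
    /* 1. Leaves first: a node that left is offline and its state stale. */
    for (size_t i = 0; i < n_left; i++)
        find_or_add(g, left[i])->online = 0;

    /* 2. Joins second: every (re)joining node needs a resync, even if
     * it also appeared in the left list of this very callback. */
    for (size_t i = 0; i < n_joined; i++) {
        struct member_state *m = find_or_add(g, joined[i]);
        m->online = 1;
        m->needs_resync = 1;
    }
}
```

Processing the lists in this order means the combined-lists option and the two-callbacks option converge to the same end state, so the application does not have to care which one the fix delivers.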
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Thomas,

> Hi,
>
> On 3/7/18 1:41 PM, Jan Friesse wrote:
>> [...]
>
> TotemConfchgCallback: ringid (1.1436)
> active processors 3: 1 2 3
> EXIT
> Finalize result is 1 (should be 1)
>
> Hope I did both tests right, but as it reproduces multiple times with
> testcpg and with our cpg usage in our filesystem, this seems validly
> tested, not just a single occurrence.

I've tested it too and yes, you are 100% right. The bug is there and
it's pretty easy to reproduce when the node with the lowest nodeid is
paused. It's slightly harder when a node with a higher nodeid is paused.

Most of the clusters are using power fencing, so they simply never see
this problem. That may also be the reason why it wasn't reported a long
time ago (this bug has existed virtually at least since OpenAIS
Whitetank). So really nice work finding this bug.

What I'm not entirely sure about is the best way to solve this problem.
What I'm sure of is that it's going to be "fun" :(

Let's start with a very high level view of possible solutions:
- "Ignore the problem". CPG behaves more or less correctly. The
  "current" membership really didn't change, so it doesn't make too much
  sense to inform about a change. It's possible to use
  cpg_totem_confchg_fn_t to find out when the ringid changes. I'm adding
  this solution just for completeness, because I don't prefer it at all.
- cpg_confchg_fn_t adds all left and re-joined nodes into the left/join
  lists.
- cpg will send extra cpg_confchg_fn_t calls about the left and joined
  nodes. I would prefer this solution simply because it makes cpg
  behavior equal in all situations.

Which of the options would you prefer? Same question also for @Ken
(-> what would you prefer for PCMK) and @Chrissie.

Regards,
Honza

>> Now it's really a cpg application problem to synchronize its data.
>> Many applications (usually FSes) are using quorum together with
>> fencing to find out which cluster partition is quorate, and to clean
>> up the inquorate one.
>>
>> Hopefully my explanation helps you; feel free to ask more questions!
>
> They help, but I'm still a bit unsure about why the CB could not happen
> here; I may need to dive a bit deeper into corosync :)
>
> help would be appreciated, much thanks!
>
> cheers,
> Thomas
>
> [1]: https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
> [2]: https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Hi,

On 3/7/18 1:41 PM, Jan Friesse wrote:
> Thomas,
>
>> First thanks for your answer!
>>
>> On 3/7/18 11:16 AM, Jan Friesse wrote:
>>> [...]
>>
>>>> The problem I saw was that while the leader had a log entry which
>>>> proved that it noticed its blackout:
>>>>
>>>>> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
>>>
>>> I know it looks weird but it's perfectly fine and expected.
>>
>> It seemed OK, from this node's POV; just the missing config change CB
>> was a bit odd to us.
>>
>>>> our FS cpg_confchg_fn callback[2] was never called, thus it thought
>>>> it was still in sync and nothing ever happened, until another member
>>>> triggered this callback by either leaving or (re-)joining.
>>>
>>> That shouldn't happen
>>
>> So we really should get a config change CB on the paused node after
>> unpausing, with all other (online) nodes in both the leave and join
>> member lists?
>
> Nope, cpg will take care to send two messages (one about the left
> nodes and a second about the joined nodes).
>
>> Just asking again to confirm my thinking and that I did not
>> misunderstand you. :)
>
>>>> Looking in the cpg.c code I saw that there's another callback,
>>>> namely cpg_totem_confchg_fn, which seemed a bit odd as we did not
>>>> set that one...
>>>
>>> This callback is not necessary to have, as long as the information
>>> about the cpg group is enough. cpg_totem_confchg_fn contains
>>> information about all processors (nodes).
>>
>> OK, makes sense.
>>
>>>> (I ain't the original author of the FS and it predates at least
>>>> 2010, so maybe cpg_initialize was not yet deprecated there, and
>>>> thus model_initialize was not used then)
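For anyone following along: the totem callback discussed above is registered through the model-based initialization. The following is a self-contained mock written for this thread; the `cpg_*` types and `cpg_model_initialize` below are simplified stand-ins, not the real declarations from `<corosync/cpg.h>` (the real `cpg_model_v1_data_t` also carries the deliver and confchg callbacks plus flags, and all signatures should be checked against the installed header). It only illustrates the wiring.

```c
/* Mock illustration of registering a totem confchg callback via a
 * model-based init. All "library" pieces here are stand-ins for this
 * sketch; real code would include <corosync/cpg.h> and link libcpg. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* ---- stand-ins for the real corosync declarations ---- */
typedef uint64_t cpg_handle_t;
struct cpg_ring_id { uint32_t nodeid; uint64_t seq; };

typedef void (*cpg_totem_confchg_fn_t)(cpg_handle_t handle,
                                       struct cpg_ring_id ring_id,
                                       uint32_t member_list_entries,
                                       const uint32_t *member_list);

typedef enum { CPG_MODEL_V1 = 1 } cpg_model_t;

struct cpg_model_v1_data_t {
    cpg_model_t model;
    /* deliver and confchg callbacks elided in this mock */
    cpg_totem_confchg_fn_t cpg_totem_confchg_fn;
};

/* fake "library": remembers the registered callbacks so a test can
 * fire them as corosync's dispatch loop would */
static struct cpg_model_v1_data_t registered;

int cpg_model_initialize(cpg_handle_t *handle, cpg_model_t model,
                         struct cpg_model_v1_data_t *data, void *context)
{
    (void)context;
    if (model != CPG_MODEL_V1)
        return -1;
    *handle = 1;
    registered = *data;
    return 0; /* the real API returns CS_OK */
}

/* ---- application side ---- */
static uint64_t last_ring_seq;

static void my_totem_confchg(cpg_handle_t handle, struct cpg_ring_id ring_id,
                             uint32_t n, const uint32_t *members)
{
    (void)handle; (void)n; (void)members;
    /* Remember the ring id: a change here without a matching cpg
     * confchg is exactly the symptom discussed in this thread. */
    last_ring_seq = ring_id.seq;
}

int setup(cpg_handle_t *h)
{
    struct cpg_model_v1_data_t data = {
        .model = CPG_MODEL_V1,
        .cpg_totem_confchg_fn = my_totem_confchg,
    };
    return cpg_model_initialize(h, CPG_MODEL_V1, &data, NULL);
}
```

A client still on the legacy init path would have to migrate to the model-based init before it could observe ring-id changes at all, which matches the observation that older code never set this callback.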
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Thomas,

> [most of the quoted context trimmed; Thomas's full reply follows below]
>
> So we really should get a config change CB on the paused node after
> unpausing, with all other (online) nodes in both the leave and join
> member lists?

Nope, cpg will take care to send two messages (one is about the left
node and the second is about the joined node).

> Just asking again to confirm my thinking and that I did not
> misunderstand you. :)
Re: [ClusterLabs] corosync 2.4 CPG config change callback
First thanks for your answer!

On 3/7/18 11:16 AM, Jan Friesse wrote:
> Thomas,
>
>> Hi,
>>
>> first some background info for my questions I'm going to ask:
>> We use corosync as a basis for our distributed realtime configuration
>> file system (pmxcfs)[1].
>
> nice
>
>> [long corosync log excerpt trimmed; it is quoted in full in the
>> original post below]
>>
>> The problem I saw was that while the leader had a log entry which
>> proved that he noticed his blackout:
>>> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members
>>> joined: 2 3 left: 2 3
>
> I know it looks weird but it's perfectly fine and expected.

It seemed OK, from this node's POV, just the missing config change CB
was a bit odd to us.

>> our FS cpg_confchg_fn callback[2] was never called, thus it thought it
>> was still in sync and nothing ever happened, until another member
>> triggered this callback, by either leaving or (re-)joining.
>
> That shouldn't happen

So we really should get a config change CB on the paused node after
unpausing, with all other (online) nodes in both the leave and join
member lists? Just asking again to confirm my thinking and that I did
not misunderstand you. :)

>> Looking in the cpg.c code I saw that there's another callback, namely
>> cpg_totem_confchg_fn, which seemed a bit odd as we did not set that
>> one... (I ain't the original author of the FS and it predates at least
>> to 2010, so maybe cpg_initialize was not yet deprecated there, and
>> thus model_initialize was not used then)
>
> This callback is not necessary to have as long as information about the
> cpg group is enough. cpg_totem_confchg_fn contains information about
> all processors (nodes).

OK, makes sense.

>> I switched over to using cpg_model_initialize and set the totem_confchg
>> callback, but for the "blacked out node" it gets called twice after the
>> event, but always shows all members...
>>
>> So to finally get to my questions:
>>
>> * why doesn't the cpg_confchg_fn CB get called when a node has a short
>>   blackout (i.e., corosync not being scheduled for a bit of time)?
>>   having all other nodes in its leave and join lists, as the log would
>>   suggest ("Members joined: 2 3 left: 2 3")
Re: [ClusterLabs] corosync 2.4 CPG config change callback
Thomas,

> Hi,
>
> first some background info for my questions I'm going to ask:
> We use corosync as a basis for our distributed realtime configuration
> file system (pmxcfs)[1].

nice

> [log excerpts and background trimmed; quoted in full in the original
> post below]
>
> The problem I saw was that while the leader had a log entry which
> proved that he noticed his blackout:
>> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members
>> joined: 2 3 left: 2 3

I know it looks weird but it's perfectly fine and expected.

> our FS cpg_confchg_fn callback[2] was never called, thus it thought it
> was still in sync and nothing ever happened, until another member
> triggered this callback, by either leaving or (re-)joining.

That shouldn't happen

> Looking in the cpg.c code I saw that there's another callback, namely
> cpg_totem_confchg_fn, which seemed a bit odd as we did not set that
> one...

This callback is not necessary to have as long as information about the
cpg group is enough. cpg_totem_confchg_fn contains information about
all processors (nodes).

> * why doesn't the cpg_confchg_fn CB get called when a node has a short
>   blackout (i.e., corosync not being scheduled for a bit of time)?
>   having all other nodes in its leave and join lists, as the log would
>   suggest ("Members joined: 2 3 left: 2 3")

I believe it was called but not when corosync was paused.

> * If that doesn't seem like a good idea, what can we use to really
>   detect such a node blackout?

It's not possible to detect from the affected node, but it must be
detected from other nodes.

> As a work around I added logic for when through a config change a node
> with a lower ID joined. The node which was leader until then triggers a
> CPG leave, enforcing a cluster-wide config change event to happen, which
> this time also the blacked out node gets and syncs then again.
[ClusterLabs] corosync 2.4 CPG config change callback
Hi,

first some background info for my questions I'm going to ask:
We use corosync as a basis for our distributed realtime configuration
file system (pmxcfs)[1].

We got some reports of a completely hanging FS with the only
correlations being high load, often IO, and most times a message that
corosync did not get scheduled for longer than the token timeout.

See this example from a three node cluster, first:

> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [MAIN ] Corosync main process was not scheduled for 3767.3159 ms (threshold is 1320. ms). Consider token timeout increase.

then we get a few logs that JOIN or LEAVE messages were thrown away
(understandable for this event):

> Mar 01 13:07:56 ceph05-01-public corosync[1638]: warning [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [MAIN ] Corosync main process was not scheduled for 3767.3159 ms (threshold is 1320. ms). Consider token timeout increase.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] JOIN or LEAVE message was thrown away during flush operation.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice [TOTEM ] Failed to receive the leave message. failed: 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [TOTEM ] Failed to receive the leave message. failed: 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice [QUORUM] Members[3]: 1 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: notice [MAIN ] Completed service synchronization, ready to provide service.
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [QUORUM] Members[3]: 1 2 3
> Mar 01 13:07:56 ceph05-01-public corosync[1638]: [MAIN ] Completed service synchronization, ready to provide service.

Until recently we stepped really in the dark and had everything from
kernel bugs to our filesystem logic as a possible cause in mind... But
then we had the luck to trigger this in our test systems and went to
town with gdb on the core dump, finding that we can trigger this by
pausing the leader (from our FS POV) for a short moment (may be shorter
than the token timeout), so that a new leader gets elected, and then
resuming our leader node VM again.

The problem I saw was that while the leader had a log entry which proved
that he noticed his blackout:

> [TOTEM ] A new membership (192.168.21.51:2324) was formed. Members joined: 2 3 left: 2 3

our FS cpg_confchg_fn callback[2] was never called, thus it thought it
was still in sync and nothing ever happened, until another member
triggered this callback, by either leaving or (re-)joining.

Looking in the cpg.c code I saw that there's another callback, namely
cpg_totem_confchg_fn, which seemed a bit odd as we did not set that
one... (I ain't the original author of the FS and it predates at least
to 2010, so maybe cpg_initialize was not yet deprecated there, and thus
model_initialize was not used then)

I switched over to using cpg_model_initialize and set the totem_confchg
callback, but for the "blacked out node" it gets called twice after the
event, but always shows all members...

So to finally get to my questions:

* why doesn't the cpg_confchg_fn CB get called when a node has a short
  blackout (i.e., corosync not being scheduled for a bit of time)?
  having all other nodes in its leave and join lists, as the log would
  suggest ("Members joined: 2 3 left: 2 3")

* If that doesn't seem like a good idea, what can we use to really
  detect such a node blackout?

As a work around I added logic for when through a config change a node
with a lower ID joined. The node which was leader until then triggers a
CPG leave, enforcing a cluster-wide config change event to happen, which
this time also the blacked out node gets and syncs then again.

This works, but does not feel really nice, IMO...

Help would be appreciated, much thanks!

cheers,
Thomas

[1]: https://git.proxmox.com/?p=pve-cluster.git;a=tree;f=data/src;h=e5493468b456ba9fe3f681f387b4cd5b86e7ca08;hb=HEAD
[2]: https://git.proxmox.com/?p=pve-cluster.git;a=blob;f=data/src/dfsm.c;h=cdf473e8226ab9706d693a457ae70c0809afa0fa;hb=HEAD#l1096
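For reference, the switch from the deprecated cpg_initialize to cpg_model_initialize mentioned above looks roughly like this. This is an untested sketch written against the corosync 2.x <corosync/cpg.h> API from memory (field and callback signatures should be checked against the installed header); error handling is omitted and callback bodies are left empty:

```c
#include <corosync/cpg.h>
#include <stddef.h>

/* Untested sketch: register cpg_confchg_fn and cpg_totem_confchg_fn via
 * cpg_model_initialize (CPG_MODEL_V1) instead of cpg_initialize. */

static void deliver_cb(cpg_handle_t handle, const struct cpg_name *group,
                       uint32_t nodeid, uint32_t pid,
                       void *msg, size_t msg_len)
{
    /* application message delivery */
}

static void confchg_cb(cpg_handle_t handle, const struct cpg_name *group,
                       const struct cpg_address *members, size_t n_members,
                       const struct cpg_address *left, size_t n_left,
                       const struct cpg_address *joined, size_t n_joined)
{
    /* cpg group membership changed (the callback discussed above) */
}

static void totem_confchg_cb(cpg_handle_t handle, struct cpg_ring_id ring_id,
                             uint32_t n_members, const uint32_t *members)
{
    /* totem (processor-level) membership changed */
}

static cs_error_t setup_cpg(cpg_handle_t *handle)
{
    cpg_model_v1_data_t model = {
        .cpg_deliver_fn       = deliver_cb,
        .cpg_confchg_fn       = confchg_cb,
        .cpg_totem_confchg_fn = totem_confchg_cb,
    };

    return cpg_model_initialize(handle, CPG_MODEL_V1,
                                (cpg_model_data_t *)&model, NULL);
}
```

After a successful setup_cpg(), the application would still call cpg_join() with its group name and drive the callbacks via cpg_dispatch(), as with the old initialization path.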