Re: [ClusterLabs] Reliability questions on the new QDevices in uneven node count Setups
On 25/07/16 16:27, Klaus Wenninger wrote:
> On 07/25/2016 04:56 PM, Thomas Lamprecht wrote:
>> Thanks for the fast reply :)
>>
>> On 07/25/2016 03:51 PM, Christine Caulfield wrote:
>>> On 25/07/16 14:29, Thomas Lamprecht wrote:
>>>> Hi all,
>>>>
>>>> I'm currently testing the new features of corosync 2.4, especially
>>>> qdevices. First tests show quite nice results, such as keeping quorum
>>>> on the single node left out of a three-node cluster.
>>>>
>>>> But what I'm a bit worried about is what happens if the server where
>>>> qnetd runs, or the qdevice daemon, fails. In that case the cluster
>>>> cannot afford any other loss of a node in my three-node setup, as
>>>> expected votes are 5 and thus 3 are needed for quorum, which I cannot
>>>> reach if qnetd is not running or has failed.
>>> We're looking into ways of making this more resilient. It might be
>>> possible to cluster qnetd (though this is not currently supported) in
>>> a separate cluster from the arbitrated one, obviously.
>>
>> Yeah, I saw that in the QDevice document; that would be a way.
>>
>> The qnetd daemons would then act as a cluster of their own, I guess, as
>> they would need to communicate which node sees which qnetd daemon, so
>> that a decision about the quorate partition can be made.
>>
>> But it still binds the reliability of the cluster to that of one node,
>> adding a dependency. Failures of components outside the cluster, which
>> would otherwise have no effect on cluster behaviour, may now affect it,
>> which could be a problem.
>>
>> I know that's a worst-case scenario, but with only one qnetd running on
>> a single (external) node it can happen. And if the reliability of the
>> node running qnetd is the same as that of each cluster node, the
>> reliability of the whole cluster in the three-node case would be, much
>> simplified (if I remember my introductory course on this topic
>> correctly):
>>
>> Without qnetd: 1 - ((1 - R1) * (1 - R2) * (1 - R3))
>>
>> With qnetd: (1 - ((1 - R1) * (1 - R2) * (1 - R3))) * Rqnetd
>>
>> where R1, R2, R3 are the reliabilities of the cluster nodes and Rqnetd
>> is the reliability of the node running qnetd. While that's a really,
>> really simplified model that doesn't quite depict reality, the basic
>> point is that the reliability of the whole cluster becomes dependent on
>> that of the node running qnetd, no?
>>
> With lms and ffsplit I guess the calculation is not that simple anymore ...
>
> Correct me if I'm wrong, but I think a bottom line to understanding the
> benefits of qdevice is that classic quorum generation basically takes a
> snapshot of the situation at a certain time and derives its reactions
> from that, whereas qdevice tries to benefit from knowledge of the past
> (i.e. of how we got into the current situation).
>

Actually no. qdevice is totally stateless (which is why I think it will
cluster well when we get round to it). It makes the best decision it can
based on the fact that it should have a full view of all nodes in the
cluster, regardless of whether they can see each other; and if they can't
see qdevice then they don't get the vote anyway.

Chrissie

___
Users mailing list: Users@clusterlabs.org
http://clusterlabs.org/mailman/listinfo/users

Project Home: http://www.clusterlabs.org
Getting started: http://www.clusterlabs.org/doc/Cluster_from_Scratch.pdf
Bugs: http://bugs.clusterlabs.org
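Thomas's simplified reliability model is easy to check numerically. A minimal sketch in Python (the 0.99 per-host reliability figures below are illustrative assumptions, not measurements):

```python
def cluster_reliability(node_r, qnetd_r=None):
    """Thomas's simplified model: the cluster survives if at least one
    node survives; with qnetd, the whole cluster additionally depends
    on the qnetd host staying up."""
    p_all_nodes_fail = 1.0
    for r in node_r:
        p_all_nodes_fail *= (1.0 - r)
    base = 1.0 - p_all_nodes_fail      # 1 - (1-R1)(1-R2)(1-R3)
    if qnetd_r is None:
        return base
    return base * qnetd_r              # multiplied by Rqnetd

# hypothetical reliability of 0.99 for every host involved
without_qnetd = cluster_reliability([0.99, 0.99, 0.99])
with_qnetd = cluster_reliability([0.99, 0.99, 0.99], qnetd_r=0.99)
print(without_qnetd)   # ~0.999999
print(with_qnetd)      # ~0.98999901
```

Under these (hypothetical) numbers the model indeed says the qnetd host's reliability dominates, which is exactly the concern raised above; the model ignores that the cluster tolerates qnetd loss as long as no node fails, so it is a pessimistic bound.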
Re: [ClusterLabs] Reliability questions on the new QDevices in uneven node count Setups
On 07/25/2016 04:56 PM, Thomas Lamprecht wrote:
> Thanks for the fast reply :)
>
> On 07/25/2016 03:51 PM, Christine Caulfield wrote:
>> On 25/07/16 14:29, Thomas Lamprecht wrote:
>>> Hi all,
>>>
>>> I'm currently testing the new features of corosync 2.4, especially
>>> qdevices. First tests show quite nice results, such as keeping quorum
>>> on the single node left out of a three-node cluster.
>>>
>>> But what I'm a bit worried about is what happens if the server where
>>> qnetd runs, or the qdevice daemon, fails. In that case the cluster
>>> cannot afford any other loss of a node in my three-node setup, as
>>> expected votes are 5 and thus 3 are needed for quorum, which I cannot
>>> reach if qnetd is not running or has failed.
>> We're looking into ways of making this more resilient. It might be
>> possible to cluster qnetd (though this is not currently supported) in
>> a separate cluster from the arbitrated one, obviously.
>
> Yeah, I saw that in the QDevice document; that would be a way.
>
> The qnetd daemons would then act as a cluster of their own, I guess, as
> they would need to communicate which node sees which qnetd daemon, so
> that a decision about the quorate partition can be made.
>
> But it still binds the reliability of the cluster to that of one node,
> adding a dependency. Failures of components outside the cluster, which
> would otherwise have no effect on cluster behaviour, may now affect it,
> which could be a problem.
>
> I know that's a worst-case scenario, but with only one qnetd running on
> a single (external) node it can happen. And if the reliability of the
> node running qnetd is the same as that of each cluster node, the
> reliability of the whole cluster in the three-node case would be, much
> simplified (if I remember my introductory course on this topic
> correctly):
>
> Without qnetd: 1 - ((1 - R1) * (1 - R2) * (1 - R3))
>
> With qnetd: (1 - ((1 - R1) * (1 - R2) * (1 - R3))) * Rqnetd
>
> where R1, R2, R3 are the reliabilities of the cluster nodes and Rqnetd
> is the reliability of the node running qnetd. While that's a really,
> really simplified model that doesn't quite depict reality, the basic
> point is that the reliability of the whole cluster becomes dependent on
> that of the node running qnetd, no?
>

With lms and ffsplit I guess the calculation is not that simple anymore ...

Correct me if I'm wrong, but I think a bottom line to understanding the
benefits of qdevice is that classic quorum generation basically takes a
snapshot of the situation at a certain time and derives its reactions
from that, whereas qdevice tries to benefit from knowledge of the past
(i.e. of how we got into the current situation).

>> The LMS algorithm is quite smart about how it doles out its vote and
>> can handle isolation from the main qnetd, provided that the main core
>> of the cluster (the majority in a split) retains quorum, but any more
>> serious changes to the cluster config will cause it to be withdrawn.
>> So in this case you should find that your 3-node cluster will continue
>> to work in the absence of the qnetd server or link, provided you don't
>> lose any nodes.
>
> Yes, I read that in the documents and saw it during testing too,
> really good work!
>
> The point of my mail was exactly the failure of qnetd itself and the
> resulting situation that the cluster then cannot afford to lose any
> node, while without qnetd it could afford to lose (n - 1) / 2 nodes.
>
> Or do I also have to enable quorum.last_man_standing together with
> quorum.wait_for_all to allow scaling down the expected votes if qnetd
> fails completely? I will test that.
>
> I just want to be sure that my thoughts are correct, or at least not
> completely flawed: qnetd as it is makes sense in an even-node-count
> cluster with the ffsplit algorithm, but not in an uneven-node-count
> cluster, if the reliability of the node running qnetd cannot be
> guaranteed, i.e. without adding HA to the service (VM or container)
> running qnetd.
>
> best regards,
> Thomas
>
>> In a 3-node setup, obviously, LMS is more appropriate than ffsplit
>> anyway.
>>
>> Chrissie
>>
>>> So in this case I'm bound to the reliability of the server providing
>>> the qnetd service: if it fails, I cannot afford to lose any other node
>>> in my three-node example, or in any other example with an uneven node
>>> count, as the qdevice vote subsystem provides node count - 1 votes.
>>>
>>> So if I see it correctly, QDevices only make sense with even node
>>> counts, and maybe especially 2-node setups: if qnetd works we have one
>>> more node which may fail, and if qnetd fails we are as good as without
>>> it, as qnetd provides only one vote there.
>>>
>>> Am I missing something? Any thoughts on that?
Re: [ClusterLabs] Reliability questions on the new QDevices in uneven node count Setups
Thanks for the fast reply :)

On 07/25/2016 03:51 PM, Christine Caulfield wrote:
> On 25/07/16 14:29, Thomas Lamprecht wrote:
>> Hi all,
>>
>> I'm currently testing the new features of corosync 2.4, especially
>> qdevices. First tests show quite nice results, such as keeping quorum
>> on the single node left out of a three-node cluster.
>>
>> But what I'm a bit worried about is what happens if the server where
>> qnetd runs, or the qdevice daemon, fails. In that case the cluster
>> cannot afford any other loss of a node in my three-node setup, as
>> expected votes are 5 and thus 3 are needed for quorum, which I cannot
>> reach if qnetd is not running or has failed.
> We're looking into ways of making this more resilient. It might be
> possible to cluster qnetd (though this is not currently supported) in
> a separate cluster from the arbitrated one, obviously.

Yeah, I saw that in the QDevice document; that would be a way.

The qnetd daemons would then act as a cluster of their own, I guess, as
they would need to communicate which node sees which qnetd daemon, so
that a decision about the quorate partition can be made.

But it still binds the reliability of the cluster to that of one node,
adding a dependency. Failures of components outside the cluster, which
would otherwise have no effect on cluster behaviour, may now affect it,
which could be a problem.

I know that's a worst-case scenario, but with only one qnetd running on
a single (external) node it can happen. And if the reliability of the
node running qnetd is the same as that of each cluster node, the
reliability of the whole cluster in the three-node case would be, much
simplified (if I remember my introductory course on this topic
correctly):

Without qnetd: 1 - ((1 - R1) * (1 - R2) * (1 - R3))

With qnetd: (1 - ((1 - R1) * (1 - R2) * (1 - R3))) * Rqnetd

where R1, R2, R3 are the reliabilities of the cluster nodes and Rqnetd
is the reliability of the node running qnetd. While that's a really,
really simplified model that doesn't quite depict reality, the basic
point is that the reliability of the whole cluster becomes dependent on
that of the node running qnetd, no?

> The LMS algorithm is quite smart about how it doles out its vote and
> can handle isolation from the main qnetd, provided that the main core
> of the cluster (the majority in a split) retains quorum, but any more
> serious changes to the cluster config will cause it to be withdrawn.
> So in this case you should find that your 3-node cluster will continue
> to work in the absence of the qnetd server or link, provided you don't
> lose any nodes.

Yes, I read that in the documents and saw it during testing too,
really good work!

The point of my mail was exactly the failure of qnetd itself and the
resulting situation that the cluster then cannot afford to lose any
node, while without qnetd it could afford to lose (n - 1) / 2 nodes.

Or do I also have to enable quorum.last_man_standing together with
quorum.wait_for_all to allow scaling down the expected votes if qnetd
fails completely? I will test that.

I just want to be sure that my thoughts are correct, or at least not
completely flawed: qnetd as it is makes sense in an even-node-count
cluster with the ffsplit algorithm, but not in an uneven-node-count
cluster, if the reliability of the node running qnetd cannot be
guaranteed, i.e. without adding HA to the service (VM or container)
running qnetd.

best regards,
Thomas

> In a 3-node setup, obviously, LMS is more appropriate than ffsplit
> anyway.
>
> Chrissie
>
>> So in this case I'm bound to the reliability of the server providing
>> the qnetd service: if it fails, I cannot afford to lose any other node
>> in my three-node example, or in any other example with an uneven node
>> count, as the qdevice vote subsystem provides node count - 1 votes.
Re: [ClusterLabs] Reliability questions on the new QDevices in uneven node count Setups
On 25/07/16 14:51, Christine Caulfield wrote:
> On 25/07/16 14:29, Thomas Lamprecht wrote:
>> Hi all,
>>
>> I'm currently testing the new features of corosync 2.4, especially
>> qdevices. First tests show quite nice results, such as keeping quorum
>> on the single node left out of a three-node cluster.
>>
>> But what I'm a bit worried about is what happens if the server where
>> qnetd runs, or the qdevice daemon, fails. In that case the cluster
>> cannot afford any other loss of a node in my three-node setup, as
>> expected votes are 5 and thus 3 are needed for quorum, which I cannot
>> reach if qnetd is not running or has failed.
>
> We're looking into ways of making this more resilient. It might be
> possible to cluster qnetd (though this is not currently supported) in
> a separate cluster from the arbitrated one, obviously.
>
> The LMS algorithm is quite smart about how it doles out its vote and
> can handle isolation from the main qnetd, provided that the main core
> of the cluster (the majority in a split) retains quorum, but any more
> serious changes to the cluster config will cause it to be withdrawn.
> So in this case you should find that your 3-node cluster will continue
> to work in the absence of the qnetd server or link, provided you don't
> lose any nodes.
>

I should also have said that you'll need to enable 'wait_for_all' for
this to work.

Chrissie

> In a 3-node setup, obviously, LMS is more appropriate than ffsplit
> anyway.
>
> Chrissie
>
>> So in this case I'm bound to the reliability of the server providing
>> the qnetd service: if it fails, I cannot afford to lose any other node
>> in my three-node example, or in any other example with an uneven node
>> count, as the qdevice vote subsystem provides node count - 1 votes.
>>
>> So if I see it correctly, QDevices only make sense with even node
>> counts, and maybe especially 2-node setups: if qnetd works we have one
>> more node which may fail, and if qnetd fails we are as good as without
>> it, as qnetd provides only one vote there.
>>
>> Am I missing something? Any thoughts on that?
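The setup discussed in this thread (three-node cluster, qdevice with the lms algorithm, wait_for_all enabled) would correspond roughly to a quorum section along these lines in corosync.conf; the hostname is a placeholder and this fragment is a sketch, not a tested configuration:

```
quorum {
    provider: corosync_votequorum
    wait_for_all: 1
    device {
        model: net
        # lms grants node_count - 1 votes, i.e. 2 in a 3-node cluster
        votes: 2
        net {
            host: qnetd.example.com
            algorithm: lms
        }
    }
}
```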
Re: [ClusterLabs] Reliability questions on the new QDevices in uneven node count Setups
On 25/07/16 14:29, Thomas Lamprecht wrote:
> Hi all,
>
> I'm currently testing the new features of corosync 2.4, especially
> qdevices. First tests show quite nice results, such as keeping quorum
> on the single node left out of a three-node cluster.
>
> But what I'm a bit worried about is what happens if the server where
> qnetd runs, or the qdevice daemon, fails. In that case the cluster
> cannot afford any other loss of a node in my three-node setup, as
> expected votes are 5 and thus 3 are needed for quorum, which I cannot
> reach if qnetd is not running or has failed.

We're looking into ways of making this more resilient. It might be
possible to cluster qnetd (though this is not currently supported) in a
separate cluster from the arbitrated one, obviously.

The LMS algorithm is quite smart about how it doles out its vote and can
handle isolation from the main qnetd, provided that the main core of the
cluster (the majority in a split) retains quorum, but any more serious
changes to the cluster config will cause it to be withdrawn. So in this
case you should find that your 3-node cluster will continue to work in
the absence of the qnetd server or link, provided you don't lose any
nodes.

In a 3-node setup, obviously, LMS is more appropriate than ffsplit
anyway.

Chrissie

> So in this case I'm bound to the reliability of the server providing
> the qnetd service: if it fails, I cannot afford to lose any other node
> in my three-node example, or in any other example with an uneven node
> count, as the qdevice vote subsystem provides node count - 1 votes.
>
> So if I see it correctly, QDevices only make sense with even node
> counts, and maybe especially 2-node setups: if qnetd works we have one
> more node which may fail, and if qnetd fails we are as good as without
> it, as qnetd provides only one vote there.
>
> Am I missing something? Any thoughts on that?
[ClusterLabs] Reliability questions on the new QDevices in uneven node count Setups
Hi all,

I'm currently testing the new features of corosync 2.4, especially
qdevices. First tests show quite nice results, such as keeping quorum on
the single node left out of a three-node cluster.

But what I'm a bit worried about is what happens if the server where
qnetd runs, or the qdevice daemon, fails. In that case the cluster cannot
afford any other loss of a node in my three-node setup, as expected votes
are 5 and thus 3 are needed for quorum, which I cannot reach if qnetd is
not running or has failed.

So in this case I'm bound to the reliability of the server providing the
qnetd service: if it fails, I cannot afford to lose any other node in my
three-node example, or in any other example with an uneven node count, as
the qdevice vote subsystem provides node count - 1 votes.

So if I see it correctly, QDevices only make sense with even node counts,
and maybe especially 2-node setups: if qnetd works we have one more node
which may fail, and if qnetd fails we are as good as without it, as qnetd
provides only one vote there.

Am I missing something? Any thoughts on that?

best regards,
Thomas
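The vote arithmetic in this message can be spelled out in a few lines of Python; the helper below is purely illustrative, not part of corosync:

```python
def votes(n_nodes, qdevice_votes=0):
    """Expected votes and the resulting quorum threshold
    (simple majority: floor(expected / 2) + 1). 'tolerated' is the
    number of node losses survivable when the qdevice vote is absent
    (qnetd down, or no qdevice configured)."""
    expected = n_nodes + qdevice_votes
    quorum = expected // 2 + 1
    tolerated = n_nodes - quorum
    return expected, quorum, tolerated

# 3 nodes, no qdevice: expected 3, quorum 2, one node may fail
print(votes(3))        # (3, 2, 1)

# 3 nodes plus a qdevice granting node_count - 1 = 2 votes:
# expected 5, quorum 3; with qnetd itself down, no node may fail
print(votes(3, 2))     # (5, 3, 0)
```

This reproduces the concern above: in the uneven-node-count case, losing qnetd leaves the cluster unable to tolerate any further node failure, whereas without a qdevice it would tolerate (n - 1) / 2 losses.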