Thanks for the detailed explanation, Kishore!

On Jan 4, 2018 16:29, "kishore g" <[email protected]> wrote:
> That's right, SEMI_AUTO will only change the role of a replica. It will never move the replicas.
>
> Instead of answering each question, I will try to explain what happens under the hood.
>
> - Each participant maintains a persistent connection with Zookeeper and sends a heartbeat every X seconds. I think this is called the tick time.
> - When the participant fails to send a heartbeat, there is a disconnect callback from the local ZK client code. Note that this callback does not come from the ZK server; it occurs as soon as the participant fails to send the heartbeat.
> - Let's say the participant connects back to ZK after a period T. Now there are two cases:
>    - T < session timeout: In this case, the participant gets a "connected" callback, its session is still valid, and nothing has changed from the ZK server/Helix controller/spectator point of view.
>    - T > session timeout: This is when the participant gets a "session expiry" callback from ZK. Note that this happens only after the participant reconnects to ZK, so it might be minutes or even hours (depending on the cause of the disconnection from ZK) before the participant gets this callback. But the outside world - ZK server/controller/spectator - will know about the session expiry immediately after the session timeout.
>
> Helix learns about the session expiry and will initiate a mastership transfer from the old master to a new master. It cannot send a Master-to-Slave transition message to the old master, because the old master is disconnected from ZK and unreachable. Helix will automatically change the external view to reflect that the old master is offline for all the replicas it owns. The clients (spectators) will know about this immediately and can stop sending requests to the old master.
>
> Similarly, once the new master has successfully processed the slave-to-master transition, the external view is updated and the clients (spectators) can start routing requests to the new master.
>
> As you pointed out in your email, you can start a timer in the participant after you get a disconnected event and, once the session timeout has elapsed, stop processing requests. We could have done this automatically in Helix, but it really depends on the application. This is typically needed only in the master-slave state model and we could not come up with an automatic way, but we could potentially do it based on a config variable. It would be awesome if you could contribute this feature.
>
> The controller will change all the relevant data structures in ZK when the node goes down (session expires). There is no need for any extra work here.
>
> Thanks,
> Kishore G
>
> On Tue, Jan 2, 2018 at 7:03 PM, Bo Liu <[email protected]> wrote:
>
>> Hi Kishore,
>>
>> Thanks for the answers.
>>
>> My understanding is that Helix in SEMI_AUTO mode won't change the shard mapping automatically, but may change the roles of each replica. Please correct me if this is wrong.
>> I am wondering how SEMI_AUTO Helix will change the roles of the replicas mastered on a participant whose ZK session has just expired. Ideally, we want to first 1) change the role of the master replicas on the expired participant to Slave, and then 2) promote some other live participants to be the new Masters for those partitions.
>> For 1), we can add some timer logic on the participant side to automatically (without receiving requests from the Controller, because the participant can't talk to ZK to receive Controller requests) change the roles to Slave if the ZK session is expired. For 2), the Controller needs to change all the relevant data stored in ZK to indicate that all replicas on the expired participant are Slaves, and then request some live participants to become the new Masters and change the ZK data accordingly. My understanding is that the Helix Controller always sends a message to participants to change their states and then updates the ZK data when responses are received from the participants. This doesn't apply to an expired/dead participant, because a dead participant can't act on a state change request.
>> Please let me know if I missed anything and Helix has a straightforward way to solve this.
>>
>> Thanks,
>> Bo
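A minimal sketch of the participant-side fencing timer discussed in the two messages above, using the raw ZooKeeper Watcher API plus a scheduled task; SessionFencingWatcher, stopServing(), and resumeServing() are hypothetical names for application hooks, not part of Helix:

    import java.util.concurrent.*;
    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;

    // On Disconnected, schedule a task that fences the node after the
    // session timeout; on SyncConnected within the timeout, cancel it.
    public class SessionFencingWatcher implements Watcher {
      private final ScheduledExecutorService timer =
          Executors.newSingleThreadScheduledExecutor();
      private final long sessionTimeoutMs;
      private volatile ScheduledFuture<?> pendingFence;

      public SessionFencingWatcher(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
      }

      @Override
      public void process(WatchedEvent event) {
        switch (event.getState()) {
          case Disconnected:
            // The local ZK client lost the server; the session may still be
            // alive. Assume the worst once sessionTimeoutMs elapses.
            pendingFence = timer.schedule(this::stopServing,
                sessionTimeoutMs, TimeUnit.MILLISECONDS);
            break;
          case SyncConnected:
            // Reconnected within the timeout: session intact, cancel fence.
            if (pendingFence != null) pendingFence.cancel(false);
            resumeServing();
            break;
          case Expired:
            // Session definitely gone; fence immediately. Note this callback
            // can arrive minutes or hours late, per the explanation above.
            stopServing();
            break;
          default:
            break;
        }
      }

      private void stopServing()   { /* app hook: reject writes, step down */ }
      private void resumeServing() { /* app hook: resume normal processing */ }
    }

The local timer fires at roughly the same moment the ZK server expires the session, so the old master fences itself at about the time the controller begins promoting a new one.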
>> On Tue, Jan 2, 2018 at 12:49 PM, kishore g <[email protected]> wrote:
>>
>>> Hi Bo,
>>>
>>> Sorry for the delay in responding.
>>>
>>> 1. That's right, you can pretty much use the existing code in Helix to generate the initial mapping. In fact, just set the mode to SEMI_AUTO and call the rebalance API once - this will set up the initial ideal state and ensure that the MASTERs/SLAVEs are evenly distributed. You can also invoke the rebalance API any time the number of nodes changes (nodes are added to or removed from the cluster).
>>> 2. This won't be a problem with SEMI_AUTO mode, since the ideal state is fixed and is changed only by explicitly invoking the rebalance API. DROPPED messages will be sent only when the mapping in the ideal state changes.
>>> 3. Yes, if you have thousands of participants, it is recommended to run the rebalancer in the controller.
>>> 4. With SEMI_AUTO mode, the data will never be deleted from the participants. In case of a ZK network partition, the participants will be unreachable for the duration of the outage. Once the connection is re-established, everything should return to normal. Typically, this can be avoided by ensuring that the ZK nodes are on different racks.
>>>
>>> thanks,
>>> Kishore G
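The one-shot SEMI_AUTO setup from point 1 above might look roughly like the following sketch using ZKHelixAdmin; the cluster/resource names and counts are placeholders, and it assumes the cluster, its participants, and the MasterSlave state model definition are already registered in ZK:

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;

    public class SemiAutoSetup {
      public static void main(String[] args) {
        String zkAddr = "localhost:2181"; // placeholder
        String cluster = "MyCluster";     // placeholder
        String resource = "MyStore";      // placeholder
        int numPartitions = 64;
        int numReplicas = 3;              // 1 MASTER + 2 SLAVEs per partition

        HelixAdmin admin = new ZKHelixAdmin(zkAddr);

        // Register the resource in SEMI_AUTO mode: Helix keeps the replica
        // placement fixed in the ideal state but still assigns the
        // MASTER/SLAVE roles among the listed replicas.
        admin.addResource(cluster, resource, numPartitions, "MasterSlave",
            "SEMI_AUTO");

        // One-shot rebalance: computes the initial preference lists so that
        // masters and slaves are spread evenly. Re-invoke only when nodes
        // are added to or removed from the cluster.
        admin.rebalance(cluster, resource, numReplicas);
      }
    }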
>>> On Thu, Dec 28, 2017 at 1:53 PM, Bo Liu <[email protected]> wrote:
>>>
>>>> Hi Kishore,
>>>>
>>>> The fullmatix example is very helpful. For my original questions, I think we can still let Helix decide the role assignment. We just need to make the selected slave catch up before promoting it to the new Master in the state transition handler function. We can also request the other Slaves to pull updates from this new Master in the same handler function. We will add a constraint to allow at most one transition per partition to avoid a potential race. Please let us know if this solution has any other implications.
>>>>
>>>> After reading some code in both fullmatix and Helix, I still have a few questions.
>>>>
>>>> 1. I plan to use SEMI_AUTO mode to manage our Master-Slave replicated storage system running on AWS EC2. A customized rebalancer will be used to generate the shard mapping, and we will rely on Helix to determine the master-slave role assignment (to automatically restore write availability when a host is down). From the code, it seems to me that Helix will make a host serve a Master replica only if the host is at the top of the preference list for that partition. If this is the case, does the customized rebalancer need to carefully decide the host order in the preference lists to evenly distribute the Master replicas? I just wanted to know how much work we can save by reusing the role assignment logic from SEMI_AUTO mode compared to CUSTOMIZED mode.
>>>> 2. I noticed that all non-alive hosts are excluded from the ResourceAssignment returned by computeBestPossiblePartitionState(). Does that mean Helix will mark all replicas on non-alive hosts DROPPED, or just won't try to send any state transition messages to non-alive hosts? Partition replicas in our system are expensive to rebuild, so we'd like not to drop all the data on a host whose ZK session is expired. What's the recommended way to achieve this? If a participant reconnects to ZK with a new session ID, will it have to restart from scratch?
>>>> 3. I found that fullmatix runs the rebalancer in the participants. If we have thousands of participants, is it better to run it in the controller, since ZK will have less load synchronizing a few controllers than thousands of participants?
>>>> 4. How do we protect the system during events like a network partition or ZK being unavailable? For example, suppose 1/3 of the participants can't connect to ZK and their ZK sessions expire. If possible, we want to avoid committing suicide on those participants and instead keep their data in a reusable state.
>>>>
>>>> I am still new to Helix. Sorry for the overwhelming questions.
>>>>
>>>> Thanks,
>>>> Bo
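For question 1 above, one way a customized rebalancer could spread Masters evenly is to rotate which instance heads each partition's preference list. A hand-rolled sketch, not Helix's built-in rebalancer; all names are illustrative, and a real rebalancer would also weigh capacity, zones, and the current assignment:

    import java.util.*;

    // Build SEMI_AUTO preference lists that rotate the head instance, so
    // MASTER replicas land evenly across hosts.
    public class RotatingPreferenceLists {
      public static Map<String, List<String>> compute(
          List<String> instances, int numPartitions, int numReplicas) {
        Map<String, List<String>> preferenceLists = new LinkedHashMap<>();
        int n = instances.size();
        for (int p = 0; p < numPartitions; p++) {
          List<String> prefs = new ArrayList<>(numReplicas);
          // Partition p's head is instance (p mod n); the remaining
          // replicas follow in ring order.
          for (int r = 0; r < numReplicas; r++) {
            prefs.add(instances.get((p + r) % n));
          }
          preferenceLists.put("MyStore_" + p, prefs);
        }
        return preferenceLists;
      }

      public static void main(String[] args) {
        // Three hosts, six partitions, two replicas each: each host heads
        // two preference lists, so the masters are spread evenly.
        System.out.println(compute(
            Arrays.asList("host1_12000", "host2_12000", "host3_12000"), 6, 2));
      }
    }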
>>>> On Sun, Dec 24, 2017 at 8:54 PM, Bo Liu <[email protected]> wrote:
>>>>
>>>>> Thank you, will take a look later.
>>>>>
>>>>> On Dec 24, 2017 19:26, "kishore g" <[email protected]> wrote:
>>>>>
>>>>>> https://github.com/kishoreg/fullmatix/tree/master/mysql-cluster
>>>>>>
>>>>>> Take a look at this recipe.
>>>>>>
>>>>>> On Sun, Dec 24, 2017 at 5:40 PM Bo Liu <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Helix team,
>>>>>>>
>>>>>>> We have an application which runs with 1 Master and multiple Slaves per shard. If a host is dead, we want to move the master role from the dead host to one of the slave hosts. In the meantime, we need to inform all the other Slaves to start pulling updates from the new Master instead of the old one. How do you suggest we implement this with Helix?
>>>>>>>
>>>>>>> Another related question: can we add some logic to make Helix choose the new Master based on 1) which slave has the most recent updates, and 2) trying to evenly distribute the Master shards (only if more than one Slave has the most recent updates)?
>>>>>>>
>>>>>>> --
>>>>>>> Best regards,
>>>>>>> Bo
>>>>
>>>> --
>>>> Best regards,
>>>> Bo
>>
>> --
>> Best regards,
>> Bo
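The catch-up-before-promote plan from the Dec 28 message could hang off the SLAVE to MASTER transition handler of a Helix participant state model. A minimal sketch, assuming the standard MasterSlave state model; catchUpFromLastMaster() and notifySlavesOfNewMaster() are hypothetical application hooks:

    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.Message;
    import org.apache.helix.participant.statemachine.StateModel;
    import org.apache.helix.participant.statemachine.StateModelInfo;
    import org.apache.helix.participant.statemachine.Transition;

    // Promotion blocks until the replica has caught up. Helix records the
    // transition as complete (and updates the external view) only after the
    // handler returns, so spectators never see a stale master.
    @StateModelInfo(initialState = "OFFLINE",
                    states = {"MASTER", "SLAVE", "OFFLINE"})
    public class CatchUpMasterSlaveModel extends StateModel {
      private final String partition;

      public CatchUpMasterSlaveModel(String partition) {
        this.partition = partition;
      }

      @Transition(from = "SLAVE", to = "MASTER")
      public void onBecomeMasterFromSlave(Message msg, NotificationContext ctx) {
        // 1. Drain any remaining updates from the previous master (or from
        //    peers) before taking writes; blocks until caught up.
        catchUpFromLastMaster(partition);
        // 2. Tell the other slaves to re-point replication at this node.
        notifySlavesOfNewMaster(partition);
      }

      @Transition(from = "MASTER", to = "SLAVE")
      public void onBecomeSlaveFromMaster(Message msg, NotificationContext ctx) {
        // Stop taking writes; resume pulling from whichever node is promoted.
      }

      @Transition(from = "OFFLINE", to = "SLAVE")
      public void onBecomeSlaveFromOffline(Message msg, NotificationContext ctx) {
        // Open the local replica and start replication as a slave.
      }

      private void catchUpFromLastMaster(String partition)   { /* app hook */ }
      private void notifySlavesOfNewMaster(String partition) { /* app hook */ }
    }

Combined with the at-most-one-transition-per-partition constraint Bo mentions, this keeps two replicas from catching up and claiming mastership concurrently.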