That's right, SEMI_AUTO will only change the role of a replica. It will never move the replicas.
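To make that concrete, here is a minimal sketch (cluster, resource, and instance names are hypothetical, and it assumes the cluster and the MasterSlave state model definition already exist): the preference lists in the ideal state pin where each replica lives, and in SEMI_AUTO mode Helix only decides which of those instances is MASTER and which are SLAVEs.

    import java.util.Arrays;

    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.IdealState;
    import org.apache.helix.model.IdealState.RebalanceMode;

    public class SemiAutoSetup {
      public static void main(String[] args) {
        // Hypothetical names; replace with your own cluster/resource/instances.
        String zkAddr = "localhost:2181";
        String cluster = "MyCluster";
        String resource = "MyDB";

        HelixAdmin admin = new ZKHelixAdmin(zkAddr);

        // Create the resource in SEMI_AUTO mode with the MasterSlave state model.
        admin.addResource(cluster, resource, 4 /* partitions */, "MasterSlave",
            RebalanceMode.SEMI_AUTO.toString());

        // One-time rebalance: computes a preference list per partition and spreads
        // MASTERs evenly. Placement then stays fixed until rebalance is called again.
        admin.rebalance(cluster, resource, 3 /* replicas */);

        // Alternatively, a custom rebalancer can pin the placement itself:
        IdealState is = admin.getResourceIdealState(cluster, resource);
        is.setPreferenceList(resource + "_0",
            Arrays.asList("node1_12000", "node2_12000", "node3_12000"));
        admin.setResourceIdealState(cluster, resource, is);
      }
    }

Calling rebalance again (for example after adding or removing nodes) is the only thing that changes placement; a session expiry alone never does.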
Instead of answering each question, I will try to explain what happens under the hood.

- Each participant maintains a persistent connection with ZooKeeper and sends a
  heartbeat every X seconds (I think this interval is called the tick time).
- When the participant fails to send a heartbeat, the local ZK client code fires a
  "disconnected" callback. Note that this callback does not come from the ZK server;
  it occurs as soon as the participant fails to send the heartbeat.
- Let's say the participant reconnects to ZK after a period T. There are two cases:
  - T < session timeout: the participant gets a "connected" callback, its session is
    still valid, and nothing has changed from the ZK server / Helix controller /
    spectator point of view.
  - T > session timeout: the participant gets a "session expiry" callback from ZK.
    Note that this happens only after the participant reconnects to ZK, so it might be
    minutes or even hours (depending on the cause of the disconnection) before the
    participant gets this callback. The outside world (ZK server, controller,
    spectators), however, will know about the session expiry immediately after the
    session timeout.

Helix learns about the session expiry and initiates a mastership transfer from the old
master to the new master. It cannot send a Master-to-Slave transition message to the
old master, because the old master is disconnected from ZK and is unreachable. Helix
will automatically change the external view to reflect that the old master is offline
for all the replicas it owns. The clients (spectators) will know about this immediately
and can stop sending requests to the old master. Similarly, once the new master
successfully processes the Slave-to-Master transition, the external view is updated and
the clients (spectators) can start routing requests to the new master (a small
spectator-side routing sketch is included at the very end of this message).

As you pointed out in your email, you can start a timer in the participant after you
get a disconnected event and, once the session timeout has elapsed, stop processing
requests (a rough sketch of this idea follows the quoted thread below). We could have
done this automatically in Helix, but it really depends on the application. This is
typically needed only in the master-slave state model, and we could not come up with an
automatic way; we could potentially have done it based on a config variable. It would
be awesome if you could contribute this feature.

The controller will change all the relevant data structures in ZK when the node goes
down (its session expires). There is no need for any extra work here.

Thanks,
Kishore G

On Tue, Jan 2, 2018 at 7:03 PM, Bo Liu <[email protected]> wrote:

> Hi Kishore,
>
> Thanks for the answers.
>
> My understanding is that Helix in SEMI_AUTO mode won't change the shard
> mapping automatically, but may change the roles of each replica. Please
> correct me if this is wrong.
> I am wondering how SEMI_AUTO Helix will change the roles of replicas
> mastered on a participant whose ZK session has just expired? Ideally, we
> want to first 1) change the role of the master replicas on the expired
> participant to Slave, and then 2) promote some other live participant to be
> the new Master for those partitions.
> For 1), we can add some timer logic on the participant side to
> automatically (without receiving requests from the Controller, because it
> can't talk to ZK to receive Controller requests) change their roles to
> Slave if its ZK session is expired.
> For 2), the Controller needs to change all relevant data stored in ZK to
> indicate that all replicas on the expired participant are Slaves, and then
> request some live participants to become the new Masters and change the ZK
> data to indicate the new Masters. My understanding is that the Helix
> Controller always sends messages to participants to change their states and
> then updates the ZK data when responses are received from the participants.
> This doesn't apply to an expired/dead participant, because a dead
> participant can't act on a state change request.
> Please let me know if I missed anything and whether Helix has a
> straightforward way to solve it.
>
> Thanks,
> Bo
>
>
> On Tue, Jan 2, 2018 at 12:49 PM, kishore g <[email protected]> wrote:
>
>> Hi Bo,
>>
>> Sorry for the delay in responding.
>>
>> 1. That's right, you can pretty much use the existing code in Helix
>> to generate the initial mapping. In fact, just set the mode to SEMI_AUTO
>> and call the rebalance API once - this will set up the initial ideal state
>> and ensure that the MASTERs/SLAVEs are evenly distributed. You can also
>> invoke the rebalance API any time the number of nodes changes (add/remove
>> nodes from the cluster).
>> 2. This won't be a problem with SEMI_AUTO mode since the ideal state
>> is fixed and is changed only by explicitly invoking the rebalance API.
>> DROPPED messages will be sent only when the mapping in the ideal state
>> changes.
>> 3. Yes, if you have thousands of participants, it is recommended to
>> run the rebalancer in the controller.
>> 4. With SEMI_AUTO mode, the data will never be deleted from the
>> participants. In case of a ZK network partition, the participants will be
>> unreachable for the duration of the outage. Once the connection is
>> re-established, everything should return to normal. Typically, this can be
>> avoided by ensuring that the ZK nodes are on different racks.
>>
>> thanks,
>> Kishore G
>>
>>
>>
>> On Thu, Dec 28, 2017 at 1:53 PM, Bo Liu <[email protected]> wrote:
>>
>>> Hi Kishore,
>>>
>>> The fullmatix example is very helpful. For my original questions, I
>>> think we can still let Helix decide the role assignment. We just need to
>>> make the selected slave catch up before promoting it to the new Master in
>>> the state transition handler function. We can also request the other
>>> Slaves to pull updates from this new Master in the same handler function.
>>> We will add a constraint to allow at most one transition per partition to
>>> avoid a potential race. Please let us know if this solution has any other
>>> implications.
>>>
>>> After reading some code in both fullmatix and helix, I still have a few
>>> questions.
>>>
>>> 1. I plan to use SEMI_AUTO mode to manage our Master-Slave replicated
>>> storage system running on AWS EC2. A customized rebalancer will be used to
>>> generate the shard mapping, and we rely on Helix to determine the
>>> master-slave role assignment (to automatically restore write availability
>>> when a host is down). From the code, it seems to me that Helix will make a
>>> host serve the Master replica of a partition only if it is at the top of
>>> that partition's preference list. If this is the case, the customized
>>> rebalancer needs to carefully decide the host order in the preference
>>> lists to evenly distribute Master replicas? Just wanted to know how much
>>> work we can save by reusing the role assignment logic from SEMI_AUTO mode
>>> compared to CUSTOMIZED mode.
>>>
>>> 2. I noticed that all non-alive hosts will be excluded from the
>>> ResourceAssignment returned by computeBestPossiblePartitionState().
>>> Does that mean Helix will mark all non-alive hosts DROPPED, or will it
>>> just not send any state transition messages to non-alive hosts? Partition
>>> replicas in our system are expensive to rebuild, so we'd like to not drop
>>> all the data on a host when the host's ZK session expires. What's the
>>> recommended way to achieve this? If a participant reconnects to ZK with a
>>> new session ID, will it have to restart from scratch?
>>>
>>> 3. I found that fullmatix runs the rebalancer in the participants. If we
>>> have thousands of participants, is it better to run it in the controller?
>>> Because ZK will have less load synchronizing a few controllers than
>>> thousands of participants.
>>>
>>> 4. How do we protect the system during events like a network partition
>>> or ZK being unavailable? For example, 1/3 of the participants can't
>>> connect to ZK and thus their ZK sessions expire. If possible, we want to
>>> avoid committing suicide on those participants and instead keep their data
>>> in a reusable state.
>>>
>>> I am still new to Helix. Sorry for the overwhelming questions.
>>>
>>> Thanks,
>>> Bo
>>>
>>>
>>> On Sun, Dec 24, 2017 at 8:54 PM, Bo Liu <[email protected]> wrote:
>>>
>>>> Thank you, will take a look later.
>>>>
>>>> On Dec 24, 2017 19:26, "kishore g" <[email protected]> wrote:
>>>>
>>>>> https://github.com/kishoreg/fullmatix/tree/master/mysql-cluster
>>>>>
>>>>> Take a look at this recipe.
>>>>>
>>>>>
>>>>> On Sun, Dec 24, 2017 at 5:40 PM Bo Liu <[email protected]> wrote:
>>>>>
>>>>>> Hi Helix team,
>>>>>>
>>>>>> We have an application which runs with 1 Master and multiple Slaves
>>>>>> per shard. If a host is dead, we want to move the master role from the
>>>>>> dead host to one of the slave hosts. In the meantime, we need to inform
>>>>>> all other Slaves to start pulling updates from the new Master instead
>>>>>> of the old one. How do you suggest we implement this with Helix?
>>>>>>
>>>>>> Another related question: can we add some logic to make Helix choose
>>>>>> the new Master based on 1) which slave has the most recent updates and
>>>>>> 2) trying to evenly distribute Master shards (only if more than one
>>>>>> Slave has the most recent updates)?
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best regards,
>>>>>> Bo
>>>>>>
>>>
>>> --
>>> Best regards,
>>> Bo
>>>
>
> --
> Best regards,
> Bo
>
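The participant-side timer mentioned in the reply above could look roughly like the following sketch. It is written against the plain ZooKeeper client API for illustration (a Helix participant would hook the equivalent disconnect/expiry callbacks from its own connection); the session timeout constant and the stopServing() hook are application-specific assumptions, not Helix APIs.

    import java.io.IOException;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.ScheduledFuture;
    import java.util.concurrent.TimeUnit;

    import org.apache.zookeeper.WatchedEvent;
    import org.apache.zookeeper.Watcher;
    import org.apache.zookeeper.ZooKeeper;

    public class DisconnectGuard implements Watcher {
      private static final int SESSION_TIMEOUT_MS = 30_000; // assumed timeout

      private final ScheduledExecutorService timer =
          Executors.newSingleThreadScheduledExecutor();
      private ScheduledFuture<?> fence;

      public ZooKeeper connect(String zkAddr) throws IOException {
        return new ZooKeeper(zkAddr, SESSION_TIMEOUT_MS, this);
      }

      @Override
      public synchronized void process(WatchedEvent event) {
        switch (event.getState()) {
          case Disconnected:
            // Local callback only: fires as soon as heartbeats stop, before the
            // server-side session actually expires. Arm a timer for the timeout.
            fence = timer.schedule(this::stopServing,
                SESSION_TIMEOUT_MS, TimeUnit.MILLISECONDS);
            break;
          case SyncConnected:
            // Reconnected within the session timeout: the session is still valid.
            if (fence != null) {
              fence.cancel(false);
              fence = null;
            }
            break;
          case Expired:
            // Delivered only after reconnecting; the rest of the cluster has
            // already moved mastership, so make sure nothing is served here.
            stopServing();
            break;
          default:
            break;
        }
      }

      private void stopServing() {
        // Hypothetical application hook: stop accepting writes for MASTER partitions.
      }
    }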

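On the spectator side, reacting to the external-view changes described in the reply above is mostly a matter of using Helix's RoutingTableProvider. A minimal sketch, again with hypothetical cluster/resource/partition names:

    import java.util.List;

    import org.apache.helix.HelixManager;
    import org.apache.helix.HelixManagerFactory;
    import org.apache.helix.InstanceType;
    import org.apache.helix.model.InstanceConfig;
    import org.apache.helix.spectator.RoutingTableProvider;

    public class MasterRouter {
      public static void main(String[] args) throws Exception {
        // Hypothetical names; replace with your own.
        HelixManager spectator = HelixManagerFactory.getZKHelixManager(
            "MyCluster", "router1", InstanceType.SPECTATOR, "localhost:2181");
        spectator.connect();

        // Keeps an in-memory routing table in sync with the external view.
        RoutingTableProvider routing = new RoutingTableProvider();
        spectator.addExternalViewChangeListener(routing);

        // When the old master's session expires it drops out of this list, and the
        // new MASTER appears once its SLAVE->MASTER transition has completed.
        List<InstanceConfig> masters = routing.getInstances("MyDB", "MyDB_0", "MASTER");
        if (!masters.isEmpty()) {
          InstanceConfig master = masters.get(0);
          System.out.println("Send writes to " + master.getHostName() + ":" + master.getPort());
        }
      }
    }

Because the routing table is driven by the external view, the old master disappears from the MASTER list as soon as the controller marks it offline, and the new master shows up once its transition completes.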