Hi Kishore,

The fullmatix example is very helpful. For my original questions, I think we can still let Helix decide role assignment. We just need to make the selected Slave catch up before promoting it to the new Master inside the state transition handler, and we can ask the other Slaves to start pulling updates from the new Master in the same handler. We would also add a constraint that allows at most one pending state transition per partition, to avoid potential races. A rough sketch of both is below. Please let us know if this approach has any other implications.
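To make sure we are on the same page, here is a minimal sketch of the participant state model we have in mind. The catchUpFromPeers / promoteToMaster / notifySlavesOfNewMaster calls are placeholders for our own replication logic, not Helix APIs:

import org.apache.helix.NotificationContext;
import org.apache.helix.model.Message;
import org.apache.helix.participant.statemachine.StateModel;
import org.apache.helix.participant.statemachine.StateModelInfo;
import org.apache.helix.participant.statemachine.Transition;

@StateModelInfo(initialState = "OFFLINE", states = {"MASTER", "SLAVE", "OFFLINE"})
public class ReplicatedStoreStateModel extends StateModel {

  @Transition(from = "SLAVE", to = "MASTER")
  public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
    String partition = message.getPartitionName();
    // Block until this replica has applied all the updates it can reach
    // (our own replication logic, not a Helix call).
    catchUpFromPeers(partition);
    // Start accepting writes as the new Master.
    promoteToMaster(partition);
    // Ask the remaining Slaves to re-point replication at this host
    // (out-of-band RPC in our system).
    notifySlavesOfNewMaster(partition);
  }

  @Transition(from = "MASTER", to = "SLAVE")
  public void onBecomeSlaveFromMaster(Message message, NotificationContext context) {
    demoteToSlave(message.getPartitionName());
  }

  // Placeholders for our storage engine's replication hooks.
  private void catchUpFromPeers(String partition) { /* ... */ }
  private void promoteToMaster(String partition) { /* ... */ }
  private void notifySlavesOfNewMaster(String partition) { /* ... */ }
  private void demoteToSlave(String partition) { /* ... */ }
}

For the per-partition throttle, we were thinking of a message constraint along these lines (based on our reading of the throttling tutorial, so please correct us if the attributes are wrong):

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.ClusterConstraints.ConstraintType;
import org.apache.helix.model.builder.ConstraintItemBuilder;

public class SetupTransitionConstraint {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("zk1:2181");
    ConstraintItemBuilder builder = new ConstraintItemBuilder();
    // Allow at most one STATE_TRANSITION message in flight per partition.
    builder.addConstraintAttribute("MESSAGE_TYPE", "STATE_TRANSITION")
           .addConstraintAttribute("PARTITION", ".*")
           .addConstraintAttribute("CONSTRAINT_VALUE", "1");
    admin.setConstraint("MyCluster", ConstraintType.MESSAGE_CONSTRAINT,
        "onePendingTransitionPerPartition", builder.build());
  }
}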
After reading some code in both fullmatix and Helix, I still have a few questions.

1. I plan to use SEMI_AUTO mode to manage our Master-Slave replicated storage system running on AWS EC2. A customized rebalancer will generate the shard mapping, and we will rely on Helix to determine Master-Slave role assignment (so that write availability is restored automatically when a host is down). From the code, it seems to me that Helix will make a host serve Master replicas only if it is at the top of the preference list for every partition it serves. If that is the case, does the customized rebalancer need to carefully order the hosts in each preference list to distribute Master replicas evenly? I mainly want to know how much work we can save by reusing the role-assignment logic of SEMI_AUTO mode compared to CUSTOMIZED mode. (I have put a small sketch of what I mean at the bottom of this mail, below the quoted thread.)

2. I noticed that all non-alive hosts are excluded from the ResourceAssignment returned by computeBestPossiblePartitionState(). Does that mean Helix will mark the replicas on non-alive hosts as DROPPED, or just that it won't send any state transition messages to those hosts? Partition replicas in our system are expensive to rebuild, so we would like to avoid dropping all the data on a host just because its ZK session expired. What is the recommended way to achieve this? If a participant reconnects to ZK with a new session ID, will it have to restart from scratch?

3. I see that fullmatix runs the rebalancer inside the participants. If we have thousands of participants, is it better to run it in the controller, since ZK would have less load synchronizing a few controllers than thousands of participants?

4. How can we protect the system during events like a network partition or ZK being unavailable? For example, suppose 1/3 of the participants cannot reach ZK and their ZK sessions expire. If possible, we want to avoid having those participants "commit suicide" and instead keep their data in a reusable state.

I am still new to Helix, so sorry for the overwhelming number of questions.

Thanks,
Bo

On Sun, Dec 24, 2017 at 8:54 PM, Bo Liu <[email protected]> wrote:

> Thank you, will take a look later.
>
> On Dec 24, 2017 19:26, "kishore g" <[email protected]> wrote:
>
>> https://github.com/kishoreg/fullmatix/tree/master/mysql-cluster
>>
>> Take a look at this recipe.
>>
>> On Sun, Dec 24, 2017 at 5:40 PM Bo Liu <[email protected]> wrote:
>>
>>> Hi Helix team,
>>>
>>> We have an application which runs with 1 Master and multiple Slaves per
>>> shard. If a host is dead, we want to move the Master role from the dead
>>> host to one of the Slave hosts. In the meantime, we need to inform all
>>> other Slaves to start pulling updates from the new Master instead of the
>>> old one. How do you suggest we implement this with Helix?
>>>
>>> Another related question: can we add some logic to make Helix choose the
>>> new Master based on 1) which Slave has the most recent updates and 2)
>>> trying to distribute Master shards evenly (only if more than one Slave
>>> has the most recent updates)?
>>>
>>> --
>>> Best regards,
>>> Bo
>>>

--
Best regards,
Bo
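P.S. To make question 1 concrete, this is roughly how I imagine the customized rebalancer would publish preference lists in SEMI_AUTO mode, assuming I have understood it correctly. All of the cluster, resource, and host names are made up, and the lists are rotated so each host is at the head of roughly the same number of partitions:

import java.util.Arrays;

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;

public class PublishPreferenceLists {
  public static void main(String[] args) {
    HelixAdmin admin = new ZKHelixAdmin("zk1:2181");

    IdealState idealState = admin.getResourceIdealState("MyCluster", "MyResource");
    idealState.setRebalanceMode(IdealState.RebalanceMode.SEMI_AUTO);

    // The head of each preference list is the Master candidate, so the lists
    // are rotated to spread Masters across hosts.
    idealState.setPreferenceList("MyResource_0",
        Arrays.asList("host1_12000", "host2_12000", "host3_12000"));
    idealState.setPreferenceList("MyResource_1",
        Arrays.asList("host2_12000", "host3_12000", "host1_12000"));
    idealState.setPreferenceList("MyResource_2",
        Arrays.asList("host3_12000", "host1_12000", "host2_12000"));

    admin.setResourceIdealState("MyCluster", "MyResource", idealState);
  }
}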
