Ok, that makes sense. I'll make use of the disable API, as it looks like it 
does what I require.

Thanks!

Will
________________________________
From: Junkai Xue <[email protected]>
Sent: Monday, August 29, 2022 4:09 PM
To: [email protected] <[email protected]>
Subject: Re: Question regarding a partition that goes into a failure state

The disable API only moves the Master state of that replica to another host; 
the replica on the original host is then marked as OFFLINE.
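
For reference, the call looks roughly like this (a sketch; the ZK address and 
the cluster, instance, resource, and partition names are all placeholders):

    import java.util.Collections;
    import org.apache.helix.HelixAdmin;
    import org.apache.helix.manager.zk.ZKHelixAdmin;

    public class DisableShardExample {
      public static void main(String[] args) {
        HelixAdmin admin = new ZKHelixAdmin("localhost:2181"); // placeholder ZK address

        // enablePartition(false, ...) disables the partition on that instance;
        // the controller then moves Master elsewhere and marks this replica OFFLINE.
        admin.enablePartition(false, "MyCluster", "host1_12000",
            "MyResource", Collections.singletonList("MyResource_1"));

        admin.close();
      }
    }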

If you prefer to move the replica out entirely, I would suggest using a hybrid 
assignment model. That means you can create a resource config under the 
RESOURCE folder and, for that specific partition, define a preference list the 
way SEMI_AUTO does. The result is a hybrid model: the other partitions still 
follow the FULL_AUTO WAGED algorithm, but that partition's assignment depends 
on the preference list you provide. Otherwise, Helix does not provide any API 
that allows you to "move" a replica to another host in FULL_AUTO.
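
A rough sketch of setting such a preference list, assuming ResourceConfig's 
preference-list support and using placeholder names throughout:

    import java.util.Arrays;
    import java.util.Collections;
    import org.apache.helix.ConfigAccessor;
    import org.apache.helix.model.ResourceConfig;

    public class PinPartitionExample {
      public static void main(String[] args) {
        ConfigAccessor configAccessor = new ConfigAccessor("localhost:2181"); // placeholder ZK address

        ResourceConfig resourceConfig = new ResourceConfig("MyResource");
        // Assumed API: pin MyResource_1 to these instances, like a SEMI_AUTO
        // preference list; all other partitions stay under WAGED FULL_AUTO.
        resourceConfig.setPreferenceLists(Collections.singletonMap(
            "MyResource_1", Arrays.asList("host2_12000", "host3_12000")));

        configAccessor.setResourceConfig("MyCluster", "MyResource", resourceConfig);
      }
    }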

Best,

Junkai

On Mon, Aug 29, 2022 at 12:40 PM William Morgan 
<[email protected]> wrote:
The idea is that we have N hosts, each with M shards assigned to it by Helix. 
There can be a situation where the host is still healthy overall, but a shard 
on that host isn't.

So, if I'm understanding correctly, the way to communicate to Helix that the 
shard should be moved off the host would be to mark it as disabled via the 
HelixAdmin API.

To delve further into this idea of marking a partition disabled on a host: 
what does this mean in the context of Helix? Just that the shard for that 
resource can no longer be scheduled onto that host?

Thanks for the help!

Will


________________________________
From: Junkai Xue <[email protected]>
Sent: Monday, August 29, 2022 2:55 PM
To: [email protected]<mailto:[email protected]> 
<[email protected]<mailto:[email protected]>>
Subject: Re: Question regarding a partition that goes into a failure state

Hi Will,

To summarize: do you mean the node is broken, but somehow its connection 
stayed alive and it kept holding the leadership of that replica?

It depends on the scenario:

  *   If the machine is gone and cannot do anything (no ssh, not accepting 
Helix messages), the only thing you can do is bounce the machine.
  *   If it is just a partial failure (like a disk failure) but the main 
process is still functioning, then you can disable that partition for that 
instance using the HelixAdmin API. The leadership will be switched away.

Please let me know if I understand your story correctly.

Best,

Junkai

On Mon, Aug 29, 2022 at 11:45 AM William Morgan 
<[email protected]> wrote:
Hey folks,

I was wondering what guidance there would be on how to handle the following 
scenario:

  1.  We have a distributed DB with N shards, with the partitioning, failover, 
etc. handled via Helix using the MasterSlave state model with the WAGED 
rebalancer in FULL_AUTO mode.
  2.  Let's say Shard 1 gets assigned to Host 1 and we successfully transition 
to the MASTER state.
  3.  It continues to be alive and happy for a period of time, but then a 
failure occurs that doesn't take the host offline yet prevents it from fully 
functioning. (A good example is corruption of the shard due to disk failure, 
where parts of the SSD have worn out.)
  4.  We can see that we're unable to write to disk and want to rebalance that 
shard elsewhere.

What would be the recommended way of doing step 4 using Helix?

I'm unsure of the best way to do this, because we would have to communicate to 
Helix that the shard has transitioned to an ERROR state on that host so it can 
be rebalanced elsewhere. Up to this point we've only reacted to state 
transition messages sent by Helix, so I'd be curious how feedback like this 
could be given to the Controller so it can rebalance correctly.
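
For concreteness, the only ERROR path we know of today is failing a state 
transition, roughly like this (a sketch; the class and the health-check helper 
are made up):

    import org.apache.helix.NotificationContext;
    import org.apache.helix.model.Message;
    import org.apache.helix.participant.statemachine.StateModel;
    import org.apache.helix.participant.statemachine.StateModelInfo;
    import org.apache.helix.participant.statemachine.Transition;

    @StateModelInfo(initialState = "OFFLINE", states = {"MASTER", "SLAVE", "OFFLINE"})
    public class ShardStateModel extends StateModel {
      @Transition(from = "SLAVE", to = "MASTER")
      public void onBecomeMasterFromSlave(Message message, NotificationContext context) {
        // Throwing from a transition callback makes Helix mark this replica as
        // ERROR on this instance; but that only works during a transition, not
        // for a shard that is already MASTER.
        if (!diskIsWritable()) {
          throw new IllegalStateException("disk not writable; refusing MASTER");
        }
      }

      private boolean diskIsWritable() {
        return true; // hypothetical health check
      }
    }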

Thanks,
Will




--
Junkai Xue
