Hi Will. To summarize, do you mean the node is broken, but somehow its connection stayed up and it kept holding the leadership of that replica?
It depends on the scenario:

- If the machine is gone and cannot do anything (like accept ssh or Helix messages), the only thing you can do is bounce the machine.
- If it is just a partial failure (like a disk failure) but the main process is still functioning, then you can disable that partition for that instance via the HelixAdmin API, and the leadership will be switched out (a rough sketch of that call is appended after the quoted message below).

Please let me know if I understand your story correctly.

Best,
Junkai

On Mon, Aug 29, 2022 at 11:45 AM William Morgan <[email protected]> wrote:

> Hey folks,
>
> I was wondering what guidance there would be on how to handle the
> following scenario:
>
> 1. We have a distributed DB with N shards, with the partitioning,
> failover, etc. handled via Helix using the Master-Slave model with
> WagedRebalancer and FULL_AUTO.
> 2. Let's say Shard 1 gets assigned to Host 1 and we successfully
> transition to the MASTER state.
> 3. It continues to be alive and happy for a period of time, but then a
> failure occurs which doesn't take the host offline but prevents the host
> from fully functioning. (A good example is corruption of the shard
> because of a disk failure where parts of the SSD have worn out.)
> 4. We're able to see that we're unable to write to disk and want to
> rebalance that shard elsewhere.
>
> What would be the recommended way of doing step 4 using Helix?
>
> I'm unsure what the best way is, because we would have to communicate to
> Helix that the shard has transitioned to an ERROR state on that host so it
> can be rebalanced elsewhere. Up to this point we've only reacted to state
> transitions sent by Helix, so I would be curious how feedback like this
> would be given to the Controller so it could rebalance correctly.
>
> Thanks,
> Will

--
Junkai Xue
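
For reference, a minimal sketch of that HelixAdmin call in Java is below. The ZooKeeper address and the cluster, instance, resource, and partition names are placeholders for illustration; disabling the partition with enablePartition(false, ...) should let the controller move the MASTER replica to another healthy instance.

    import java.util.Collections;

    import org.apache.helix.manager.zk.ZKHelixAdmin;

    public class DisableBadPartition {
      public static void main(String[] args) {
        // Placeholder identifiers -- substitute your own ZK address,
        // cluster, instance, resource, and partition names.
        String zkAddress = "zk-host:2181";
        String clusterName = "MyCluster";
        String instanceName = "host1_12913";
        String resourceName = "MyDB";
        String partitionName = "MyDB_1";

        ZKHelixAdmin admin = new ZKHelixAdmin(zkAddress);
        try {
          // Disable only this partition on this instance; the controller
          // rebalances and brings up MASTER on another instance.
          admin.enablePartition(false, clusterName, instanceName, resourceName,
              Collections.singletonList(partitionName));
        } finally {
          admin.close();
        }
      }
    }

Once the disk is replaced and the shard is healthy again, the same call with enabled set to true should re-enable the partition on that instance.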
