Question regarding a partition that goes into a failure state

William Morgan Mon, 29 Aug 2022 11:45:21 -0700

Hey folks,

I'm looking for guidance on how to handle the following scenario:


  1.  We have a distributed DB with N shards. Partitioning, failover, etc. 
are handled by Helix using the MasterSlave state model with the 
WagedRebalancer in FULL_AUTO mode.
  2.  Let's say Shard 1 gets assigned to Host 1 and successfully transitions 
to the MASTER state.
  3.  It stays alive and healthy for a period of time, but then a failure 
occurs that doesn't take the host offline yet prevents it from fully 
functioning. (A good example is corruption of the shard caused by a disk 
failure, where parts of the SSD have worn out.)
  4.  We can detect that writes to disk are failing and want to move that 
shard to another host.

What would be the recommended way of doing step 4 using Helix?

I'm unsure of the best approach, because we would have to tell Helix that the 
shard has transitioned to an ERROR state on that host so it can be rebalanced 
elsewhere. Up to this point we've only reacted to state transitions that Helix 
sends us, so I'm curious how feedback like this should be given to the 
Controller so it can rebalance correctly.

Thanks,
Will
