Hey folks, I was wondering what guidance there would be on how to handle the following scenario:
1. We have a distributed DB with N shards. Partitioning, failover, etc. are handled via Helix using the Master-Slave model with WagedRebalancer in Full_Auto mode.
2. Let's say Shard 1 gets assigned to Host 1 and we successfully transition to the MASTER state.
3. It stays alive and happy for a period of time, but then a failure occurs that doesn't take the host offline yet prevents it from fully functioning. (A good example is corruption of the shard because of disk failure, where parts of the SSD have worn out.)
4. We detect that we're unable to write to disk and want to move that shard elsewhere.

What would be the recommended way of doing step 4 using Helix? I'm unsure of the best approach, because we would have to communicate to Helix that the shard has transitioned to an ERROR state on that host so that it can be rebalanced elsewhere. Up to this point we've only reacted to state transitions sent to Helix, so I'm curious how feedback like this would be given to the Controller so it could rebalance correctly.

Thanks,
Will
