Thanks for the quick response, Kishore. This issue is definitely tied to the condition that partitions * replicas < NODE_COUNT. If all running nodes have a "piece" of the resource, then they behave well when the LEADER node goes away.
Is it possible to use Helix to manage a set of resources where that condition is true? I.e. where the *total* number of partitions/replicas in the cluster is greater than the node count, but each individual resource has a small number of partitions/replicas. (Calling rebalance on every liveInstance change does not seem like a good solution, because you would have to iterate through all resources in the cluster and rebalance each individually; a rough sketch of that loop is at the bottom of this mail.)

On Wed, Oct 19, 2016 at 12:52 PM, kishore g <[email protected]> wrote:

> I think this might be a corner case when partitions * replicas <
> TOTAL_NUMBER_OF_NODES. Can you try with many partitions and replicas and
> check if the issue still exists.
>
> On Wed, Oct 19, 2016 at 11:53 AM, Michael Craig <[email protected]> wrote:
>
>> I've noticed that partitions/replicas assigned to disconnected instances
>> are not automatically redistributed to live instances. What's the correct
>> way to do this?
>>
>> For example, given this setup with Helix 0.6.5:
>> - 1 resource
>> - 2 replicas
>> - LeaderStandby state model
>> - FULL_AUTO rebalance mode
>> - 3 nodes (N1 is Leader, N2 is Standby, N3 is just sitting)
>>
>> Then drop N1:
>> - N2 becomes LEADER
>> - Nothing happens to N3
>>
>> Naively, I would have expected N3 to transition from Offline to Standby,
>> but that doesn't happen.
>>
>> I can force redistribution from GenericHelixController#onLiveInstanceChange by
>> - dropping non-live instances from the cluster
>> - calling rebalance
>>
>> The instance dropping seems pretty unsafe! Is there a better way?
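To make that concrete, here is roughly the loop I'd rather avoid, written against the 0.6.x HelixAdmin / LiveInstanceChangeListener APIs. The ZooKeeper address, cluster name, and spectator instance name below are placeholders, and this sketch only re-runs the rebalancer for each resource; it does not drop instances:

import java.util.List;

import org.apache.helix.HelixAdmin;
import org.apache.helix.HelixManager;
import org.apache.helix.HelixManagerFactory;
import org.apache.helix.InstanceType;
import org.apache.helix.LiveInstanceChangeListener;
import org.apache.helix.NotificationContext;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState;
import org.apache.helix.model.LiveInstance;

public class RebalanceAllOnLiveInstanceChange implements LiveInstanceChangeListener {
  // Placeholder coordinates -- substitute your own ZK address and cluster name.
  private static final String ZK_ADDR = "localhost:2181";
  private static final String CLUSTER = "MY_CLUSTER";

  private final HelixAdmin admin = new ZKHelixAdmin(ZK_ADDR);

  @Override
  public void onLiveInstanceChange(List<LiveInstance> liveInstances,
                                   NotificationContext changeContext) {
    // Re-run the rebalancer for every resource with its current replica
    // count. This is the per-resource iteration I'd like to avoid.
    for (String resource : admin.getResourcesInCluster(CLUSTER)) {
      IdealState idealState = admin.getResourceIdealState(CLUSTER, resource);
      String replicas = idealState.getReplicas();
      if (replicas == null) {
        continue; // skip resources without a replica count
      }
      try {
        admin.rebalance(CLUSTER, resource, Integer.parseInt(replicas));
      } catch (NumberFormatException e) {
        // e.g. replicas set to ANY_LIVEINSTANCE; leave those alone
      }
    }
  }

  public static void main(String[] args) throws Exception {
    // Watch the cluster as a spectator so the callback fires on every
    // liveInstance change.
    HelixManager manager = HelixManagerFactory.getZKHelixManager(
        CLUSTER, "rebalance-watcher", InstanceType.SPECTATOR, ZK_ADDR);
    manager.connect();
    manager.addLiveInstanceChangeListener(new RebalanceAllOnLiveInstanceChange());
    Thread.currentThread().join(); // keep the watcher alive
  }
}

Even if that behaves, it still touches every resource in the cluster on every liveInstance change, which is what I'm hoping Helix can handle for me.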
