Re: Questions around Error Handling and Shuffling

kishore g Mon, 08 Oct 2018 18:01:33 -0700

1. That's a good feature to have. One problem with it is knowing when to
stop assigning the partition to another node: Helix has no way to tell
whether an ERROR is transient or permanent. We do retry on the same node;
we retry 3 times before the partition goes into the ERROR state.
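
To make the retry-then-ERROR behavior concrete, here is a minimal sketch. It
is not Helix's implementation; BoundedRetry, Outcome and runWithRetries are
hypothetical names, and treating "3 times" as three attempts in total is an
assumption.

```java
import java.util.concurrent.Callable;

// Hypothetical illustration only; none of these names come from Helix.
class BoundedRetry {
    enum Outcome { SUCCEEDED, ERROR }

    // Runs the transition up to maxAttempts times on the same node.
    // Returns ERROR once every attempt has thrown.
    static Outcome runWithRetries(Callable<Void> transition, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                transition.call();
                return Outcome.SUCCEEDED;
            } catch (Exception e) {
                // The exception alone does not say whether the failure is
                // transient or permanent, so the only option is to try again.
            }
        }
        // All attempts failed: give up and report ERROR.
        return Outcome.ERROR;
    }
}
```

A transition that fails twice and then succeeds ends in SUCCEEDED; one that
always throws ends in ERROR after maxAttempts calls.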

2. You can disable that partition on that node and Helix will reassign that
partition to another node. See HelixAdmin for more info on the API.

    public abstract void enablePartition(boolean enabled, String clusterName,
        String instanceName, String resourceName, List<String> partitionNames)

Disable or enable a list of partitions on an instance
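
A rough sketch of how an instance might use this call to hand off a single
bad partition. PartitionAdmin below is a hypothetical stand-in that mirrors
only the signature quoted above, and InMemoryPartitionAdmin is a fake so the
sketch runs without a cluster; in a real deployment that role would be played
by HelixAdmin. PartitionFailureReporter and the cluster, instance and
resource names are also made up for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in mirroring the enablePartition signature quoted above.
interface PartitionAdmin {
    void enablePartition(boolean enabled, String clusterName, String instanceName,
                         String resourceName, List<String> partitionNames);
}

// In-memory fake: it only records which partitions are disabled on which instance.
class InMemoryPartitionAdmin implements PartitionAdmin {
    private final Set<String> disabled = new HashSet<>();

    private static String key(String cluster, String instance,
                              String resource, String partition) {
        return cluster + "/" + instance + "/" + resource + "/" + partition;
    }

    @Override
    public void enablePartition(boolean enabled, String clusterName, String instanceName,
                                String resourceName, List<String> partitionNames) {
        for (String partition : partitionNames) {
            String k = key(clusterName, instanceName, resourceName, partition);
            if (enabled) {
                disabled.remove(k);
            } else {
                disabled.add(k);
            }
        }
    }

    boolean isDisabled(String cluster, String instance, String resource, String partition) {
        return disabled.contains(key(cluster, instance, resource, partition));
    }
}

// Hypothetical helper an instance could call when one partition it serves goes bad.
class PartitionFailureReporter {
    private final PartitionAdmin admin;
    private final String clusterName;
    private final String instanceName;

    PartitionFailureReporter(PartitionAdmin admin, String clusterName, String instanceName) {
        this.admin = admin;
        this.clusterName = clusterName;
        this.instanceName = instanceName;
    }

    // Disable one partition on this instance so it can be reassigned elsewhere.
    void reportFailure(String resourceName, String partitionName) {
        admin.enablePartition(false, clusterName, instanceName, resourceName,
                Collections.singletonList(partitionName));
    }

    // Re-enable the partition once the instance is able to serve it again.
    void clearFailure(String resourceName, String partitionName) {
        admin.enablePartition(true, clusterName, instanceName, resourceName,
                Collections.singletonList(partitionName));
    }
}
```

Only the named partition is affected; the other partitions on the same
instance stay enabled.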

3. I am not sure this is a good idea. Why would you want nodes to sit
idle? Distributing load evenly among all nodes is a good requirement, right?

Helix's rebalancing strategy tries to minimize partition movement, which
works well for unplanned failures. For planned operations such as initial
cluster startup, rolling restart, or bulk expansion, you can either use the
delayed rebalancer or pause/unpause the controller.

thanks

Kishore G

On Mon, Oct 8, 2018 at 5:11 PM <[email protected]> wrote:

> Hi Helix community,
>
> Few questions regarding error handling at the partition level and
> rebalancing. I am using automatic rebalance mode with Leader/Standby
> transition.
>
> 1. Error can occur during state transition from STANDBY to LEADER. If an
> exception is thrown, the state changes to ERROR. However, the partition is
> not reassigned to another node immediately. The partition stays at ERROR
> state until a new node comes up. I wonder if there is a way to achieve the
> reassignment earlier and automatically (or periodic retry on same node). Is
> there a way to automatically transition from ERROR to DROPPED state?
>
> 2. During regular service of a partition, how can an instance signal an
> error only for one partition it is serving ? I would like for that single
> partition to be reassigned to another instance (or periodically retried on
> same instance if others do not have room).
>
> 3. It would be ideal if there was a setting for minimum partitions per
> node to prevent shuffle of partitions among instances when new nodes arrive
> into the cluster. Is such a rebalancing (or workaround) already present? I
> would rather have a few instances sit around idly as a spare instance ready
> for failover instead of having partitions shuffle around given that it
> takes some time to warm up a partition.
>
> Thanks,
> Vish
>
