Hello!

I'm currently on a project that uses Apache Helix 0.8.4 (with a pending
upgrade to Helix 1.0.1) to distribute partitions across a number of hosts
(currently 32 partitions, 16 hosts). Once a partition is allocated to a
host a bunch of expensive initialization steps occur, and the
system proceeds to do a bunch of computations for the partition on a
scheduled interval. We seek to minimize initializations when possible.

If a system goes down (due to either maintenance or failure), the
partitions get reshuffled. Currently we are using
the CrushEdRebalanceStrategy in the hopes of minimizing partition
movements. However, we noticed that unlike the earlier AutoRebalancer
scheme, the CrushEdRebalanceStrategy does not limit the number of
partitions per node. In our case, this can cause severe out-of-memory
issues, which will then cascade as node after node gets more and more
partitions that it cannot handle. We have on rare occasion seen our entire
cluster fail as a result, and then our production engineers must manually -
and carefully - bring the system back online. This is undesirable.

Does Helix have a rebalancing strategy that minimizes partition movement
yet also permits enforcement of maximum partitions per node?

Thanks,
- Phong X. Nguyen

Reply via email to