Glad that helps!

Best,
Junkai

On Tue, Mar 30, 2021 at 11:50 AM Phong X. Nguyen <[email protected]> wrote:

> Hi everyone,
>
> We upgraded to Helix 1.0.1 and switched to the WAGED rebalancer, and it
> worked extremely well for us. Thank you again for all your assistance!
>
> On Mon, Feb 8, 2021 at 4:03 PM Wang Jiajun <[email protected]> wrote:
>
>> Hi Phong,
>>
>> The WAGED rebalancer respects MAX_PARTITIONS_PER_INSTANCE automatically,
>> so you probably don't need any specific configuration. However, you do
>> need to be on the new version to use the WAGED rebalancer.
>>
>> Also, to confirm what you said: I believe the consistent-hashing-based
>> strategies (Crush and CrushEd) do not respect MAX_PARTITIONS_PER_INSTANCE.
>> I guess there was some design concern.
>>
>> Anyway, using WAGED is the current recommendation : ) Could you please
>> give it a try and let us know if it is a good fit?
>>
>> Best Regards,
>> Jiajun
>>
>> On Mon, Feb 8, 2021 at 3:55 PM Xue Junkai <[email protected]> wrote:
>>
>>> CRUSHED tries its best to evenly distribute the replicas. So you don't
>>> need identical assignments for each of the instances? If that's the
>>> case, I would suggest you migrate to the WAGED rebalancer with
>>> constraints set up. For more details, you can refer to:
>>> https://github.com/apache/helix/wiki/Weight-aware-Globally-Evenly-distributed-Rebalancer
>>>
>>> Best,
>>>
>>> Junkai
>>>
>>> On Mon, Feb 8, 2021 at 3:28 PM Phong X. Nguyen <[email protected]> wrote:
>>>
>>>> I believe it's #2, but perhaps I should explain. Here's a simplified
>>>> view of mapFields:
>>>>
>>>> "mapFields" : {
>>>>   "partition_11" : {
>>>>     "server05.verizonmedia.com" : "ONLINE"
>>>>   },
>>>>   "partition_22" : {
>>>>     "server05.verizonmedia.com" : "ONLINE"
>>>>   },
>>>> },
>>>>
>>>> Server 5 has partitions (replicas?) 11 and 22 assigned to it, and
>>>> that's currently fine. We could, for example, have partition_17 also
>>>> assigned, which would be fine, but if a fourth one were to be assigned
>>>> then we stand a high likelihood of crashing.
>>>>
>>>> Bootstrapping replicas is also expensive, so we'd like to minimize
>>>> that as well.
>>>>
>>>> On Mon, Feb 8, 2021 at 3:14 PM Xue Junkai <[email protected]> wrote:
>>>>
>>>>> Thanks Phong. Can you clarify which you are looking for?
>>>>> 1. The number of parallel state transitions for bootstrapping replicas.
>>>>> 2. A limit on the number of replicas held by an instance.
>>>>>
>>>>> Best,
>>>>>
>>>>> Junkai
>>>>>
>>>>> On Mon, Feb 8, 2021 at 3:06 PM Phong X. Nguyen <[email protected]> wrote:
>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> I'm currently on a project that uses Apache Helix 0.8.4 (with a
>>>>>> pending upgrade to Helix 1.0.1) to distribute partitions across a
>>>>>> number of hosts (currently 32 partitions, 16 hosts). Once a partition
>>>>>> is allocated to a host, a bunch of expensive initialization steps
>>>>>> occur, and the system proceeds to do a bunch of computations for the
>>>>>> partition on a scheduled interval. We seek to minimize
>>>>>> initializations when possible.
>>>>>>
>>>>>> If a system goes down (due to either maintenance or failure), the
>>>>>> partitions get reshuffled. Currently we are using the
>>>>>> CrushEdRebalanceStrategy in the hope of minimizing partition
>>>>>> movements. However, we noticed that, unlike the earlier AutoRebalancer
>>>>>> scheme, the CrushEdRebalanceStrategy does not limit the number of
>>>>>> partitions per node. In our case, this can cause severe out-of-memory
>>>>>> issues, which then cascade as node after node gets more and more
>>>>>> partitions that it cannot handle. We have on rare occasion seen our
>>>>>> entire cluster fail as a result, and then our production engineers
>>>>>> must manually - and carefully - bring the system back online. This is
>>>>>> undesirable.
>>>>>>
>>>>>> Does Helix have a rebalancing strategy that minimizes partition
>>>>>> movement yet also permits enforcement of a maximum number of
>>>>>> partitions per node?
>>>>>>
>>>>>> Thanks,
>>>>>> - Phong X. Nguyen
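
For reference, a minimal sketch of the kind of migration discussed above: pointing a resource at the WAGED rebalancer and capping partitions per instance through the Helix 1.0.x Java API. The ZooKeeper address, cluster name, resource name, the limit of 3, and the "PARTITION_COUNT" capacity key are all placeholders chosen for this sketch, and (as Jiajun notes) WAGED already honors MAX_PARTITIONS_PER_INSTANCE, so the explicit ClusterConfig capacity model is optional.

    import java.util.Collections;

    import org.apache.helix.ConfigAccessor;
    import org.apache.helix.controller.rebalancer.waged.WagedRebalancer;
    import org.apache.helix.manager.zk.ZKHelixAdmin;
    import org.apache.helix.model.ClusterConfig;
    import org.apache.helix.model.IdealState;

    public class WagedMigrationSketch {
      public static void main(String[] args) {
        String zkAddress = "localhost:2181";   // placeholder ZooKeeper address
        String clusterName = "MY_CLUSTER";     // placeholder cluster name
        String resourceName = "MY_RESOURCE";   // placeholder resource name

        ZKHelixAdmin admin = new ZKHelixAdmin(zkAddress);

        // Switch the resource from a CRUSH/CrushEd rebalance strategy to the
        // WAGED rebalancer, and cap how many partitions one instance may host
        // (the MAX_PARTITIONS_PER_INSTANCE constraint from the thread).
        IdealState idealState = admin.getResourceIdealState(clusterName, resourceName);
        idealState.setRebalanceMode(IdealState.RebalanceMode.FULL_AUTO);
        idealState.setRebalancerClassName(WagedRebalancer.class.getName());
        idealState.setMaxPartitionsPerInstance(3);  // placeholder limit
        admin.setResourceIdealState(clusterName, resourceName, idealState);

        // Optional: give WAGED an explicit capacity model so placement is
        // weight-aware. "PARTITION_COUNT" is an arbitrary key for this sketch.
        ConfigAccessor configAccessor = new ConfigAccessor(zkAddress);
        ClusterConfig clusterConfig = configAccessor.getClusterConfig(clusterName);
        clusterConfig.setInstanceCapacityKeys(Collections.singletonList("PARTITION_COUNT"));
        clusterConfig.setDefaultInstanceCapacityMap(Collections.singletonMap("PARTITION_COUNT", 3));
        clusterConfig.setDefaultPartitionWeightMap(Collections.singletonMap("PARTITION_COUNT", 1));
        configAccessor.setClusterConfig(clusterName, clusterConfig);

        admin.close();
      }
    }

Whether the cluster-level defaults or per-resource weights are the better fit depends on how uniform the partitions are; the wiki page Junkai linked above covers the WAGED configuration options in more detail.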
