wow, you should really stop using Microsoft :-)

On Sun, Nov 9, 2014 at 11:09 PM, Kanak Biscuitwala <[email protected]> wrote:
> I'm not sure why Microsoft is so bad at formatting email. Let me try that
> again:
>
> Sorry for the late reply. Here are some general comments:
>
> - The original behavior described is "by design" but probably handled too
>   stringently. The idea is that each partition replica should have some
>   "affinity" to nodes, so that if there are multiple node restart events, in
>   the general case the partition movement is minimized (i.e. the only
>   movements occur when the node with affinity to the partition either enters
>   or exits the cluster).
>
> - It's a known issue that lexicographically earlier node names will by
>   default have the higher "remainder" capacities. There is a routine in the
>   algorithm that allows "stealing" capacities, but that only comes into play
>   when an unassigned replica cannot be assigned to any node with available
>   capacity (i.e. all nodes with capacity are already serving a replica for
>   that partition). As Kishore said, we can definitely do better. A good fix
>   to this bug is to assign remainder capacities first based on existing
>   assignments.
>
> - There should never be a situation in which some live instances have
>   capacity C and others have capacity D >= C+2 (or <= C-2). All live node
>   capacities should differ by at most 1. If a node is not live, then yes it
>   has capacity 0 because it doesn't logically make sense for a non-live node
>   to accept replicas. If any of that is not true, then there's a bug in the
>   algorithm.
>
> - You don't need to hard-code the number of instances, but it would
>   probably help the algorithm if the new instances are lexicographically
>   after the existing ones. You can alternatively provide a different
>   ReplicaPlacementScheme which more appropriately handles default node
>   affinity in AutoRebalanceStrategy.
>
> - ZKHelixAdmin is definitely more restrictive than it needs to be; it was
>   implemented well before AutoRebalanceStrategy was.
>
> - See these links for contributing to Helix (either approach works for us,
>   but submitting a pull request on GitHub is probably faster for you):
>   https://cwiki.apache.org/confluence/display/HELIX/Contributor+Workflow
>   and
>   https://cwiki.apache.org/confluence/display/HELIX/Merging+Pull+Requests
>
> - Kishore, I'm not sure what you're referring to regarding using
>   LeaderStandby. For FULL_AUTO LeaderStandby, we will definitely take
>   previous state assignment into account when computing a new assignment, and
>   won't initially change states if possible, but the algorithm does try to
>   eventually settle on an assignment where states are evenly distributed when
>   possible. This could mean extra state transitions that occur eventually in
>   order to maintain that balance.
>
> From: [email protected]
> To: [email protected]
> Subject: RE: Auto-rebalancing question
> Date: Sun, 9 Nov 2014 23:06:59 -0800
>
> ________________________________
> > Date: Sun, 9 Nov 2014 22:43:15 -0800
> > Subject: Re: Auto-rebalancing question
> > From: [email protected]
> > To: [email protected]
> >
> > I will try this and get back to you.
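
The capacity rule in Kanak's reply (live node capacities differ by at most 1, non-live nodes get 0, and the "remainder" slots go to lexicographically earlier names by default) can be illustrated with a small sketch. This is not Helix code: `assign_capacities` and its `current_load` parameter are hypothetical names, and reading "based on existing assignments" as "nodes currently holding the most replicas get the remainder first" is an assumption about the proposed fix.

```python
def assign_capacities(live_nodes, total_replicas, current_load=None):
    """Hypothetical helper: split total_replicas across live nodes.

    Capacities of live nodes differ by at most 1. Nodes that are not
    live are simply absent from the result (capacity 0).
    """
    nodes = sorted(live_nodes)
    if not nodes:
        return {}
    base, remainder = divmod(total_replicas, len(nodes))
    if current_load is None:
        # Default described in the thread: lexicographically earlier
        # node names receive the extra "remainder" slots.
        preferred = nodes
    else:
        # Assumed fix: nodes already serving the most replicas receive
        # the remainder slots first; ties fall back to name order.
        preferred = sorted(nodes, key=lambda n: (-current_load.get(n, 0), n))
    extra = set(preferred[:remainder])
    return {n: base + (1 if n in extra else 0) for n in nodes}


if __name__ == "__main__":
    nodes = ["node_b", "node_a", "node_c"]
    print(assign_capacities(nodes, 7))
    print(assign_capacities(nodes, 7, {"node_c": 3, "node_a": 2, "node_b": 2}))
```

With 7 replicas on 3 nodes, the default gives the extra slot to `node_a`, so a replica already on `node_c` would have to move; ordering by existing load keeps it in place.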
> >
> > On Fri, Nov 7, 2014 at 8:21 AM, Tom Widmer <[email protected]> wrote:
> > On 6 Nov 2014, at 15:27, kishore g <[email protected]> wrote:
> >
> > Thanks Tom. Good observation. The reason Helix moves the partition back
> > is to maintain equal distribution of locks at all times; if we don't
> > move it back, the node that came back up will be idle. This assumes the
> > number of replicas is more than the number of nodes.
> >
> > I think I get this - if, say, all instances have a capacity of 2, then
> > you might end up with some instances containing 2 and some 0, using the
> > current rebalancing algorithm, which isn’t what you want (idle node). I
> > guess the algorithm would need tweaking to make sure that every node
> > had either capacity or capacity-1 partitions, so that those 0’s
> > wouldn’t be acceptable in that case and would have partitions moved
> > from nodes with full capacity. I could possibly look at making this
> > change for you? I’d need info on how to submit patches.
> >
> > For a single partition, or in general when numPartitions * numReplicas
> > < number of nodes, I agree that moving back is unnecessary. We can
> > think about making the algorithm smarter.
> >
> > Same with the second case: I expected minimum movement. Your suggestion
> > makes sense. Kanak, what do you think?
> >
> > For the single partition use case, I think you can probably use the
> > LeaderStandby model and set the number of replicas to be the number of
> > nodes. In this case, I believe the leader will not move back when the
> > old node comes back up. Kanak/Jason, I believe we made this change some
> > time back. Correct me if I am wrong.
> >
> > I had a look at this option, but the problem is that I’d need to
> > hard-code the number of instances, which I’d rather avoid. I guess it
> > might work if I allocated a number larger than the expected number of
> > nodes I’d ever have?
> >
> > I tried setting up a state machine with ’N’ standby nodes, but
> > ZKHelixAdmin.rebalance has some checks saying you can only have:
> >
> > * no more than 1 state with an upper bound of 1
> > * no more than 1 state with an upper bound of R
> > * no more than 1 state with an upper bound of N, in which case you
> >   can’t have any other states with either R or 1 as their upper bound
> >   (which messes up my case, where I’d want 1 leader and (N-1) standbys,
> >   ideally)
> >
> > Are those checks definitely all necessary for full-auto mode?
> >
> > Any alternatives other than writing a user-defined rebalancer?
> >
> > Thanks,
> >
> > Tom
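
The three upper-bound restrictions Tom lists can be written out as a small validator to show why a "1 leader plus N standbys" model is rejected. This is only a sketch of the rules as he describes them, not the actual ZKHelixAdmin.rebalance code; `check_upper_bounds` and the state names in the examples are hypothetical.

```python
from collections import Counter


def check_upper_bounds(bounds):
    """Hypothetical validator for the restrictions described in the thread.

    bounds maps a state name to its upper bound as a string, e.g. "1",
    "R" (replica count) or "N" (node count). Other values are ignored.
    Returns a list of human-readable violations (empty if none).
    """
    counts = Counter(bounds.values())
    errors = []
    for bound in ("1", "R", "N"):
        if counts[bound] > 1:
            errors.append(f"more than one state with upper bound {bound}")
    # A state bounded by N excludes any state bounded by 1 or R.
    if counts["N"] >= 1 and (counts["1"] or counts["R"]):
        errors.append("upper bound N cannot be combined with 1 or R")
    return errors


if __name__ == "__main__":
    # One leader, replica-count standbys: passes the described checks.
    print(check_upper_bounds({"LEADER": "1", "STANDBY": "R"}))
    # Tom's desired model: one leader, standbys on all other nodes.
    print(check_upper_bounds({"LEADER": "1", "STANDBY": "N"}))
```

The second call reports a violation, which matches the case Tom says the checks "mess up".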
