Sorry for the late reply. Here are some general comments:

- The original behavior described is "by design" but probably handled too stringently. The idea is that each partition replica should have some "affinity" to nodes, so that if there are multiple node restart events, partition movement is minimized in the general case (i.e. the only movements occur when the node with affinity to the partition either enters or exits the cluster).
- It's a known issue that lexicographically earlier node names get the higher "remainder" capacities by default. There is a routine in the algorithm that allows "stealing" capacity, but that only comes into play when an unassigned replica cannot be assigned to any node with available capacity (i.e. all nodes with capacity are already serving a replica for that partition). As Kishore said, we can definitely do better. A good fix for this bug is to assign remainder capacities first based on existing assignments, along the lines of the sketch below.
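Roughly, the idea is something like this (plain Java; the class and method names are hypothetical, not actual Helix code):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not Helix code: compute per-node capacities so
// that the "remainder" units go to nodes that already hold replicas,
// rather than to lexicographically earlier node names.
public class RemainderCapacityAssigner {
  public static Map<String, Integer> computeCapacities(List<String> liveNodes,
      int totalReplicas, Set<String> nodesWithExistingAssignments) {
    int base = totalReplicas / liveNodes.size();
    int remainder = totalReplicas % liveNodes.size();

    // Order nodes that already hold replicas first, so a node that just
    // rejoined does not immediately pull a replica away from them.
    List<String> ordered = new ArrayList<>(liveNodes);
    ordered.sort((a, b) -> Boolean.compare(
        nodesWithExistingAssignments.contains(b),
        nodesWithExistingAssignments.contains(a)));

    Map<String, Integer> capacities = new HashMap<>();
    for (int i = 0; i < ordered.size(); i++) {
      // Every live node gets base or base + 1; they differ by at most 1.
      capacities.put(ordered.get(i), base + (i < remainder ? 1 : 0));
    }
    return capacities;
  }
}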
- There should never be a situation in which some live instances have capacity C and others have capacity D >= C+2 (or <= C-2). All live node capacities should differ by at most 1. If a node is not live, then yes, it has capacity 0, because it doesn't logically make sense for a non-live node to accept replicas. If any of that is not true, then there's a bug in the algorithm. (A small check for this invariant is sketched below.)
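If it helps, an invariant check would look something like this (again just a sketch against hypothetical inputs, not existing Helix code):

import java.util.Map;
import java.util.Set;

// Sketch of an invariant check: live capacities differ by at most 1,
// and non-live nodes have capacity 0.
public class CapacityInvariant {
  public static void check(Map<String, Integer> capacities, Set<String> liveNodes) {
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    for (Map.Entry<String, Integer> e : capacities.entrySet()) {
      if (liveNodes.contains(e.getKey())) {
        min = Math.min(min, e.getValue());
        max = Math.max(max, e.getValue());
      } else if (e.getValue() != 0) {
        throw new AssertionError("non-live node " + e.getKey()
            + " has capacity " + e.getValue());
      }
    }
    if (max - min > 1) {
      throw new AssertionError("live capacities differ by more than 1: "
          + "min=" + min + ", max=" + max);
    }
  }
}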
- You don't need to hard-code the number of instances, but it would probably help the algorithm if the new instances sort lexicographically after the existing ones. Alternatively, you can provide a different ReplicaPlacementScheme that handles default node affinity more appropriately in AutoRebalanceStrategy; a sketch follows.
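A custom scheme could look roughly like the following. Treat it as a sketch: double-check the exact ReplicaPlacementScheme signature and package against your Helix version, and note that the placement formula here is just illustrative:

import java.util.List;
import org.apache.helix.HelixManager;
import org.apache.helix.controller.strategy.AutoRebalanceStrategy.ReplicaPlacementScheme;

// Illustrative placement scheme: derives default affinity from the
// partition id instead of relying purely on node-name order.
public class HashedPlacementScheme implements ReplicaPlacementScheme {
  @Override
  public void init(HelixManager manager) {
    // This simple scheme needs no cluster state.
  }

  @Override
  public String getLocation(int partitionId, int replicaId, int numPartitions,
      int numReplicas, List<String> nodeNames) {
    // Offset each replica so copies of the same partition land on
    // different nodes.
    int index = (partitionId + replicaId * numPartitions) % nodeNames.size();
    return nodeNames.get(index);
  }
}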
- ZKHelixAdmin is definitely more restrictive than it needs to be; it was implemented well before AutoRebalanceStrategy was.

- See these links for contributing to Helix (either approach works for us, but submitting a pull request on GitHub is probably faster for you): https://cwiki.apache.org/confluence/display/HELIX/Contributor+Workflow and https://cwiki.apache.org/confluence/display/HELIX/Merging+Pull+Requests
- Kishore, I'm not sure what you're referring to regarding using LeaderStandby. For FULL_AUTO LeaderStandby, we definitely take the previous state assignment into account when computing a new assignment, and we won't initially change states if possible, but the algorithm does try to eventually settle on an assignment where states are evenly distributed when possible. That can mean extra state transitions later on in order to maintain that balance. (A quick setup sketch is below, for reference.)
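For reference, wiring up a FULL_AUTO LeaderStandby resource looks roughly like this (the ZooKeeper address and cluster/resource names are made up for illustration):

import org.apache.helix.HelixAdmin;
import org.apache.helix.manager.zk.ZKHelixAdmin;
import org.apache.helix.model.IdealState.RebalanceMode;

public class FullAutoLeaderStandbySetup {
  public static void main(String[] args) {
    // Hypothetical ZooKeeper address and names, for illustration only.
    HelixAdmin admin = new ZKHelixAdmin("localhost:2181");

    // One partition (e.g. a single lock), managed by LeaderStandby
    // in FULL_AUTO mode so Helix computes placement and states.
    admin.addResource("MyCluster", "MyLock", 1, "LeaderStandby",
        RebalanceMode.FULL_AUTO.toString());

    // Total replicas per partition: 1 leader + 2 standbys.
    admin.rebalance("MyCluster", "MyLock", 3);
  }
}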
________________________________
> Date: Sun, 9 Nov 2014 22:43:15 -0800
> Subject: Re: Auto-rebalancing question
> From: [email protected]
> To: [email protected]
>
> I will try this and get back to you.
>
> On Fri, Nov 7, 2014 at 8:21 AM, Tom Widmer <[email protected]> wrote:
> On 6 Nov 2014, at 15:27, kishore g <[email protected]> wrote:
>
> Thanks Tom. Good observation. The reason Helix moves back the partition
> is to maintain equal distribution of locks at all times; if we don't
> move it back, the node that came back up will be idle. This assumes the
> number of replicas is more than the number of nodes.
>
> I think I get this - if, say, all instances have a capacity of 2, then
> you might end up with some instances containing 2 and some 0, using the
> current rebalancing algorithm, which isn't what you want (idle node). I
> guess the algorithm would need tweaking to make sure that every node
> had either capacity or capacity-1 partitions, so that those 0's
> wouldn't be acceptable in that case and would have partitions moved
> from nodes with full capacity. I could possibly look at making this
> change for you? I'd need info on how to submit patches.
>
> For a single partition, or in general when numPartitions * numReplicas
> < the number of nodes, I agree that moving back is unnecessary. We can
> think about making the algorithm smarter.
>
> Same with the second case, I expected minimal movement. Your suggestion
> makes sense. Kanak, what do you think?
>
> For the single-partition use case, I think you can probably use the
> LeaderStandby model and set the number of replicas to be the number of
> nodes. In this case, I believe the leader will not move back when the
> old node comes back up. Kanak/Jason, I believe we made this change some
> time back. Correct me if I am wrong.
>
> I had a look at this option, but the problem is that I'd need to
> hard-code the number of instances, which I'd rather avoid. I guess it
> might work if I allocated a number larger than the expected number of
> nodes I'd ever have?
>
> I tried setting up a state machine with 'N' standby nodes, but
> ZKHelixAdmin.rebalance has some checks saying you can only have:
>
> * no more than 1 state with an upper bound of 1
> * no more than 1 state with an upper bound of R
> * no more than 1 state with an upper bound of N, in which case you
> can't have any other states with either R or 1 as their upper bound
> (which messes up my case, where I'd want 1 leader and (N-1) standbys,
> ideally)
>
> Are those checks definitely all necessary for full-auto mode?
>
> Any alternatives other than writing a user-defined rebalancer?
>
> Thanks,
>
> Tom