Okay, thanks for the lead. Will try this and report back.
On Friday, March 25, 2016, kishore g <[email protected]> wrote:
> So computeOrphaned is the one that's causing the behavior.
>
> In the beginning, when nothing is assigned, all replicas are considered
> orphans. Once they are considered orphans, they get assigned to a random
> node (this overrides everything that's computed by the placement scheme).
>
> I think the logic in computeOrphaned is broken: a replica should be
> treated as an orphan if its preferred node is not part of the live node list.
>
> Try this in computeOrphaned. Note that the test case might fail because of
> this change, and you might have to update it according to the new
> behavior. I think it would be good to gate this behavior behind a cluster
> config parameter (a rough sketch of that follows the method below).
>
> private Set<Replica> computeOrphaned() {
>   Set<Replica> orphanedPartitions = new TreeSet<Replica>();
>   // A replica is orphaned only if its preferred node is not in the live node list.
>   for (Entry<Replica, Node> entry : _preferredAssignment.entrySet()) {
>     if (!_liveNodesList.contains(entry.getValue())) {
>       orphanedPartitions.add(entry.getKey());
>     }
>   }
>   // Replicas that already have an assignment (preferred or not) are not orphans.
>   for (Replica r : _existingPreferredAssignment.keySet()) {
>     if (orphanedPartitions.contains(r)) {
>       orphanedPartitions.remove(r);
>     }
>   }
>   for (Replica r : _existingNonPreferredAssignment.keySet()) {
>     if (orphanedPartitions.contains(r)) {
>       orphanedPartitions.remove(r);
>     }
>   }
>
>   return orphanedPartitions;
> }
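>
> (A minimal sketch of the config-gating idea -- the flag name and helper
> method names below are made up for illustration, not existing Helix code:)
>
> // Hedged sketch: keep today's behavior by default, and only apply the
> // stricter orphan check when the cluster opts in via a config parameter.
> private static final String STRICT_ORPHAN_FLAG = "rebalance.orphanOnlyIfPreferredNodeDown";
>
> // _strictOrphanCheck would be read once from the cluster config (keyed by
> // STRICT_ORPHAN_FLAG) when the strategy is constructed.
> private Set<Replica> computeOrphaned() {
>   return _strictOrphanCheck
>       ? computeOrphanedFromLiveNodes()  // the new logic shown above
>       : computeOrphanedLegacy();        // current behavior, so existing tests keep passing
> }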
>
> On Fri, Mar 25, 2016 at 8:41 AM, Vinoth Chandar <[email protected]> wrote:
>
>> Here you go
>>
>> https://gist.github.com/vinothchandar/18feedfa84650e3efdc0
>>
>>
>> On Fri, Mar 25, 2016 at 8:32 AM, kishore g <[email protected]> wrote:
>>
>>> Can you point me to your code. fork/patch?
>>>
>>> On Fri, Mar 25, 2016 at 5:26 AM, Vinoth Chandar <[email protected]> wrote:
>>>
>>>> Hi Kishore,
>>>>
>>>> Printed out more information and trimmed the test down to 1 resource
>>>> with 2 partitions, and I bring up 8 servers in parallel.
>>>>
>>>> Below is the paste of my logging output + annotations.
>>>>
>>>> >>> Computing partition assignment
>>>> >>>> NodeShift for countLog-2a 0 is 5, index 5
>>>> >>>> NodeShift for countLog-2a 1 is 5, index 6
>>>>
>>>> VC: So this part seems fine. We pick nodes at index 5 & 6 instead of 0, 1.
>>>>
>>>> >>>> Preferred Assignment: {countLog-2a_0|0=##########
>>>> name=localhost-server-6
>>>> preferred:0
>>>> nonpreferred:0, countLog-2a_1|0=##########
>>>> name=localhost-server-7
>>>> preferred:0
>>>> nonpreferred:0}
>>>>
>>>> VC: This translates to server-6/server-7 (since I named them starting from 1).
>>>>
>>>> >>>> Existing Preferred Assignment: {}
>>>> >>>> Existing Non Preferred Assignment: {}
>>>> >>>> Orphaned: [countLog-2a_0|0, countLog-2a_1|0]
>>>> >>> Final State Map :{0=ONLINE}
>>>> >>>> Final ZK record : countLog-2a,
>>>> {}{countLog-2a_0={localhost-server-1=ONLINE},
>>>> countLog-2a_1={localhost-server-1=ONLINE}}{countLog-2a_0=[localhost-server-1],
>>>> countLog-2a_1=[localhost-server-1]}
>>>>
>>>> VC: But the final effect still seems to be assigning the partitions to
>>>> servers 1 & 2 (first two).
>>>>
>>>> Any ideas on where to start poking?
>>>>
>>>>
>>>> Thanks
>>>> Vinoth
>>>>
>>>> On Tue, Mar 15, 2016 at 5:52 PM, Vinoth Chandar <[email protected]> wrote:
>>>>
>>>>> Hi Kishore,
>>>>>
>>>>> I think the changes I made are exercised when computing the preferred
>>>>> assignment, but later, when the reconciliation with the existing
>>>>> assignment/orphaned partitions happens, I think they do not take effect.
>>>>>
>>>>> The effective assignment I saw was that all partitions (2 per resource)
>>>>> were assigned to the first 2 servers. I started to dig into the
>>>>> above-mentioned parts of the code; I will report back tomorrow when I
>>>>> pick this back up.
>>>>>
>>>>> Thanks,
>>>>> Vinoth
>>>>>
>>>>> _____________________________
>>>>> From: kishore g <[email protected]>
>>>>> Sent: Tuesday, March 15, 2016 2:01 PM
>>>>> Subject: Re: Balancing out skews in FULL_AUTO mode with built-in
>>>>> rebalancer
>>>>> To: <[email protected]>
>>>>>
>>>>>
>>>>>
>>>>> 1) I am guessing it gets overridden by other logic in
>>>>> computePartitionAssignment(..); the end assignment is still skewed.
>>>>>
>>>>> What is the logic you are referring to?
>>>>>
>>>>> Can you print the assignment count for your use case?
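>>>>>
>>>>> Something like this should print it (rough sketch using HelixAdmin /
>>>>> IdealState; the ZK address and cluster name below are placeholders):
>>>>>
>>>>> // Rough sketch: count how many partition replicas land on each instance
>>>>> // by walking the ideal state of every resource in the cluster.
>>>>> HelixAdmin admin = new ZKHelixAdmin("localhost:2181"); // placeholder ZK address
>>>>> Map<String, Integer> countPerInstance = new TreeMap<String, Integer>();
>>>>> for (String resource : admin.getResourcesInCluster("MYCLUSTER")) { // placeholder cluster
>>>>>   IdealState idealState = admin.getResourceIdealState("MYCLUSTER", resource);
>>>>>   for (String partition : idealState.getPartitionSet()) {
>>>>>     Map<String, String> stateMap = idealState.getInstanceStateMap(partition);
>>>>>     if (stateMap == null) {
>>>>>       continue; // map fields may be empty before the controller has run
>>>>>     }
>>>>>     for (String instance : stateMap.keySet()) {
>>>>>       Integer current = countPerInstance.get(instance);
>>>>>       countPerInstance.put(instance, current == null ? 1 : current + 1);
>>>>>     }
>>>>>   }
>>>>> }
>>>>> System.out.println(countPerInstance); // instance -> replica count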
>>>>>
>>>>>
>>>>> thanks,
>>>>> Kishore G
>>>>>
>>>>> On Tue, Mar 15, 2016 at 1:45 PM, Vinoth Chandar <[email protected]> wrote:
>>>>>
>>>>>> Hi guys,
>>>>>>
>>>>>> We are hitting a fairly well-known issue: we have 100s of resources
>>>>>> with < 8 partitions each, spread across 10 servers, and the built-in
>>>>>> assignment always assigns partitions to the nodes from first to last,
>>>>>> resulting in heavy skew on a few nodes.
>>>>>>
>>>>>> Chatted with Kishore offline and made a patch, available here
>>>>>> <https://gist.github.com/vinothchandar/e8837df301501f85e257>. Tested
>>>>>> with 5 resources with 2 partitions each across 8 servers; logging out the
>>>>>> nodeShift & ultimate index picked does indicate that we choose servers
>>>>>> other than the first two, which is good.
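>>>>>>
>>>>>> (Roughly, the idea in the patch is the sketch below -- variable names are
>>>>>> simplified here, and the actual patch uses murmur hash rather than
>>>>>> hashCode():)
>>>>>>
>>>>>> // Rough sketch of the per-resource node shift: hash the resource name so
>>>>>> // that small resources start at different nodes instead of all piling
>>>>>> // onto node 0.
>>>>>> int numNodes = liveNodes.size();
>>>>>> int nodeShift = (resourceName.hashCode() & 0x7fffffff) % numNodes; // the log above showed shift 5 for countLog-2a
>>>>>> for (int replicaIdx = 0; replicaIdx < partitions.size(); replicaIdx++) {
>>>>>>   int nodeIdx = (nodeShift + replicaIdx) % numNodes; // shift 5 -> indexes 5, 6 for 2 partitions
>>>>>>   preferredAssignment.put(partitions.get(replicaIdx), liveNodes.get(nodeIdx));
>>>>>> }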
>>>>>>
>>>>>> But:
>>>>>> 1) I am guessing it gets overridden by other logic in
>>>>>> computePartitionAssignment(..); the end assignment is still skewed.
>>>>>> 2) Even with murmur hash, there is some skew in the nodeShift, which
>>>>>> needs to be ironed out.
>>>>>>
>>>>>> I will keep chipping away at this. Any feedback is appreciated.
>>>>>>
>>>>>> Thanks
>>>>>> Vinoth
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>