OK, it looks like you have a very unlucky number of tasks.

For each of your string values s, taking the Arrays#deepHashCode of the 
one-element List containing s gives integers that are very poorly distributed 
modulo 64.


user=> (def l '("0:499", "500:999", "1000:1499", "1500:1999",
"2000:2499", "2500:2999", "3000:3499", "3500:3999",
"4000:4499", "4500:4999", "5000:5499", "5500:5999",
"6000:6499", "6500:6999", "7000:7499", "7500:7999",
"8000:8499", "9500:9999" ))

user=> (def num-tasks 64)

user=> (defn f [^List l] (-> l t/list-hash-code (mod num-tasks)))

user=> (for [x l] (f (list x)))
(41 51 29 61 61 29 29 61 61 29 29 61 61 29 29 61 61 61)

Half of these (9 of 18) land on 61, 7 land on 29, and one each lands on 41 and 51.
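
The 61 and 29 here are the same buckets that show up as -3 and -35 in the Java program quoted below: Clojure's mod never returns a negative value for a positive divisor, while Java's % keeps the sign of the hash. A minimal sketch of the difference (the class and method names are made up for illustration):

```java
import java.util.Arrays;

public class ModDemo {
    // Java's % takes the sign of the dividend (the hash).
    static int remainderIndex(int hash, int tasks) {
        return hash % tasks;
    }

    // Math.floorMod takes the sign of the divisor, like Clojure's mod.
    static int floorIndex(int hash, int tasks) {
        return Math.floorMod(hash, tasks);
    }

    public static void main(String[] args) {
        // Hash of a one-element list holding one of the keys from this thread.
        int hash = Arrays.deepHashCode(new Object[] {"2000:2499"});
        System.out.println(remainderIndex(hash, 64)); // prints -3
        System.out.println(floorIndex(hash, 64));     // prints 61
    }
}
```

Math.floorMod needs Java 8 or later.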


If the number of tasks is 65 instead, a nearby value:

user=> (def num-tasks 65)

user=> (sort (for [x l] (f (list x))))
(1 7 8 14 14 20 21 27 28 34 39 47 52 53 54 58 59 61)

Only 14 occurs twice.
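
The same comparison can be sketched in plain Java, in case that is easier to run than the REPL session. This is only an approximation of what Storm does: it assumes the task index is the non-negative remainder of Arrays.deepHashCode over the grouped values, and the class and method names are made up.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class BucketSpread {
    static final String[] KEYS = {
        "0:499", "500:999", "1000:1499", "1500:1999",
        "2000:2499", "2500:2999", "3000:3499", "3500:3999",
        "4000:4499", "4500:4999", "5000:5499", "5500:5999",
        "6000:6499", "6500:6999", "7000:7499", "7500:7999",
        "8000:8499", "9500:9999"
    };

    // Count how many keys land on each task index for a given task count.
    static Map<Integer, Integer> spread(int numTasks) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String key : KEYS) {
            // One grouped field per tuple, so hash a one-element array.
            int hash = Arrays.deepHashCode(new Object[] {key});
            counts.merge(Math.floorMod(hash, numTasks), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        for (int n : new int[] {64, 65}) {
            Map<Integer, Integer> counts = spread(n);
            System.out.println(n + " tasks: " + counts.size()
                + " distinct indexes " + counts);
        }
    }
}
```

With 64 tasks the 18 keys use only 4 task indexes; with 65 they use 17.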



It seems your number of tasks is an unlucky modulo divisor.
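
One plausible reason 64 is unlucky for these keys, offered as a sketch rather than something checked against Storm itself: String.hashCode() multiplies by 31 per character, and 31 * 31 = 961 = 15 * 64 + 1, so modulo 64 the powers of 31 just alternate between 1 and 31. The bucket then depends only on two character sums. Going from "1000:1499" to "3000:3499" adds 2 at a position weighted 1 and 2 at a position weighted 31, and 2 + 62 = 64, so the change cancels. The helper name below is hypothetical.

```java
public class LowBits {
    // Reduce String.hashCode() modulo 64 using only two character sums.
    // Since 31 * 31 = 961 = 15 * 64 + 1, the weight of each character is
    // 1 (even power of 31) or 31 (odd power of 31) modulo 64.
    static int hashMod64(String s) {
        int even = 0; // characters multiplied by an even power of 31
        int odd = 0;  // characters multiplied by an odd power of 31
        for (int i = 0; i < s.length(); i++) {
            int power = s.length() - 1 - i;
            if (power % 2 == 0) {
                even += s.charAt(i);
            } else {
                odd += s.charAt(i);
            }
        }
        return Math.floorMod(even + 31 * odd, 64);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"1000:1499", "3000:3499", "5000:5499"}) {
            // Both columns should agree for every string.
            System.out.println(s + " " + hashMod64(s) + " "
                + Math.floorMod(s.hashCode(), 64));
        }
    }
}
```

A task count that is not a power of two does not have this property, which would fit the much better spread seen with 65.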

 
-- 
Derek


________________________________
 From: Kashyap Mhaisekar <[email protected]>
To: [email protected]; Derek Dagit <[email protected]> 
Sent: Wednesday, September 30, 2015 10:18 AM
Subject: Re: Field Group Hash Computation
 


Thanks Derek. Here is the code and the results.
When each string is added to an ArrayList and (hashCode % 64) is computed, 
many of the results come out the same. 64 is the number of consumer tasks. The 
hash codes of the strings by themselves are all different.

My bolt emits as -
collector.emit(new Values(str1,str2,str3)) where str3 is the field-grouped 
value and takes the string values in "arr" in the program below.

---------------
package com.demo;

import java.util.ArrayList;

public class HashTest {

    public static void main(String[] args) {

        String[] arr = { "0:499", "500:999", "1000:1499", "1500:1999",
                "2000:2499", "2500:2999", "3000:3499", "3500:3999",
                "4000:4499", "4500:4999", "5000:5499", "5500:5999",
                "6000:6499", "6500:6999", "7000:7499", "7500:7999",
                "8000:8499", "9500:9999" };

        int tasks = 64; // number of consumer tasks
        for (int i = 0; i < arr.length; i++) {
            // Wrap each value in a one-element list before hashing.
            ArrayList<String> arl = new ArrayList<String>();
            arl.add(arr[i]);

            // Note: % yields a negative result when hashCode() is negative.
            System.out.println("Hash: " + arr[i] + " -- (hash): "
                    + (arl.hashCode() % tasks) + " -- (String's hashcode): "
                    + arr[i].hashCode());
        }
    }
}

Results:
Hash: 0:499 -- (hash): 41 -- (String's hashcode): 46108682
Hash: 500:999 -- (hash): 51 -- (String's hashcode): 1213367572
Hash: 1000:1499 -- (hash): 29 -- (String's hashcode): 464373438
Hash: 1500:1999 -- (hash): 61 -- (String's hashcode): 588495326
Hash: 2000:2499 -- (hash): -3 -- (String's hashcode): -1343051234
Hash: 2500:2999 -- (hash): -35 -- (String's hashcode): -1218929346
Hash: 3000:3499 -- (hash): 29 -- (String's hashcode): 1144491390
Hash: 3500:3999 -- (hash): 61 -- (String's hashcode): 1268613278
Hash: 4000:4499 -- (hash): -3 -- (String's hashcode): -662933282
Hash: 4500:4999 -- (hash): -35 -- (String's hashcode): -538811394
Hash: 5000:5499 -- (hash): 29 -- (String's hashcode): 1824609342
Hash: 5500:5999 -- (hash): 61 -- (String's hashcode): 1948731230
Hash: 6000:6499 -- (hash): 61 -- (String's hashcode): 17184670
Hash: 6500:6999 -- (hash): 29 -- (String's hashcode): 141306558
Hash: 7000:7499 -- (hash): -35 -- (String's hashcode): -1790240002
Hash: 7500:7999 -- (hash): -3 -- (String's hashcode): -1666118114
Hash: 8000:8499 -- (hash): 61 -- (String's hashcode): 697302622
Hash: 9500:9999 -- (hash): -3 -- (String's hashcode): -986000162

----------------------

Thanks
kashyap




On Wed, Sep 30, 2015 at 9:20 AM, Derek Dagit <[email protected]> wrote:

> This hashcode is coming out same for different string combinations...
>
>As far as I understand, this can only happen with vanishingly small 
>probability.
>
>Here is the hashCode implementation for String:
>http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/lang/String.java#String.hashCode%28%29
>
>Here is the Arrays code that combines the hashes of the individual Strings:
>http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/util/Arrays.java#Arrays.deepHashCode%28java.lang.Object[]%29
>
>
>
>Would you share an example of different combinations of String field values 
>that hash to the same hashcode value?
>--
>Derek
>
>
>________________________________
>From: Kashyap Mhaisekar <[email protected]>
>To: [email protected]
>Sent: Tuesday, September 29, 2015 6:04 PM
>Subject: Re: Field Group Hash Computation
>
>
>
>
>Thanks guys. From what I understand, partial key grouping is used when you 
>know your grouping will create imbalance. In my case, most of my field groups 
>go to one bolt, making it a bottleneck. Since I emit strings, I 
>guess the hash is ArrayList(str1,str2...).hashCode(). This hashcode is 
>coming out same for different string combinations...
>Thanks
>Kashyap
>
>
>On Sep 29, 2015 17:51, "Matthias J. Sax" <[email protected]> wrote:
>
>>Whether you can use "partial key grouping" depends on your use case. Think
>>carefully before you apply it...
>>
>>Maybe you want to read the research paper about it. It clearly describes
>>when you can use it and when not:
>>https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>>
>>
>>-Matthias
>>
>>On 09/30/2015 12:18 AM, Ken Danniswara wrote:
>>> Hi,
>>>
>>> From what I read, the default fields grouping does not balance the load
>>> the way shuffle grouping does. There is a discussion of a custom
>>> grouping implementation called partial key grouping, which balances
>>> the load better. Maybe it helps:
>>> https://github.com/gdfm/partial-key-grouping
>>>
>>> On Wed, Sep 30, 2015 at 12:11 AM, Kashyap Mhaisekar <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>>     Thanks Derek. I use strings and I still end up with some bolts
>>>     having the maximum requests :(
>>>
>>>     On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit <[email protected]
>>>     <mailto:[email protected]>> wrote:
>>>
>>>         The code that hashes the field values is here:
>>>
>>>         
>>> https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
>>>
>>>
>>>         You can write a little java program, something like:
>>>
>>>         // needs: import java.util.ArrayList; import java.util.Arrays;
>>>         public static void main(String[] args) {
>>>           ArrayList<String> myList = new ArrayList<String>();
>>>           myList.add("first field value");
>>>           myList.add("second field value");
>>>
>>>           // as in tuple.clj
>>>           int hash = Arrays.deepHashCode(myList.toArray());
>>>
>>>           System.out.println("hash is " + hash);
>>>           int numTasks = 32;
>>>
>>>           // floorMod keeps the index non-negative, as Clojure's mod does
>>>           System.out.println("task index is "
>>>               + Math.floorMod(hash, numTasks));
>>>         }
>>>
>>>
>>>         There are certain types of values that may not hash
>>>         consistently.  If you are using String values, then it should be
>>>         fine. Other types may or may not, depending on how the class
>>>         implements hashCode().
>>>
>>>
>>>         --
>>>         Derek
>>>
>>>
>>>         ________________________________
>>>         From: Kashyap Mhaisekar <[email protected]
>>>         <mailto:[email protected]>>
>>>         To: [email protected] <mailto:[email protected]>
>>>         Sent: Tuesday, September 29, 2015 4:28 PM
>>>         Subject: Field Group Hash Computation
>>>
>>>
>>>
>>>         Hi,
>>>         I have a field grouping based on 2 fields. I have 32 consumers
>>>         for the tuple and I see that most of the time, out of 64 bolts,
>>>         the field group always lands on 8 of them. Of those 8, 2 receive
>>>         more than 60% of the data. The data for the field grouping can
>>>         have 20 different combinations.
>>>
>>>         Do you know how the hash of the grouping fields is computed?
>>>         One of the group's mails indicates that the approach is -
>>>
>>>         It calls "hashCode" on the list of selected values and mods it
>>>         by the number of consumer tasks. You can play around with that
>>>         function to see if something about your data is causing
>>>         something degenerative to happen and cause skew
>>>
>>>         I saw the Clojure code but am not sure how to read it.
>>>
>>>         Thanks
>>>         Kashyap
>>>
>>>
>>>
>>
>>
>
