> This hashcode is coming out same for different string combinations...

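As far as I understand, two different combinations producing the same 32-bit hash should only happen with vanishingly small probability. Different hashes can still land on the same task, though, once the hash is reduced modulo the number of tasks.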

Here is the hashCode implementation for String: 
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/lang/String.java#String.hashCode%28%29

Here is the Arrays code that combines the hashes of the individual Strings:
http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/util/Arrays.java#Arrays.deepHashCode%28java.lang.Object[]%29
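
To make the difference between "same hash" and "same task" concrete, here is a minimal sketch. It assumes the task is picked as the non-negative remainder of Arrays.deepHashCode over the field values divided by the number of tasks. The class and method names (HashVsTaskIndex, fieldHash, taskIndex) are made up for illustration and are not part of Storm.

```java
import java.util.Arrays;

class HashVsTaskIndex {
    // Combined hash of the field values: 31 * (31 * 1 + h1) + h2 for two Strings.
    static int fieldHash(String... values) {
        return Arrays.deepHashCode(values);
    }

    // Assumed mapping from hash to task; floorMod keeps the result in [0, numTasks).
    static int taskIndex(int hash, int numTasks) {
        return Math.floorMod(hash, numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 32;
        String[][] keys = {
            {"Aa", "x"}, {"BB", "x"},   // "Aa" and "BB" both hash to 2112, so these collide fully
            {"a", "b"}, {"a", "B"}      // hashes 4066 and 4034 differ, but both are 2 modulo 32
        };
        for (String[] key : keys) {
            int hash = fieldHash(key);
            System.out.println(Arrays.toString(key)
                + " hash=" + hash
                + " task=" + taskIndex(hash, numTasks));
        }
    }
}
```

The first pair shows a true hash collision, which is rare for arbitrary strings. The second pair shows the more common case: distinct hashes that still share a task because only the remainder matters.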



Would you share an example of different combinations of String field values 
that hash to the same hashcode value? 
-- 
Derek


________________________________
From: Kashyap Mhaisekar <[email protected]>
To: [email protected] 
Sent: Tuesday, September 29, 2015 6:04 PM
Subject: Re: Field Group Hash Computation



Thanks guys. From what I understand, partial key grouping is used when you know 
your grouping will create imbalance. In my case, most of my field groups go to one 
bolt, which makes it a bottleneck. Since I emit strings, I guess the 
hash is ArrayList(str1,str2...).hashCode(). This hashcode is coming out same 
for different string combinations...
Thanks
Kashyap
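
The guess above can be checked directly. For a list of plain Strings, List.hashCode() and Arrays.deepHashCode(list.toArray()) follow the same 31-based formula, so they should agree. The sketch below is only a check of that JDK behaviour; the class name ListHashCheck and the sample values are placeholders, not anything from Storm.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ListHashCheck {
    // True when the list's own hashCode matches deepHashCode over its elements.
    static boolean sameHash(List<String> values) {
        return values.hashCode() == Arrays.deepHashCode(values.toArray());
    }

    public static void main(String[] args) {
        List<String> values = new ArrayList<String>(Arrays.asList("str1", "str2"));
        System.out.println("List.hashCode()     = " + values.hashCode());
        System.out.println("Arrays.deepHashCode = " + Arrays.deepHashCode(values.toArray()));
        System.out.println("equal: " + sameHash(values));
    }
}
```

The two can differ when an element is itself an array, because deepHashCode descends into nested arrays while List.hashCode() does not.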


On Sep 29, 2015 17:51, "Matthias J. Sax" <[email protected]> wrote:

Whether you can use "partial key grouping" depends on your use case. Think
>carefully before you apply it...
>
>Maybe you want to read the research paper about it. It clearly describes
>when you can use it and when not:
>https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>
>
>-Matthias
>
>On 09/30/2015 12:18 AM, Ken Danniswara wrote:
>> Hi,
>>
>> From what I read, the default FieldGrouping does not balance the load the
>> way ShuffleGrouping does. For this case, there is a discussion about a
>> custom grouping implementation called partial key grouping, which
>> balances the load better. Maybe it
>> helps. https://github.com/gdfm/partial-key-grouping
>>
>> On Wed, Sep 30, 2015 at 12:11 AM, Kashyap Mhaisekar <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>     Thanks Derek. I use strings and I still end up with some bolts
>>     having the maximum requests :(
>>
>>     On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit <[email protected]
>>     <mailto:[email protected]>> wrote:
>>
>>         The code that hashes the field values is here:
>>
>>         
>> https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
>>
>>
>>         You can write a little java program, something like:
>>
>>         import java.util.ArrayList;
>>         import java.util.Arrays;
>>
>>         public class FieldHash {
>>           public static void main(String[] args) {
>>             ArrayList<String> myList = new ArrayList<String>();
>>             myList.add("first field value");
>>             myList.add("second field value");
>>
>>             // as in tuple.clj
>>             int hash = Arrays.deepHashCode(myList.toArray());
>>
>>             System.out.println("hash is " + hash);
>>             int numTasks = 32;
>>
>>             // hash may be negative; floorMod keeps the index in [0, numTasks)
>>             System.out.println("task index is " + Math.floorMod(hash, numTasks));
>>           }
>>         }
>>
>>
>>         There are certain types of values that may not hash
>>         consistently.  If you are using String values, then it should be
>>         fine. Other types may or may not, depending on how the class
>>         implements hashCode().
>>
>>
>>         --
>>         Derek
>>
>>
>>         ________________________________
>>         From: Kashyap Mhaisekar <[email protected]
>>         <mailto:[email protected]>>
>>         To: [email protected] <mailto:[email protected]>
>>         Sent: Tuesday, September 29, 2015 4:28 PM
>>         Subject: Field Group Hash Computation
>>
>>
>>
>>         Hi,
>>         I have a field grouping based on 2 fields. I have 32 consumers
>>         for the tuple and I see that most of the time, out of 64 bolts, the
>>         field group always lands on 8 of them. Of the 8, 2 receive more than
>>         60% of the data. The data for the field grouping can have 20
>>         different combinations.
>>
>>         Do you know how the hash of the grouping fields is
>>         computed? One of the group's mails indicates that the
>>         approach is:
>>
>>         It calls "hashCode" on the list of selected values and mods it
>>         by the
>>         number of consumer tasks. You can play around with that function
>>         to see if
>>         something about your data is causing something degenerative to
>>         happen and
>>         cause skew
>>
>>         I saw the Clojure code but I am not sure how to read it.
>>
>>         Thanks
>>         Kashyap
>>
>>
>>
>
>

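The partial key grouping idea mentioned in the thread can be sketched roughly as follows. Each key gets two candidate tasks, and the sender picks whichever of the two it has sent fewer tuples to so far. This is an illustration of the "two choices" idea only, not the code from the linked repository: the class name, the mix function used as a second hash, and the local load counter are all assumptions.

```java
import java.util.Arrays;
import java.util.List;

class PartialKeyGroupingSketch {
    // Tuples this sender has routed to each task (a local estimate, not global load).
    private final long[] sent;

    PartialKeyGroupingSketch(int numTasks) {
        this.sent = new long[numTasks];
    }

    // Pick the less loaded of the key's two candidate tasks and record the choice.
    int chooseTask(List<?> key) {
        int hash = Arrays.deepHashCode(key.toArray());
        int first = Math.floorMod(hash, sent.length);
        int second = Math.floorMod(mix(hash), sent.length);
        int chosen = sent[first] <= sent[second] ? first : second;
        sent[chosen]++;
        return chosen;
    }

    long[] loads() {
        return sent.clone();
    }

    // Hypothetical second hash: a simple bit mixer chosen for this sketch only.
    static int mix(int h) {
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return h;
    }

    public static void main(String[] args) {
        PartialKeyGroupingSketch grouping = new PartialKeyGroupingSketch(32);
        List<String> hotKey = Arrays.asList("first field value", "second field value");
        for (int i = 0; i < 1000; i++) {
            grouping.chooseTask(hotKey);
        }
        // A single hot key is spread over at most two tasks instead of one.
        System.out.println(Arrays.toString(grouping.loads()));
    }
}
```

Because the same key can now reach two tasks, this only fits bolts whose per-key state can be merged afterwards, which is the kind of restriction Matthias's warning refers to.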