Thanks Derek. Here is the code and the results.
When each string is added to an ArrayList and (hashCode % 64) is computed
on the list, many of the results come out the same. 64 is the number of
consumer tasks. The hash codes of the strings themselves are all different.
My emit looks like this:
collector.emit(new Values(str1,str2,str3)), where str3 is the field-grouped
field and takes the string values in "arr" in the program below.
---------------
package com.demo;

import java.util.ArrayList;

public class HashTest {
    public static void main(String[] args) {
        // Values of the field-grouped field (str3).
        String[] arr = { "0:499", "500:999", "1000:1499", "1500:1999",
                "2000:2499", "2500:2999", "3000:3499", "3500:3999",
                "4000:4499", "4500:4999", "5000:5499", "5500:5999",
                "6000:6499", "6500:6999", "7000:7499", "7500:7999",
                "8000:8499", "9500:9999" };
        int tasks = 64; // number of consumer tasks
        for (int i = 0; i < arr.length; i++) {
            // Wrap each value in a single-element list before hashing.
            ArrayList<String> arl = new ArrayList<String>();
            arl.add(arr[i]);
            // Java's % keeps the sign of the hash, so the result can be negative.
            System.out.println("Hash: " + arr[i] + " -- (hash): "
                    + (arl.hashCode() % tasks) + " -- (String's hashcode): "
                    + arr[i].hashCode());
        }
    }
}
Results:
Hash: 0:499 -- (hash): 41 -- (String's hashcode): 46108682
Hash: 500:999 -- (hash): 51 -- (String's hashcode): 1213367572
Hash: 1000:1499 -- (hash): 29 -- (String's hashcode): 464373438
Hash: 1500:1999 -- (hash): 61 -- (String's hashcode): 588495326
Hash: 2000:2499 -- (hash): -3 -- (String's hashcode): -1343051234
Hash: 2500:2999 -- (hash): -35 -- (String's hashcode): -1218929346
Hash: 3000:3499 -- (hash): 29 -- (String's hashcode): 1144491390
Hash: 3500:3999 -- (hash): 61 -- (String's hashcode): 1268613278
Hash: 4000:4499 -- (hash): -3 -- (String's hashcode): -662933282
Hash: 4500:4999 -- (hash): -35 -- (String's hashcode): -538811394
Hash: 5000:5499 -- (hash): 29 -- (String's hashcode): 1824609342
Hash: 5500:5999 -- (hash): 61 -- (String's hashcode): 1948731230
Hash: 6000:6499 -- (hash): 61 -- (String's hashcode): 17184670
Hash: 6500:6999 -- (hash): 29 -- (String's hashcode): 141306558
Hash: 7000:7499 -- (hash): -35 -- (String's hashcode): -1790240002
Hash: 7500:7999 -- (hash): -3 -- (String's hashcode): -1666118114
Hash: 8000:8499 -- (hash): 61 -- (String's hashcode): 697302622
Hash: 9500:9999 -- (hash): -3 -- (String's hashcode): -986000162
----------------------
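To make the skew in the results above easier to see, here is a minimal sketch that counts how many of the keys land on each task index. The class name FieldGroupingHashSketch and the helpers taskIndex and keysPerTask are hypothetical, introduced only for illustration. It assumes the grouped values are hashed with Arrays.deepHashCode (as in Derek's snippet) and that the hash is mapped to a non-negative index, which is why it uses Math.floorMod rather than %; whether Storm does exactly this is an assumption here, not something confirmed in this thread.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.TreeMap;

public class FieldGroupingHashSketch {

    // Hash a single grouped field value the way a one-element list would hash:
    // Arrays.deepHashCode on a one-element array yields 31 + key.hashCode().
    static int taskIndex(String key, int numTasks) {
        int hash = Arrays.deepHashCode(new Object[] { key });
        // floorMod is non-negative for a positive numTasks, unlike %.
        return Math.floorMod(hash, numTasks);
    }

    // Count how many keys map to each task index.
    static Map<Integer, Integer> keysPerTask(String[] keys, int numTasks) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(taskIndex(key, numTasks), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] keys = { "0:499", "500:999", "1000:1499", "1500:1999",
                "2000:2499", "2500:2999", "3000:3499", "3500:3999",
                "4000:4499", "4500:4999", "5000:5499", "5500:5999",
                "6000:6499", "6500:6999", "7000:7499", "7500:7999",
                "8000:8499", "9500:9999" };
        // Prints a map of task index -> number of keys routed to it.
        System.out.println(keysPerTask(keys, 64));
    }
}
```

Going by the results above, once the negative remainders are folded back into range (-3 becomes 61, -35 becomes 29), the 18 keys use only four task indices: 29, 41, 51 and 61. One likely contributor is that 31 * 31 = 961 = 15 * 64 + 1, so the powers of 31 used by String.hashCode() alternate between 31 and 1 modulo 64, and keys with similar digit patterns end up with very similar low six bits.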
Thanks
kashyap
On Wed, Sep 30, 2015 at 9:20 AM, Derek Dagit <[email protected]> wrote:
> > This hashcode is coming out same for different string combinations...
>
> As far as I understand, this can only happen with vanishingly small
> probability.
>
> Here is the hashCode implementation for String:
>
> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/lang/String.java#String.hashCode%28%29
>
> Here is the Arrays code that combines the hashes of the individual Strings:
>
> http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/7u40-b43/java/util/Arrays.java#Arrays.deepHashCode%28java.lang.Object[]%29
>
>
>
> Would you share an example of different combinations of String field
> values that hash to the same hashcode value?
> --
> Derek
>
>
> ________________________________
> From: Kashyap Mhaisekar <[email protected]>
> To: [email protected]
> Sent: Tuesday, September 29, 2015 6:04 PM
> Subject: Re: Field Group Hash Computation
>
>
>
> Thanks guys. From what I understand, partial key grouping is used when you
> know your grouping will create imbalance. In my case, most of my tuples are
> grouped to one bolt, which makes it a bottleneck. Since I emit strings, I
> guess the hash is computed as ArrayList(str1,str2...).hashCode(). This
> hashcode is coming out same for different string combinations...
> Thanks
> Kashyap
>
>
> On Sep 29, 2015 17:51, "Matthias J. Sax" <[email protected]> wrote:
>
> Whether you can use "partial key grouping" depends on your use case. Think
> >carefully before you apply it...
> >
> >Maybe you want to read the research paper about it. It clearly describes
> >when you can use it and when not:
> >
> https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
> >
> >
> >-Matthias
> >
> >On 09/30/2015 12:18 AM, Ken Danniswara wrote:
> >> Hi,
> >>
> >> From what I read, the default FieldGrouping does not balance the load
> >> the way ShuffleGrouping does. For this case, there is a discussion about
> >> a custom grouping implementation called partial key grouping, which
> >> balances the load better. Maybe it
> >> helps. https://github.com/gdfm/partial-key-grouping
> >>
> >> On Wed, Sep 30, 2015 at 12:11 AM, Kashyap Mhaisekar <
> [email protected]
> >> <mailto:[email protected]>> wrote:
> >>
> >> Thanks Derek. I use strings and I still end up with some bolts
> >> having the maximum requests :(
> >>
> >> On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit <[email protected]
> >> <mailto:[email protected]>> wrote:
> >>
> >> The code that hashes the field values is here:
> >>
> >>
> https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
> >>
> >>
> >> You can write a little java program, something like:
> >>
> >> public static void main(String[] args) {
> >> ArrayList<String> myList = new ArrayList<String>();
> >> myList.add("first field value");
> >> myList.add("second field value");
> >>
> >> int hash = Arrays.deepHashCode(myList.toArray()); // as in tuple.clj
> >>
> >>
> >> System.out.println("hash is "+hash);
> >> int numTasks = 32;
> >>
> >> System.out.println("task index is " + hash % numTasks);
> >>
> >> }
> >>
> >>
> >> There are certain types of values that may not hash
> >> consistently. If you are using String values, then it should be
> >> fine. Other types may or may not, depending on how the class
> >> implements hashCode().
> >>
> >>
> >> --
> >> Derek
> >>
> >>
> >> ________________________________
> >> From: Kashyap Mhaisekar <[email protected]
> >> <mailto:[email protected]>>
> >> To: [email protected] <mailto:[email protected]>
> >> Sent: Tuesday, September 29, 2015 4:28 PM
> >> Subject: Field Group Hash Computation
> >>
> >>
> >>
> >> Hi,
> >> I have a field grouping based on 2 fields. I have 32 consumers
> >> for the tuple and I see most of the times, out of 64 bolts, the
> >> field group is always on 8 of them. Of the 8, 2 have more than
> >> 60% of the data. The data for the field grouping can have 20
> >> different combinations.
> >>
> >> Do you know how the hash of the fields used for grouping is
> >> computed? One of the group's mails indicates that the
> >> approach is -
> >>
> >> It calls "hashCode" on the list of selected values and mods it
> >> by the
> >> number of consumer tasks. You can play around with that function
> >> to see if
> >> something about your data is causing something degenerative to
> >> happen and
> >> cause skew
> >>
> >> I saw the Clojure code but am not sure how to read it.
> >>
> >> Thanks
> >> Kashyap
> >>
> >>
> >>
> >
> >
>