Is this the right computation for the hash: ArrayList(str1, str2, ...).hashCode(), where str1, str2, etc. are the fields being grouped?
Thanks
Kashyap

On Sep 29, 2015 18:04, "Kashyap Mhaisekar" <[email protected]> wrote:

> Thanks guys. From what I understand, partial key grouping is used when you
> know your grouping will create imbalance. In my case, most of my field
> groups go to one bolt, thereby causing it to be a bottleneck. Since I emit
> strings, I guess the hash is on ArrayList(str1, str2, ...).hashCode(). This
> hash code is coming out the same for different string combinations...
>
> Thanks
> Kashyap
>
> On Sep 29, 2015 17:51, "Matthias J. Sax" <[email protected]> wrote:
>
>> Whether you can use "partial key grouping" depends on your use case. Think
>> carefully before you apply it...
>>
>> Maybe you want to read the research paper about it. It clearly describes
>> when you can use it and when not:
>>
>> https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>>
>> -Matthias
>>
>> On 09/30/2015 12:18 AM, Ken Danniswara wrote:
>> > Hi,
>> >
>> > From what I read, the default FieldGrouping does not balance the load
>> > like ShuffleGrouping does. In this case, there is a discussion about a
>> > custom grouping implementation called partial key grouping, which
>> > balances better. Maybe it helps.
>> > https://github.com/gdfm/partial-key-grouping
>> >
>> > On Wed, Sep 30, 2015 at 12:11 AM, Kashyap Mhaisekar
>> > <[email protected]> wrote:
>> >
>> >     Thanks Derek. I use strings and I still end up with some bolts
>> >     having the maximum requests :(
>> >
>> >     On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit <[email protected]>
>> >     wrote:
>> >
>> >         The code that hashes the field values is here:
>> >
>> >         https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
>> >
>> >         You can write a little java program, something like:
>> >
>> >         public static void main(String[] args) {
>> >             ArrayList<String> myList = new ArrayList<String>();
>> >             myList.add("first field value");
>> >             myList.add("second field value");
>> >
>> >             int hash = Arrays.deepHashCode(myList.toArray()); // as in tuple.clj
>> >
>> >             System.out.println("hash is " + hash);
>> >             int numTasks = 32;
>> >
>> >             // floorMod avoids a negative index when hash is negative
>> >             System.out.println("task index is " + Math.floorMod(hash, numTasks));
>> >         }
>> >
>> >         There are certain types of values that may not hash
>> >         consistently. If you are using String values, then it should be
>> >         fine. Other types may or may not, depending on how the class
>> >         implements hashCode().
>> >
>> >         --
>> >         Derek
>> >
>> >         ________________________________
>> >         From: Kashyap Mhaisekar <[email protected]>
>> >         To: [email protected]
>> >         Sent: Tuesday, September 29, 2015 4:28 PM
>> >         Subject: Field Group Hash Computation
>> >
>> >         Hi,
>> >         I have a field grouping based on 2 fields. I have 32 consumers
>> >         for the tuple and I see most of the times, out of 64 bolts, the
>> >         field group is always on 8 of them. Of the 8, 2 have more than
>> >         60% of the data. The data for the field grouping can have 20
>> >         different combinations.
>> >
>> >         Do you know how the hash of the fields used for grouping is
>> >         computed? One of the group's mails indicates that the
>> >         approach is -
>> >
>> >         It calls "hashCode" on the list of selected values and mods it
>> >         by the number of consumer tasks. You can play around with that
>> >         function to see if something about your data is causing
>> >         something degenerative to happen and cause skew
>> >
>> >         I saw the Clojure code but am not sure how to understand it.
>> >
>> >         Thanks
>> >         Kashyap
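
For anyone who wants to check this against their own data, here is a minimal sketch of the computation discussed in the thread. It assumes the grouped fields are plain Strings; the class name FieldsGroupingHash, the helper names and the sample field values are made up for illustration and are not part of Storm. For String elements, Arrays.deepHashCode(list.toArray()) and list.hashCode() use the same formula, so the two should agree; they can differ when a field is itself an array.

```java
import java.util.Arrays;
import java.util.List;

class FieldsGroupingHash {
    // Hash of the grouped field values, as Derek's snippet does it.
    static int listHashCode(List<?> values) {
        return Arrays.deepHashCode(values.toArray());
    }

    // Math.floorMod keeps the index in [0, numTasks) even for a negative
    // hash; Java's % operator could return a negative remainder.
    static int taskIndex(List<?> values, int numTasks) {
        return Math.floorMod(listHashCode(values), numTasks);
    }

    public static void main(String[] args) {
        int numTasks = 32;
        int[] counts = new int[numTasks];

        // Hypothetical field values: 4 x 5 = 20 combinations.
        String[] first = {"north", "south", "east", "west"};
        String[] second = {"a", "b", "c", "d", "e"};

        for (String f : first) {
            for (String s : second) {
                List<String> key = Arrays.asList(f, s);
                counts[taskIndex(key, numTasks)]++;
            }
        }

        // Show how many of the combinations land on each task.
        for (int i = 0; i < numTasks; i++) {
            if (counts[i] > 0) {
                System.out.println("task " + i + " gets " + counts[i] + " key(s)");
            }
        }
    }
}
```

Replacing the sample arrays with the real 20 combinations shows which task index each one maps to. Two things to keep in mind when reading the result: different strings can share a hash code (for example "Aa" and "BB"), and with 32 tasks only the low five bits of the hash select the task, so a small set of keys can easily pile up on a few tasks.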
