Thanks, guys. From what I understand, partial key grouping is meant for cases where you know your grouping will create imbalance. In my case, most of my field values group to a single bolt, which makes it a bottleneck. Since I emit strings, I guess the hash is computed as ArrayList(str1, str2, ...).hashCode(). This hash code is coming out the same for different string combinations...
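A minimal sketch for checking that guess offline, assuming the grouping deep-hashes the list of field values and reduces it with a non-negative modulo over the task count, as described further down in this thread. The class name, the helper names, and the sample field values are all hypothetical; swap in the real combinations.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FieldGroupingHashCheck {

    // Deep-hash the grouping field values, then reduce modulo the task count.
    // Math.floorMod is an assumption here: it keeps the index non-negative
    // when the hash is negative, which plain % in Java would not.
    static int taskIndex(List<?> groupingValues, int numTasks) {
        int hash = Arrays.deepHashCode(groupingValues.toArray());
        return Math.floorMod(hash, numTasks);
    }

    // Counts how many of the given key combinations land on each task index.
    static int[] histogram(List<List<String>> keys, int numTasks) {
        int[] counts = new int[numTasks];
        for (List<String> key : keys) {
            counts[taskIndex(key, numTasks)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        int numTasks = 32; // consumer task count from the original question

        // Hypothetical field values; replace with the ~20 real combinations.
        List<List<String>> keys = new ArrayList<>();
        for (String first : new String[] {"A", "B", "C", "D"}) {
            for (String second : new String[] {"1", "2", "3", "4", "5"}) {
                keys.add(Arrays.asList(first, second));
            }
        }

        // Print only the task indexes that received at least one key.
        int[] counts = histogram(keys, numTasks);
        for (int i = 0; i < numTasks; i++) {
            if (counts[i] > 0) {
                System.out.println("task " + i + ": " + counts[i] + " key(s)");
            }
        }
    }
}
```

With only 20 distinct combinations, at most 20 of the 32 task indexes can ever receive data, and because 32 is a power of two only the low five bits of the hash decide the index. If several high-volume combinations share an index in this output, the skew probably comes from the data rather than from the hash being identical.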
Thanks
Kashyap

On Sep 29, 2015 17:51, "Matthias J. Sax" <[email protected]> wrote:

> Whether you can use "partial key grouping" depends on your use case. Think
> carefully before you apply it...
>
> Maybe you want to read the research paper about it. It clearly describes
> when you can use it and when not:
>
> https://melmeric.files.wordpress.com/2014/11/the-power-of-both-choices-practical-load-balancing-for-distributed-stream-processing-engines.pdf
>
> -Matthias
>
> On 09/30/2015 12:18 AM, Ken Danniswara wrote:
> > Hi,
> >
> > From what I read, the default FieldGrouping does not balance the load the
> > way ShuffleGrouping does. There is a discussion about a custom grouping
> > implementation called partial key grouping, which balances the load
> > better. Maybe it helps. https://github.com/gdfm/partial-key-grouping
> >
> > On Wed, Sep 30, 2015 at 12:11 AM, Kashyap Mhaisekar <[email protected]
> > <mailto:[email protected]>> wrote:
> >
> >     Thanks Derek. I use strings and I still end up with some bolts
> >     having the maximum requests :(
> >
> >     On Tue, Sep 29, 2015 at 5:03 PM, Derek Dagit <[email protected]
> >     <mailto:[email protected]>> wrote:
> >
> >         The code that hashes the field values is here:
> >
> >         https://github.com/apache/storm/blob/9d911ec1b4f7b5aabe646a5d2cd31591fe4df1b0/storm-core/src/clj/backtype/storm/tuple.clj#L24
> >
> >         You can write a little java program, something like:
> >
> >         import java.util.ArrayList;
> >         import java.util.Arrays;
> >
> >         public static void main(String[] args) {
> >             ArrayList<String> myList = new ArrayList<String>();
> >             myList.add("first field value");
> >             myList.add("second field value");
> >
> >             int hash = Arrays.deepHashCode(myList.toArray()); // as in tuple.clj
> >
> >             System.out.println("hash is " + hash);
> >             int numTasks = 32;
> >
> >             // % can go negative in Java; floorMod keeps the index in [0, numTasks)
> >             System.out.println("task index is " + Math.floorMod(hash, numTasks));
> >         }
> >
> >         There are certain types of values that may not hash
> >         consistently. If you are using String values, then it should be
> >         fine. Other types may or may not, depending on how the class
> >         implements hashCode().
> >
> >         --
> >         Derek
> >
> >         ________________________________
> >         From: Kashyap Mhaisekar <[email protected]
> >         <mailto:[email protected]>>
> >         To: [email protected] <mailto:[email protected]>
> >         Sent: Tuesday, September 29, 2015 4:28 PM
> >         Subject: Field Group Hash Computation
> >
> >         Hi,
> >         I have a field grouping based on 2 fields. I have 32 consumers
> >         for the tuple and I see that most of the time, out of 64 bolts,
> >         the field group always lands on 8 of them. Of the 8, 2 have more
> >         than 60% of the data. The data for the field grouping can have 20
> >         different combinations.
> >
> >         Do you know how the hash of the grouping fields is computed?
> >         One of the group's mails indicates that the approach is:
> >
> >         It calls "hashCode" on the list of selected values and mods it
> >         by the number of consumer tasks. You can play around with that
> >         function to see if something about your data is causing
> >         something degenerative to happen and cause skew
> >
> >         I saw the Clojure code but am not sure how to read it.
> >
> >         Thanks
> >         Kashyap
