Hi,

I would say that your assumption is correct and that the COMBINE strategy does in fact also depend on the ratio "#total records/#records that fit into a single Sorter/Hashtable".
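To put rough numbers on your example (a quick back-of-envelope sketch in plain Java; it assumes uniformly distributed keys and models the combiner as emitting one record per distinct key per in-memory chunk, which is only an approximation of what ReduceCombineDriver actually does):

    // Back-of-envelope sketch. Numbers are the ones from your mail;
    // the model assumes uniformly distributed keys and a combiner that
    // emits ~one record per distinct key per in-memory chunk.
    public class CombineRatioSketch {

        public static void main(String[] args) {
            double keys = 100_000_000d;   // 100 million distinct keys
            double chunk = 1_000_000d;    // records fitting into one hashtable

            // Expected distinct keys among 'chunk' uniformly drawn records:
            // K * (1 - (1 - 1/K)^C), approximated as K * (1 - exp(-C/K)).
            double expectedDistinct = keys * (1 - Math.exp(-chunk / keys));

            // The combiner emits ~one record per distinct key, so the
            // rest is what actually got combined away.
            double combinedAway = 1 - expectedDistinct / chunk;
            System.out.printf("fraction combined away per chunk: %.2f%%%n",
                    combinedAway * 100);  // ~0.50% in this example
        }
    }

So even though #total records/#keys is 100 here, i.e. well past the documented 1/10th threshold for HASH, a single table pass would only combine away about half a percent of the records.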
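By the way, in case you want to control this explicitly rather than rely on the heuristics: the hint can be set directly on the reduce operator. A minimal sketch against the DataSet API (the Tuple2 schema, the key field, and the sum function are just placeholders for illustration):

    // Minimal sketch: setting the combine strategy explicitly instead
    // of relying on the SORT default.
    import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;
    import org.apache.flink.api.java.tuple.Tuple2;

    public class CombineHintExample {

        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            DataSet<Tuple2<Long, Long>> events = env.fromElements(
                    Tuple2.of(1L, 1L), Tuple2.of(2L, 1L), Tuple2.of(1L, 1L));

            events.groupBy(0)                          // key on field 0
                  .reduce((a, b) -> Tuple2.of(a.f0, a.f1 + b.f1))
                  .setCombineHint(CombineHint.HASH)    // or CombineHint.SORT
                  .print();
        }
    }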
I'm CC'ing Fabian, just to be sure. He knows that stuff better than I do.

Best,
Aljoscha

> On 31. Aug 2017, at 13:41, Urs Schoenenberger
> <urs.schoenenber...@tngtech.com> wrote:
>
> Hi all,
>
> I was wondering about the heuristics for CombineHint:
>
> Flink uses SORT by default, but the doc for HASH says that we should
> expect it to be faster if the number of keys is less than 1/10th of the
> number of records.
>
> HASH should be faster if it is able to combine a lot of records, which
> happens if multiple events for the same key are present in a data chunk
> *that fits into a combine-hashtable* (cf. the handling in
> ReduceCombineDriver.java).
>
> Now, if I have 10 billion events and 100 million keys, but only about 1
> million records fit into a hashtable, the number of matches may be
> extremely low, so very few events are getting combined (of course, this
> is similar for SORT, as the sorter's memory is bounded, too).
>
> Am I correct in assuming that the actual tradeoff is not only based on
> the ratio of #total records/#keys, but also on #total records/#records
> that fit into a single Sorter/Hashtable?
>
> Thanks,
> Urs
>
> --
> Urs Schönenberger - urs.schoenenber...@tngtech.com
>
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
> Sitz: Unterföhring * Amtsgericht München * HRB 135082