Hi,

I would say that your assumption is correct and that the COMBINE strategy does 
in fact also depend on the ratio "#total records / #records that fit into a 
single Sorter/Hashtable".
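
To put a rough number on your example (a back-of-envelope estimate on my
part, assuming uniformly distributed keys): a table holding T records drawn
from K keys is expected to contain about K * (1 - (1 - 1/K)^T) distinct
keys, so only the remaining records can be combined per table fill, which
is roughly T^2 / (2K) when T is much smaller than K.

// Back-of-envelope estimate; assumes uniformly distributed keys.
public class CombineEstimate {
    public static void main(String[] args) {
        double T = 1_000_000;    // records that fit into one combine hashtable
        double K = 100_000_000;  // distinct keys
        // expected number of distinct keys among T records: K * (1 - (1 - 1/K)^T)
        double expectedDistinct = K * (1 - Math.pow(1 - 1 / K, T));
        double combined = T - expectedDistinct;
        // prints roughly 5000 (0.50%), i.e. only about 0.5% of the records
        // in a full table get combined away
        System.out.printf("combined per table fill: %.0f (%.2f%%)%n",
                combined, 100 * combined / T);
    }
}

So with your numbers the combiner barely reduces the data at all, which
supports your point.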
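
For completeness, the strategy can also be set explicitly per operator via
setCombineHint on the reduce operator. A minimal sketch (the data set and
field layout are made up):

import org.apache.flink.api.common.functions.ReduceFunction;
import org.apache.flink.api.common.operators.base.ReduceOperatorBase.CombineHint;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.tuple.Tuple2;

public class CombineHintExample {
    // Sums the Long field per String key, hinting HASH instead of the
    // default SORT strategy.
    public static DataSet<Tuple2<String, Long>> sumPerKey(
            DataSet<Tuple2<String, Long>> events) {
        return events
                .groupBy(0) // key on field 0
                .reduce(new ReduceFunction<Tuple2<String, Long>>() {
                    @Override
                    public Tuple2<String, Long> reduce(
                            Tuple2<String, Long> a, Tuple2<String, Long> b) {
                        return new Tuple2<>(a.f0, a.f1 + b.f1);
                    }
                })
                .setCombineHint(CombineHint.HASH);
    }
}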

I'm CC'ing Fabian, just to be sure. He knows that stuff better than I do.

Best,
Aljoscha

> On 31. Aug 2017, at 13:41, Urs Schoenenberger 
> <urs.schoenenber...@tngtech.com> wrote:
> 
> Hi all,
> 
> I was wondering about the heuristics for CombineHint:
> 
> Flink uses SORT by default, but the doc for HASH says that we should
> expect it to be faster if the number of keys is less than 1/10th of the
> number of records.
> 
> HASH should be faster if it is able to combine a lot of records, which
> happens if multiple events for the same key are present in a data chunk
> *that fits into a combine-hashtable* (cf handling in
> ReduceCombineDriver.java).
> 
> Now, if I have 10 billion events and 100 million keys, but only about 1
> million records fit into a hashtable, the number of matches may be
> extremely low, so very few events get combined (of course, this is
> similar for SORT, as the sorter's memory is bounded, too).
> 
> Am I correct in assuming that the actual tradeoff is not only based on
> the ratio of #total records/#keys, but also on #total records/#records
> that fit into a single Sorter/Hashtable?
> 
> Thanks,
> Urs
> 
> -- 
> Urs Schönenberger - urs.schoenenber...@tngtech.com
> 
> TNG Technology Consulting GmbH, Betastr. 13a, 85774 Unterföhring
> Geschäftsführer: Henrik Klagges, Dr. Robert Dahlke, Gerhard Müller
> Sitz: Unterföhring * Amtsgericht München * HRB 135082
