Hi Greg!

That number should control the merge fan-in, yes. Perhaps a bug introduced a while back prevents this parameter from being properly passed through the system. Have you changed the config value on the cluster, on the client, or are you starting the job via the command line (in which case both are the same)? In any case, we'll definitely fix that soon. Could you open an issue for it?
Concerning the sub-optimal merging: you are right, this could be improved. Right now, the code attempts to create uniformly sized files, but your suggestion would be more efficient. The relevant part is here:
https://github.com/StephanEwen/incubator-flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/operators/sort/UnilateralSortMerger.java#L1400

Is this a critical issue for you? Would you be up for making a patch? It should be a fairly isolated change.

Greetings,
Stephan

On Thu, Sep 3, 2015 at 3:02 AM, Greg Hogan <c...@greghogan.com> wrote:
> When workers spill more than 128 files, I have seen these fully merged
> into one or more much larger files. Does the following parameter allow
> more files to be stored without requiring the intermediate merge-sort?
> I have changed it to 1024 without effect. Also, it appears that the
> entire set of small files is reprocessed rather than the minimum
> required to attain the max fan-in (i.e., starting with 150 files, 23
> would be merged, leaving 128 to be processed concurrently).
>
> taskmanager.runtime.max-fan: The maximal fan-in for external merge
> joins and fan-out for spilling hash tables. Limits the number of file
> handles per operator, but may cause intermediate merging/partitioning,
> if set too small (DEFAULT: 128).
>
> Greg Hogan
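For anyone following along, the minimal-merge strategy Greg describes can be sketched as follows. This is a standalone illustration, not Flink code; the class and method names are hypothetical. Merging k files replaces them with one file, so to bring n files down to the fan-in limit F in a single intermediate pass you need n - k + 1 <= F, i.e. k = n - F + 1:

```java
public class MinimalMerge {

    /**
     * Number of spilled files to merge in one intermediate pass so that
     * the remaining count fits within the max fan-in.
     * Merging k files replaces them with 1 file: n - k + 1 <= maxFan
     * gives k = n - maxFan + 1.
     */
    static int filesToMerge(int numFiles, int maxFan) {
        if (numFiles <= maxFan) {
            return 0; // already within the fan-in limit, no intermediate merge needed
        }
        return numFiles - maxFan + 1;
    }

    public static void main(String[] args) {
        // Greg's example: 150 spilled files with max-fan 128
        // -> merge 23 files, leaving 128 for the final concurrent merge
        System.out.println(filesToMerge(150, 128));
        System.out.println(filesToMerge(128, 128));
    }
}
```

Merging the k smallest files (rather than an arbitrary subset) minimizes the bytes rewritten in the intermediate pass.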