When workers spill more than 128 files, I have seen these fully merged into
one or more much larger files. Does the following parameter allow more
files to be stored without requiring the intermediate merge-sort? I have
changed it to 1024 without effect. Also, it appears that the entire set of
small files is reprocessed rather than the minimum required to attain the
max fan-in (i.e., starting with 150 files, 23 would be merged leaving 128
to be processed concurrently).

taskmanager.runtime.max-fan: The maximal fan-in for external merge joins
and fan-out for spilling hash tables. Limits the number of file handles per
operator, but may cause intermediate merging/partitioning, if set too small
(DEFAULT: 128).

Greg Hogan

Reply via email to