Hey there,
According to the docs, when using a FileBasedSource or a splittable
DoFn, the runner is free to initiate splits that can then be run in
parallel. As far as I can tell, the splitting does happen on my Apache
Flink cluster, but the parallel execution does not: a single TaskManager
ends up processing all splits of an input text file. Is this known
behaviour, and how can I fix it?
I have a pipeline that looks like this (Python SDK):
(pipeline
 | 'Read Input File' >> textio.ReadFromText(input_glob, min_bundle_size=1)
 | 'Reshuffle Lines' >> beam.Reshuffle()
 | 'Map Records' >> beam.ParDo(map_func))
The input file is a large, uncompressed plaintext file from a shared
drive containing millions of newline-separated data records. I am
running this job with a parallelism of 100, but it is bottlenecked by a
single worker running ReadFromText(). The reshuffling in between was
added to force Beam/Flink to parallelize the processing, but this has no
effect on the preceding stage. Only the following map operation is being
run in parallel. The stage itself is marked as having a parallelism of
100, but 99 workers finish immediately.
I had the same issue earlier with another input source, in which I match
a bunch of WARC file globs and then iterate over them in a splittable
DoFn. I solved the missing parallelism by adding an explicit reshuffle
in between matching input globs and actually reading the individual files:
class WarcInput(beam.PTransform):
    def __init__(self, file_pattern):
        self._file_pattern = file_pattern

    def expand(self, pcoll):
        return (pcoll
                | MatchFiles(self._file_pattern)
                | beam.Reshuffle()
                | beam.ParDo(WarcReader()))
This way I can at least achieve parallelism on file level. This doesn't
work with a single splittable input file, of course, for which I would
have to reshuffle somewhere inside of ReadFromText(). Do I really have
to write a custom PTransform that generates initial splits, shuffles
them, and then reads from those splits? I consider this somewhat
essential functionality.
Any hints appreciated.
Janek