Hi,

I ran a Beam+Flink+YARN job with many containers and a high
"parallelism.default" parameter in my conf/flink-conf.yaml file.
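
For reference, the relevant lines in conf/flink-conf.yaml look roughly like
this (the exact numbers here are placeholders, not my real values):

    taskmanager.numberOfTaskSlots: 8
    parallelism.default: 64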

That all worked perfectly, with all the containers parallelizing every
part of the job, right up until the very end at a TextIO.write().
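
In case it helps, here is a stripped-down sketch of what the pipeline does,
written against the Beam Java SDK (the paths, class name, and transform
names are placeholders, and the real intermediate transforms are elided):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class WriteSketch {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        // Batch read from the file system (placeholder path).
        PCollection<String> lines = p.apply("ReadInput",
            TextIO.read().from("hdfs:///path/to/input/*"));

        // ... intermediate transforms elided ...

        // The final write; in the Flink UI this shows up as the DataSink
        // that ran with a single Task. I have not set any sharding options
        // here, just the plain write.
        lines.apply("WriteOutput",
            TextIO.write().to("hdfs:///path/to/output/part"));

        p.run().waitUntilFinish();
      }
    }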

The last Task running was a "DataSink
(org.apache.flink.api.java.io.DiscardingOutputFormat)", during which I
could finally observe output being written to HDFS. (The pipeline reads
from the file system in batch mode, so I'm not entirely shocked that the
output wasn't written until the end.)

The problem is: This Task used only one TaskManager/Container and ran all
by itself for the last 15 minutes of the pipeline while all the other
TaskManagers/Containers sat idle.

How can I make sure that this gets parallelized the same as all the other
Tasks?

Should I address this through the Beam API or through some Flink
configuration parameter I haven't found yet?

Is it even possible to have multiple TaskManagers writing the TextIO output
to HDFS at the same time?

Thank you for your help,
Chris
