Hi,

I ran a Beam+Flink+YARN job with many containers and a high "parallelism.default" parameter in my conf/flink-conf.yaml file.
That all worked perfectly, with all the containers parallelizing every part of the job, until the very end at a TextIO.write(). The last Task running was a "DataSink (org.apache.flink.api.java.io.DiscardingOutputFormat)", during which I could finally observe output being written to HDFS. (The pipeline was running in batch mode, reading from the file system, so I'm not entirely shocked that the output writing was saved until the end.)

The problem is that this Task used only one TaskManager/Container and ran all by itself for the last 15 minutes of the pipeline, while all the other TaskManagers/Containers sat idle. How can I make sure this step gets parallelized the same as all the other Tasks? Should I address it through the Beam API, or through some Flink configuration parameter I haven't found yet? Is it even possible to have multiple TaskManagers writing the TextIO output to HDFS at the same time?

Thank you for your help,
Chris
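
P.S. In case it helps, here is a minimal sketch of roughly what the pipeline does; the transform names, HDFS paths, and options handling below are placeholders rather than my actual code:

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptions;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class WriteExample {
      public static void main(String[] args) {
        // Options come from the command line; I launch this with --runner=FlinkRunner on YARN.
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline p = Pipeline.create(options);

        // Batch read from HDFS (placeholder path).
        PCollection<String> lines =
            p.apply("ReadInput", TextIO.read().from("hdfs:///path/to/input/*"));

        // ... the rest of the transforms run here and parallelize nicely ...

        // The final write: this is the step that shows up as the single DataSink
        // and runs on only one TaskManager at the end.
        lines.apply("WriteOutput", TextIO.write().to("hdfs:///path/to/output/part"));

        p.run().waitUntilFinish();
      }
    }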
