Hi,

For my application I want to write files with records that are compact on disk and easy to read back from my application code, so I want to write something like Parquet or ORC from a Beam application.
I found https://github.com/apache/beam/pull/1851 for Parquet and decided to try to make it actually work. While doing so I was confronted with the full impact of the design choice that in Beam (Dataflow?) you cannot specify how many parallel instances of a step you want. I found that sinks like TextIO first write all the data to temporary files, using a UUID in each filename, and then rewrite everything into the desired number of shards. Doing the same for Parquet files looks like too big a hurdle given the complexity of the file format. If you skip that step, you end up with a varying number of files, where you actually need something like a UUID to keep the filenames unique.

From my perspective the core problem here is that Beam (in general) scales the steps of a pipeline automatically, which is really good in most scenarios ... except in scenarios like this one.

I would like advice on how to proceed in this case. At this point I'm really tempted to switch back to Flink, where support for these file formats is readily available and works as expected.
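To make that second option concrete, here is a rough sketch (not the code from the PR) of the "just write uniquely named files" approach: a DoFn that opens one Parquet file per bundle via parquet-avro, with a UUID in the filename. The output directory and the way the Avro schema is passed in are just placeholder assumptions on my side.

import java.util.UUID;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;

// Writes each bundle to its own Parquet file; the UUID keeps the names unique
// across workers, so there is no second "reshard into N files" pass.
class WriteBundleAsParquetFn extends DoFn<GenericRecord, Void> {
  private final String outputDir;   // e.g. "hdfs:///tmp/parquet-out" (assumption)
  private final String schemaJson;  // Avro Schema is not Serializable, so pass it as JSON

  private transient ParquetWriter<GenericRecord> writer;

  WriteBundleAsParquetFn(String outputDir, String schemaJson) {
    this.outputDir = outputDir;
    this.schemaJson = schemaJson;
  }

  @StartBundle
  public void startBundle() throws Exception {
    Schema schema = new Schema.Parser().parse(schemaJson);
    Path file = new Path(outputDir, "records-" + UUID.randomUUID() + ".parquet");
    writer = AvroParquetWriter.<GenericRecord>builder(file)
        .withSchema(schema)
        .build();
  }

  @ProcessElement
  public void processElement(ProcessContext c) throws Exception {
    writer.write(c.element());
  }

  @FinishBundle
  public void finishBundle() throws Exception {
    writer.close();
  }
}

Of course, retried or speculatively executed bundles can still leave duplicate files behind, which is exactly why TextIO goes through the temporary-file-and-rename dance in the first place.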
--
Best regards,
Niels Basjes