Thanks, Ron.
For HDFS, a reasonable level of parallelism is reading multiple blocks in
parallel. Of course, that could mean losing the ordering that a file usually
guarantees. Now if I understand correctly, this may become a problem in
watermarking. But with smaller files having bounded high water
Hi,
Regarding the CSV and AvroParquet stream formats not supporting splits, I
think some hints may be available from [1]. Personally, I think the main
consideration should be the question of how the row format can find a
reasonable split point, and how many Splits are appropriate to slice a file
m
Hi, I am trying to collect files from HDFS in my DataStream job. I need to
collect two types of files - CSV and Parquet.
I understand that Flink supports both formats, but in Streaming mode, Flink
doesn't support splitting these formats. Splitting is only supported in the
Table API.
I wanted to under