Re: Splitting in Stream Formats for File Source

2023-08-20 Thread Chirag Dewan via user
Thanks Ron. For HDFS, a reasonable level of parallelism is reading multiple blocks in parallel. Of course, that could mean losing the ordering that a file usually guarantees. Now, if I understand correctly, this may become a problem for watermarking. But with smaller files having bounded high water

Re: Splitting in Stream Formats for File Source

2023-08-20 Thread liu ron
Hi, Regarding the CSV and AvroParquet stream formats not supporting splits, I think some hints may be available from [1]. Personally, I think the main consideration is how the row format can find a reasonable split point, and how many splits are appropriate to slice a file m
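
To make the split-point question concrete, here is a minimal sketch of the usual newline-alignment trick for line-delimited files. It is not Flink's implementation; the helper names `split_ranges` and `read_split` are hypothetical and chosen only for illustration. It assumes every record ends in a single `\n` and that no field contains an embedded newline, which is exactly the assumption quoted CSV fields can violate.

```python
import io


def split_ranges(size, n):
    """Cut [0, size) into n nominal byte ranges (hypothetical helper)."""
    step = max(1, size // n)
    bounds = [min(i * step, size) for i in range(n)] + [size]
    return list(zip(bounds[:-1], bounds[1:]))


def read_split(stream, start, end):
    """Return the lines whose first byte falls inside [start, end)."""
    if start > 0:
        # Step back one byte and discard through the next newline.
        # If start already sits on a line boundary this discards only
        # the previous line's terminator; otherwise it skips the partial
        # line, which belongs to the previous split.
        stream.seek(start - 1)
        stream.readline()
    else:
        stream.seek(0)
    lines = []
    # A line that begins before `end` is read in full, even if it
    # runs past `end`; the next split skips that tail.
    while stream.tell() < end:
        line = stream.readline()
        if not line:
            break
        lines.append(line.rstrip(b"\n"))
    return lines


data = b"a,1\nbb,2\nccc,3\ndddd,4\n"
stream = io.BytesIO(data)
per_split = [read_split(stream, s, e) for s, e in split_ranges(len(data), 3)]
```

Each line is read by exactly one split, so the splits can be processed in parallel. The order of lines across splits is no longer guaranteed once that happens, which is the ordering concern raised in the reply.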

Splitting in Stream Formats for File Source

2023-08-16 Thread Chirag Dewan via user
Hi, I am trying to collect files from HDFS in my DataStream job. I need to collect two types of files: CSV and Parquet. I understand that Flink supports both formats, but in streaming mode, Flink doesn't support splitting these formats. Splitting is only supported in the Table API. I wanted to under