Re: DataStream Batch Execution Mode and large files.

2021-05-18 Thread Marco Villalobos
---
> *Sender:* Marco Villalobos
> *Send Date:* Wed May 19 09:50:45 2021
> *Recipients:* user
> *Subject:* DataStream Batch Execution Mode and large files.
>
>> Hi,
>>
>> I am using the DataStream API in Batch Execution Mode, and my "source" is an

Re: DataStream Batch Execution Mode and large files.

2021-05-18 Thread Yun Gao
Hi Marco,

With BATCH mode, all the ALL_TO_ALL edges are marked as blocking and use intermediate files to transfer data. Flink now supports hash shuffle and sort shuffle for blocking edges [1]; both of them store the intermediate files in the directories configured by io.tmp.dirs [2].
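To make the answer concrete, here is a minimal sketch (not from the original thread) of a DataStream job switched to BATCH mode with explicit spill directories. The directory paths are made up, and the sort-shuffle option shown is an assumption that should be verified against the documentation referenced above; on a real cluster io.tmp.dirs is normally set in flink-conf.yaml rather than in job code.

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class BatchShuffleConfigSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Illustrative spill directories; they should point at fast local disks
            // with enough free space to hold the materialized intermediate data.
            conf.setString("io.tmp.dirs", "/mnt/disk1/flink-tmp,/mnt/disk2/flink-tmp");
            // Assumed option for preferring the sort-based blocking shuffle at any
            // parallelism; check the exact key and default against [1].
            conf.setString("taskmanager.network.sort-shuffle.min-parallelism", "1");

            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment(conf);
            // In BATCH mode, ALL_TO_ALL edges (e.g. after keyBy) become blocking
            // exchanges that spill to the directories configured above instead of
            // holding the full data set in memory.
            env.setRuntimeMode(RuntimeExecutionMode.BATCH);

            env.fromElements(1, 2, 3, 4, 5)
               .keyBy(i -> i % 2)            // creates an ALL_TO_ALL, blocking exchange
               .reduce(Integer::sum)
               .print();

            env.execute("batch-shuffle-sketch");
        }
    }

Since the blocking shuffle materializes the exchanged data on disk, the configured directories need enough capacity for the shuffled portion of the job's data.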

DataStream Batch Execution Mode and large files.

2021-05-18 Thread Marco Villalobos
Hi,

I am using the DataStream API in Batch Execution Mode, and my "source" is an S3 bucket with about 500 GB of data spread across many files. Where does Flink store the results of processed / produced data between tasks? There is no way that 500 GB will fit in memory, so I am very curious how
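For reference, a minimal sketch of the kind of job described above (not Marco's actual code). It assumes Flink 1.12+ with an S3 filesystem plugin (flink-s3-fs-hadoop or flink-s3-fs-presto) installed; the bucket path and the per-record logic are purely illustrative.

    import org.apache.flink.api.common.RuntimeExecutionMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class S3BatchJobSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                    StreamExecutionEnvironment.getExecutionEnvironment();
            // BATCH mode requires bounded sources; shuffles become blocking exchanges
            // whose intermediate data is written to disk between tasks.
            env.setRuntimeMode(RuntimeExecutionMode.BATCH);

            // readTextFile scans the given directory once, so it acts as a bounded
            // source; newer Flink versions recommend the FileSource connector instead.
            env.readTextFile("s3://my-bucket/input/")
               .map(String::length)            // derive something per record (illustrative)
               .keyBy(len -> len % 10)         // forces a network shuffle between tasks
               .reduce(Integer::sum)
               .print();

            env.execute("s3-batch-sketch");
        }
    }

Where the shuffled data between tasks actually lands is what Yun Gao's reply above addresses: the blocking exchanges spill to the directories configured by io.tmp.dirs, not to memory.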