Hi Fabian,

Thanks for the information. I will try to look at the change to that complex logic you mentioned when I have time. That would save one more shuffle (from 1 to 0), wouldn't it?
BTW, regarding fault tolerance in the file reader task, could you help explain what would happen if the reader task crashes in the middle of reading one split? E.g., the split has 100 lines and the reader crashes after reading 30 lines. What would happen when the operator is resumed? Would those first 30 lines get reprocessed a second time?

Those tens of thousands of files that I have are currently not in CSV format. Each file has a header section of 10-20 lines (common data for the node), then a data section with one CSV line per record, then again some common data, and finally a second data section, also with one CSV line per record.

My current solution is a non-Flink job that preprocesses those files into standard CSV format to serve as the input for Flink. I am thinking of doing this in Flink instead, with a custom file reader function that works similarly to the wholeTextFiles function in Spark batch processing. However, I don't know yet how to make that fault tolerant.

Thank you very much for your support.

Regards,
Averell

--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
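P.S. To make the file layout concrete, here is a rough sketch (plain Python, outside Flink) of the parsing I have in mind for one file. The section marker "---" and the function names are purely illustrative; my actual files use their own delimiters:

```python
def split_sections(text, marker="---"):
    # Split raw file text into sections, assuming a marker line
    # separates them (the marker is illustrative, not the real format).
    sections, current = [], []
    for line in text.splitlines():
        if line.strip() == marker:
            sections.append(current)
            current = []
        else:
            current.append(line)
    sections.append(current)
    return sections

def extract_csv_records(text, marker="---"):
    # Assumed layout: header / data / common data / data, i.e. the
    # CSV data sections sit at the odd indices after splitting.
    sections = split_sections(text, marker)
    records = []
    for data in sections[1::2]:
        records.extend(line.split(",") for line in data if line.strip())
    return records

# Example: 2 header lines, 2 records, 1 common line, 1 record.
sample = "h1\nh2\n---\n1,a\n2,b\n---\nc1\n---\n3,c\n"
print(extract_csv_records(sample))
# → [['1', 'a'], ['2', 'b'], ['3', 'c']]
```

The open question for me is how to checkpoint progress inside such a per-file parser so that a restart does not re-emit records already processed.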