Hi Fabian,

Thanks for the information. I will try to look into the change to that complex
logic you mentioned when I have time. That would save one more shuffle
(from 1 to 0), wouldn't it?

BTW, regarding fault tolerance in the file reader task, could you help
explain what would happen if the reader task crashes in the middle of reading
one split? E.g., the split has 100 lines and the reader crashes after
reading 30 lines. What happens when the operator is resumed? Would
those first 30 lines get reprocessed a second time?
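For context, I came across the CheckpointableInputFormat interface in the
Flink code, and my guess is that the answer depends on whether the reader's
format implements it. Paraphrased into Scala notation (the comments are just
my reading, which may well be wrong):

    // Paraphrase of org.apache.flink.api.common.io.CheckpointableInputFormat
    // (the real one is a Java interface; comments are my own interpretation).
    trait CheckpointableInputFormat[S <: org.apache.flink.core.io.InputSplit,
                                    T <: java.io.Serializable] {
      // Snapshot of how far into the current split the reader is,
      // e.g. a byte offset for line-based formats.
      def getCurrentState: T

      // On recovery, reopen the same split and seek to the saved state,
      // so the first 30 lines of my example would not be re-emitted(?).
      def reopen(split: S, state: T): Unit
    }

If I read DelimitedInputFormat correctly, it implements this with a Long
offset, but please correct me if that is not how recovery actually works.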

Those tens of thousands of files that I have are currently not in CSV
format. Each file has a heading section of 10-20 lines (common data for
the node), then a data section with one CSV line per record, then again
some common data, and finally a second data section, also with one CSV line
per record. A rough sketch of the parsing I have in mind follows.
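To make that layout concrete, here is a hypothetical sketch of the per-file
parsing. The boundary test (a "REC," prefix) and the field names are
placeholders; my real files use different markers:

    // Hypothetical sketch only: "REC," and the Record fields are made-up
    // placeholders for whatever distinguishes data lines in my real files.
    case class Record(nodeHeader: String, csvLine: String)

    def parseFile(content: String): Seq[Record] = {
      val lines = content.split('\n').toList

      // Heading section: 10-20 lines of data common to the whole node.
      val (headerLines, body) = lines.span(l => !l.startsWith("REC,"))
      val nodeHeader = headerLines.mkString("|")

      // Both data sections hold one CSV line per record; the common block
      // between them is skipped by the same placeholder test.
      body.filter(_.startsWith("REC,")).map(Record(nodeHeader, _))
    }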
My current solution is to write a non-Flink job that preprocesses those files
into standard CSV, which then becomes the input for Flink.

I am thinking of doing this in Flink instead, with a custom file reader that
works similarly to the wholeTextFiles function in Spark batch processing.
However, I don't know yet how to make that fault tolerant. A sketch of what
I mean is below.
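For concreteness, this is a minimal sketch building on Flink's
FileInputFormat (unsplittable is a real flag on that class; the class name
and the rest are my own, and the checkpointing question above applies here
too):

    import java.io.ByteArrayOutputStream
    import java.nio.charset.StandardCharsets
    import org.apache.flink.api.common.io.FileInputFormat
    import org.apache.flink.core.fs.FileInputSplit

    // Minimal sketch: emit each file as a single String record, similar to
    // Spark's wholeTextFiles. Class and field names are mine, not a Flink API.
    class WholeFileInputFormat extends FileInputFormat[String] {
      unsplittable = true          // one split == one whole file, not byte ranges

      private var consumed = false // whether this file's single record was emitted

      override def open(split: FileInputSplit): Unit = {
        super.open(split)          // opens `stream` positioned at the file start
        consumed = false
      }

      override def reachedEnd(): Boolean = consumed

      override def nextRecord(reuse: String): String = {
        // Drain the whole file into memory; fine for my small per-node files.
        val buf = new ByteArrayOutputStream()
        val chunk = new Array[Byte](8192)
        var n = stream.read(chunk)
        while (n >= 0) { buf.write(chunk, 0, n); n = stream.read(chunk) }
        consumed = true
        new String(buf.toByteArray, StandardCharsets.UTF_8)
      }
    }

I would then feed each whole-file string into the parsing sketched earlier.
The part I'm missing is what happens on recovery, since this sketch has no
getCurrentState/reopen.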

Thank you very much for your support.

Regards,
Averell 


