Hi Averell,

Please find my answers inline.

Best, Fabian

2018-07-31 13:52 GMT+02:00 Averell <lvhu...@gmail.com>:

> Hi Fabian,
>
> Thanks for the information. I will try to look at the change to that
> complex logic that you mentioned when I have time. That would save one
> more shuffle (from 1 to 0), wouldn't it?
>

I'm not 100% sure about that. You would need to implement it in a way that
lets you use the "reinterpret as keyed stream" feature, which is currently
experimental [1].
I'm not sure whether that's possible.


>
> BTW, regarding fault tolerance in the file reader task, could you help
> explain what would happen if the reader task crashes in the middle of
> reading one split? E.g., the split has 100 lines, and the reader crashed
> after reading 30 lines. What would happen when the operator gets resumed?
> Would those first 30 lines get reprocessed the 2nd time?
>
>
This depends on the implementation of the InputFormat. If it implements the
CheckpointableInputFormat interface, it is able to checkpoint the current
reading position in a split and can be reset to that position during
recovery.
In either case, some records will be read twice, but if the reading position
can be reset, you can still have exactly-once state consistency because the
state is reset as well.
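To make the recovery behavior concrete, here is a small conceptual sketch in plain Python (not the actual Flink CheckpointableInputFormat API; the class and method names below are made up for illustration). It shows why re-reading records after a crash does not break exactly-once state consistency when the reading position is checkpointed and restored together with the operator state:

```python
# Conceptual sketch of checkpointing a reading position inside a split.
# This is NOT Flink code; it only illustrates the recovery mechanics.

class CheckpointableSplitReader:
    """Reads a split line by line and can report/restore its position."""

    def __init__(self, lines):
        self.lines = lines
        self.position = 0  # index of the next line to read

    def get_current_state(self):
        # Corresponds to checkpointing the current reading position.
        return self.position

    def reopen(self, position):
        # Corresponds to resetting the reader during recovery.
        self.position = position

    def next_record(self):
        record = self.lines[self.position]
        self.position += 1
        return record


split = [f"line-{i}" for i in range(100)]
reader = CheckpointableSplitReader(split)
state = []  # operator state, checkpointed together with the position

# Process 20 records, then take a checkpoint of (position, state).
for _ in range(20):
    state.append(reader.next_record())
checkpoint = (reader.get_current_state(), list(state))

# Read 10 more records, then simulate a crash: recovery restores both
# the reading position and the operator state from the checkpoint.
for _ in range(10):
    state.append(reader.next_record())
pos, state = checkpoint
reader.reopen(pos)

# Finish the split; records 20-29 are read a second time, but the state
# was reset as well, so each record contributes to the state exactly once.
while reader.position < len(split):
    state.append(reader.next_record())

assert len(state) == 100
assert state == split
```

Records 20-29 are read twice here, yet the final state contains each record exactly once, because the duplicate reads only ever happen against state that was rolled back to the same checkpoint.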


> Those tens of thousands of files that I have are currently not in CSV
> format. Each file has a header section of 10-20 lines (common data for
> the node), then a data section with one CSV line for each record, then
> again some common data, and finally a 2nd data section - one CSV line for
> each record.
> My current solution is to write a non-Flink job to preprocess those files
> and bring them to a standard CSV format to be the input for Flink.
>

You can implement that with a custom FileInputFormat.
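As a rough sketch of the parsing logic such a custom format could implement, here is a plain-Python illustration (not Flink's FileInputFormat API). The section markers ("#HEADER", "#DATA1", etc.) are invented for the example; a real format would key off whatever actually delimits the sections in your files:

```python
import csv

def parse_node_file(text):
    """Yield (section, row) tuples for the CSV data sections only,
    skipping the header and intermediate common-data blocks."""
    section = None
    for line in text.splitlines():
        if line.startswith("#DATA"):   # hypothetical data-section marker
            section = line.strip()
            continue
        if line.startswith("#"):       # header / common-data marker
            section = None
            continue
        if section is not None and line:
            row = next(csv.reader([line]))
            yield section, row

# A tiny file in the described shape: header, first data section,
# more common data, second data section.
sample = """\
#HEADER
node-id,42
#DATA1
1,foo
2,bar
#COMMON
more common data
#DATA2
3,baz
"""

records = list(parse_node_file(sample))
assert records == [
    ("#DATA1", ["1", "foo"]),
    ("#DATA1", ["2", "bar"]),
    ("#DATA2", ["3", "baz"]),
]
```

In a Flink FileInputFormat you would put this kind of state machine into the record-reading logic, so the preprocessing job becomes unnecessary and the files can be consumed directly.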


>
> I am thinking of doing this in Flink, with a custom file reader function
> which works in a similar way to the wholeTextFiles function in Spark batch
> processing. However, I don't know yet how to achieve fault tolerance when
> doing that.
>
> Thank you very much for your support.
>
> Regards,
> Averell
>
>
> --
> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>


[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.5/dev/stream/experimental.html#reinterpreting-a-pre-partitioned-data-stream-as-keyed-stream
