Hi Averell,

One comment regarding what you said:

> As my files are small, I think there would not be much benefit in
> checkpointing file offset state.

Checkpointing is not about efficiency but about consistency.
If the position in a split is not checkpointed, your application won't
operate with exactly-once state consistency unless each split produces
exactly one record.
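
To make this concrete, here is a minimal sketch in plain Python. It is a toy
simulation, not Flink code: the function name `process` and the parameters
`checkpoint_after` and `track_offset` are made up for illustration, and the
"state" is just a record count. A checkpoint is taken after a given number of
records, the job then fails, and it restarts from the checkpointed count and
reader position.

```python
def process(splits, checkpoint_after, track_offset):
    """Count records over all splits with one checkpoint and one failure.

    splits: list of splits, each a list of records.
    checkpoint_after: the checkpoint is taken after this many records.
    track_offset: if True, the checkpoint stores the offset inside the
        current split; if False, it only knows which splits are finished.
    Returns the record count held in state once the job has completed.
    """
    count = 0
    snapshot = None

    # First attempt: read until the checkpoint is taken, then "crash".
    for split_idx, split in enumerate(splits):
        for offset, _record in enumerate(split):
            count += 1
            if count == checkpoint_after:
                if track_offset:
                    position = (split_idx, offset + 1)
                elif offset + 1 == len(split):
                    position = (split_idx + 1, 0)  # split fully consumed
                else:
                    position = (split_idx, 0)  # must re-read from the start
                snapshot = (count, position)
                break
        if snapshot is not None:
            break

    if snapshot is None:
        return count  # checkpoint never reached, so no failure is simulated

    # Recovery: restore the count and the reader position, then finish.
    count, (split_idx, offset) = snapshot
    for i in range(split_idx, len(splits)):
        start = offset if i == split_idx else 0
        count += len(splits[i]) - start
    return count


splits = [["a", "b", "c"], ["d", "e"]]
print(process(splits, 2, track_offset=True))         # 5: every record counted once
print(process(splits, 2, track_offset=False))        # 7: "a" and "b" counted twice
print(process([["a"], ["b"], ["c"]], 2, False))      # 3: one record per split is safe
```

Without the offset, the restored count already includes records from the
partially read split, and re-reading that split from the beginning counts them
again. With one record per split there is never a partially read split, which
is why that case stays consistent.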

Best, Fabian

2018-08-10 9:10 GMT+02:00 Jörn Franke <jornfra...@gmail.com>:

> Or you write a custom file system for Flink... (for the tar part).
> Unfortunately, gz files can only be processed single-threaded (there are
> some multi-threaded implementations, but they don't bring a big gain).
> On 10. Aug 2018, at 07:07, vino yang <yanghua1...@gmail.com> wrote:
> Hi Averell,
> In this case, I think you may need to extend Flink's existing source.
> First, read your large tar.gz file; once it has been decompressed, use
> multiple threads to read the records in the source, and then parse the
> data format (map / flatmap might be a suitable operator; you can chain
> them with the source because these two operators don't require a data shuffle).
> Note that Flink doesn't encourage creating extra threads in UDFs, but I
> don't know if there is a better way for this scenario.
> Thanks, vino.
> Averell <lvhu...@gmail.com> wrote on Fri, Aug 10, 2018 at 12:05 PM:
>> Hi Fabian, Vino,
>> I have one more question. I initially planned to start a new thread for it,
>> but now I think it is better to ask here:
>> I need to process one big tar.gz file which contains multiple small gz
>> files. What is the best way to do this? I am thinking of having a single
>> thread that reads the TarArchiveStream (which has been decompressed
>> from that tar.gz by Flink automatically) and then distributes the
>> TarArchiveEntry entries to a multi-threaded operator, which would process the
>> small files in parallel. If this is feasible, which elements of Flink can I
>> reuse?
>> Thanks a lot.
>> Regards,
>> Averell
>> --
>> Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
