Re: Small-files source - partitioning based on prefix of file

vino yang Thu, 09 Aug 2018 22:08:07 -0700

Hi Averell,

In this case, I think you may need to extend one of Flink's existing sources.
First, read your large tar.gz file; once it has been decompressed, use
multiple threads inside the source to read the records, and then parse the
data format (map / flatMap might be suitable operators; you can chain them
with the source because these two operators don't require a data shuffle).


Note that Flink doesn't encourage creating extra threads in UDFs, but I
don't know if there is a better way for this scenario.
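
A minimal sketch of the threading part described above, in plain Java
without any Flink or tar dependency. The class and method names are
hypothetical, and the list of byte arrays stands in for the small gz entries
that one thread would read sequentially from the decompressed tar stream
(for example with commons-compress's TarArchiveInputStream). In a real
custom source the records would be emitted with ctx.collect(...) from run()
instead of being gathered into a list.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not part of Flink.
class ParallelEntryReader {

    /** Decompresses one small gz entry and splits it into non-empty lines. */
    static List<String> readRecords(byte[] gzBytes) {
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(gzBytes))) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            String text = new String(buf.toByteArray(), StandardCharsets.UTF_8);
            List<String> records = new ArrayList<>();
            for (String line : text.split("\n")) {
                if (!line.isEmpty()) {
                    records.add(line);
                }
            }
            return records;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * The calling thread walks the entries in order and submits each one to a
     * fixed pool; the results are then gathered in submission order.
     */
    static List<String> readAll(Iterable<byte[]> entries, int threads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<List<String>>> pending = new ArrayList<>();
            for (byte[] entry : entries) {
                pending.add(pool.submit(() -> readRecords(entry)));
            }
            List<String> out = new ArrayList<>();
            for (Future<List<String>> f : pending) {
                out.addAll(f.get());
            }
            return out;
        } finally {
            pool.shutdown();
        }
    }

    /** Test helper: gzips a string in memory to simulate one small gz file. */
    static byte[] gzip(String s) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(bos)) {
            out.write(s.getBytes(StandardCharsets.UTF_8));
        }
        return bos.toByteArray();
    }
}
```

Because the results are gathered on the calling thread, only that thread
would need to emit records downstream; the worker threads only decompress
and split.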

Thanks, vino.

Averell <lvhu...@gmail.com> wrote on Fri, Aug 10, 2018 at 12:05 PM:

> Hi Fabian, Vino,
>
> I have one more question, for which I initially planned to start a new
> thread, but now I think it is better to ask here:
> I need to process one big tar.gz file which contains multiple small gz
> files. What is the best way to do this? I am thinking of having a single
> thread read the TarArchiveStream (which has been decompressed from that
> tar.gz by Flink automatically) and then distribute the TarArchiveEntry
> entries to a multi-threaded operator which would process the small files
> in parallel. If this is feasible, which elements of Flink can I reuse?
>
> Thanks a lot.
> Regards,
> Averell
