Hi Averell,

In this case, I think you may need to extend one of Flink's existing sources. First, read your large tar.gz file. Once it has been decompressed, use multiple threads inside the source to read the records, and then parse the data format. map / flatMap might be suitable operators for the parsing; you can chain them with the source because neither of them requires a data shuffle.
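A rough sketch of the single-reader, parallel-parse idea, written in plain Python rather than as a Flink source, so it only illustrates the threading pattern. The names `read_archive`, `parse_member` and `make_sample_archive` are made up for this example. It assumes each inner file is gzip-compressed UTF-8 text with one record per line, and that each inner file is small enough to hold in memory.

```python
import gzip
import io
import tarfile
from concurrent.futures import ThreadPoolExecutor


def parse_member(name, gz_bytes):
    """Decompress one inner .gz file and split it into records (one per line)."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return name, text.splitlines()


def read_archive(fileobj, workers=4):
    """Walk the outer tar.gz in one thread, parse inner files in a thread pool."""
    futures = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # "r|gz" reads the archive as a forward-only stream, so the entries
        # have to be walked sequentially by a single reader.
        with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
            for member in tar:
                if not member.isfile():
                    continue
                # Copy the entry's bytes out before moving to the next entry,
                # then hand them to a worker for decompression and parsing.
                data = tar.extractfile(member).read()
                futures.append(pool.submit(parse_member, member.name, data))
        return dict(f.result() for f in futures)


def make_sample_archive(files):
    """Build an in-memory tar.gz of small .gz files (hypothetical test data)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files.items():
            payload = gzip.compress(text.encode("utf-8"))
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf
```

Calling `read_archive(make_sample_archive({"a.gz": "r1\nr2\n"}))` returns a dict mapping each inner file name to its list of records. In a Flink job the same hand-off would sit inside the source's read loop, with the parsed records emitted downstream instead of collected into a dict.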
Note that Flink doesn't encourage creating extra threads in UDFs, but I don't know of a better way for this scenario.

Thanks, vino.

On Fri, Aug 10, 2018 at 12:05 PM, Averell <lvhu...@gmail.com> wrote:

> Hi Fabian, Vino,
>
> I have one more question. I initially planned to start a new thread for it,
> but now I think it is better to ask here:
> I need to process one big tar.gz file which contains multiple small gz
> files. What is the best way to do this? I am thinking of having one
> single-threaded process that reads the TarArchiveStream (which has been
> decompressed from that tar.gz by Flink automatically), and then distributes
> the TarArchiveEntry entries to a multi-threaded operator which would process
> the small files in parallel. If this is feasible, which elements of Flink
> can I reuse?
>
> Thanks a lot.
> Regards,
> Averell