Re: Decompressing Tar Files for Batch Processing

Austin Cawley-Edwards Tue, 07 Jul 2020 07:54:38 -0700

Hey Chesnay,

Thanks for the advice. It's easy enough to do in a separate process.
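
A minimal sketch of what that separate step could look like, using only the Python standard library. The function names `download` and `extract_csvs`, the URL, and the paths are all hypothetical placeholders, and the sketch assumes the archive's CSV members have distinct base names, since it flattens them into one directory for the Flink batch job to read.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path


def download(url, dest):
    """Stream the archive at `url` (hypothetical) to the local file `dest`."""
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


def extract_csvs(tar_path, out_dir):
    """Copy every .csv member of a tar archive into out_dir, flattened."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # "r:*" lets tarfile auto-detect plain, gzip, bz2, or xz archives.
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            if not member.isfile() or not member.name.lower().endswith(".csv"):
                continue
            # Keep only the base name so a member path cannot escape out_dir.
            target = out / Path(member.name).name
            source = archive.extractfile(member)
            with source, open(target, "wb") as dest:
                shutil.copyfileobj(source, dest)
            written.append(target)
    return sorted(written)
```

Usage would be along the lines of `download("https://example.com/data.tar.gz", "data.tar.gz")` followed by `extract_csvs("data.tar.gz", "csv/")`, after which the Flink job is pointed at the `csv/` directory. Members are streamed rather than loaded whole, so ~1GB of data should not need to fit in memory.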


Best,
Austin

On Tue, Jul 7, 2020 at 10:29 AM Chesnay Schepler <ches...@apache.org> wrote:

> I would probably go with a separate process.
>
> Downloading the file could work with Flink if it is already present in
> some supported filesystem. Decompressing the file is supported for
> selected formats (deflate, gzip, bz2, xz), but this seems to be an
> undocumented feature, so I'm not sure how usable it is in reality.
>
> On 07/07/2020 01:30, Austin Cawley-Edwards wrote:
> > Hey all,
> >
> > I need to ingest a tar file containing ~1GB of data in around 10 CSVs.
> > The data is fairly connected and needs some cleaning, which I'd like
> > to do with the Batch Table API + SQL (but have never used before).
> > I've got a small prototype loading the uncompressed CSVs and applying
> > the necessary SQL, which works well.
> >
> > I'm wondering about the task of downloading the tar file and unzipping
> > it into the CSVs. Does this sound like something I can/should do in
> > Flink, or should I set up another process to download, unzip, and
> > store in a filesystem to then read with the Flink Batch job? My
> > research is leading me towards doing it separately but I'd like to do
> > it all in the same job if there's a creative way.
> >
> > Thanks!
> > Austin
>
>
>
