Hey Chesnay,

Thanks for the advice. It's easy enough to do this in a separate process.
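
For reference, here is a minimal sketch of what such a separate process could look like, assuming the archive is reachable over a URL and that Python 3 with only the standard library is available. The function names `download` and `extract_csvs` are hypothetical, as is the choice to flatten all CSVs into one output directory for the Flink batch job to read afterwards.

```python
import shutil
import tarfile
import urllib.request
from pathlib import Path
from typing import List


def download(url: str, dest: Path) -> Path:
    """Stream the archive at `url` to `dest` without loading it all into memory."""
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest


def extract_csvs(tar_path: Path, out_dir: Path) -> List[Path]:
    """Extract every regular *.csv member of the archive into `out_dir`.

    Assumption: member base names are unique. Directory structure inside
    the archive is dropped, so files with the same name would overwrite
    each other.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # "r:*" lets tarfile detect plain, gzip, bz2, or xz archives on its own.
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            if not member.isfile() or not member.name.lower().endswith(".csv"):
                continue
            # Keep only the base name so a member path cannot escape out_dir.
            target = out_dir / Path(member.name).name
            source = archive.extractfile(member)
            with source, open(target, "wb") as out:
                shutil.copyfileobj(source, out)
            written.append(target)
    return sorted(written)
```

Calling `extract_csvs(download(url, Path("data.tar")), Path("csv"))` would leave the uncompressed CSVs in `csv/`, which could then live on whatever filesystem the Flink job reads from.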

Best,
Austin

On Tue, Jul 7, 2020 at 10:29 AM Chesnay Schepler <ches...@apache.org> wrote:

> I would probably go with a separate process.
>
> Downloading the file could work with Flink if it is already present in
> some supported filesystem. Decompressing the file is supported for
> selected formats (deflate, gzip, bz2, xz), but this seems to be an
> undocumented feature, so I'm not sure how usable it is in reality.
>
> On 07/07/2020 01:30, Austin Cawley-Edwards wrote:
> > Hey all,
> >
> > I need to ingest a tar file containing ~1GB of data in around 10 CSVs.
> > The data is fairly connected and needs some cleaning, which I'd like
> > to do with the Batch Table API + SQL (but have never used before).
> > I've got a small prototype loading the uncompressed CSVs and applying
> > the necessary SQL, which works well.
> >
> > I'm wondering about the task of downloading the tar file and unzipping
> > it into the CSVs. Does this sound like something I can/should do in
> > Flink, or should I set up another process to download, unzip, and
> > store in a filesystem to then read with the Flink Batch job? My
> > research is leading me towards doing it separately but I'd like to do
> > it all in the same job if there's a creative way.
> >
> > Thanks!
> > Austin