Hello everyone,

At my company we are considering using Apache Beam, with the Python SDK, as part of our analytics system.
Our dataset is an unbounded collection of gzipped TAR archives, each containing several JSON and binary files. These archives need to be split into sub-categories, which essentially produces a new collection made up of smaller parts. Our transforms will operate on this second collection. The compressed archives are around 10 MiB each, and the largest binary files are around 16 MiB. We only have a couple of files that large; the rest of the binary files are smaller. In some cases we may also want transforms to generate new binary files from this collection.

The first problem I ran into is that there is no native way to extract TAR archives. My first approach was to unpack each archive in a temporary directory and then return the JSON files as objects and the binary files as bytes. This crashes the Flink runner because of the large memory consumption. Is there a way to pass large binary files from one step of the pipeline to the next?

I'm aware of fileio.py, and I tried using WriteToFiles to write the unpacked binary files, with no success: WriteToFiles apparently groups the data of all the files into a single output file. I'm also aware that I can implement my own IO transforms using FileBasedSource and FileBasedSink, but these classes seem to be record oriented, which is not very useful for us.

Is Apache Beam the right framework for us? Can we implement our system using Beam?

Thanks,
Ignacio.
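
P.S. For anyone who wants something concrete to comment on, here is a minimal sketch of the kind of unpacking step described above. It is an illustration under assumptions, not our actual code: the function name iter_tar_members is made up for this example, it reads the archive from memory instead of a temporary directory, and it uses only Python's standard tarfile module.

```python
import io
import tarfile
from typing import Iterator, Tuple


def iter_tar_members(archive_bytes: bytes) -> Iterator[Tuple[str, bytes]]:
    """Yield (member_name, content) for each regular file in a gzipped TAR.

    Hypothetical helper for illustration; the name and signature are
    assumptions, not part of Apache Beam or of our code base.
    """
    # "r:gz" opens a gzip-compressed archive from the in-memory file object.
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            extracted = tar.extractfile(member)
            if extracted is None:
                continue
            # Note: read() loads the whole member into memory at once.
            yield member.name, extracted.read()
```

In a pipeline, a generator like this could be applied with beam.FlatMap so that each archive expands into one element per contained file. It shares the drawback described above: every binary file is held fully in memory as a bytes object while it travels between steps.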
