Take a look at CompressedSource:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java

I feel as though you could follow the same pattern and decompress/decrypt
the data with a wrapper around the underlying source; a rough sketch of what
that could look like is below.
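
Very roughly, something like this (an untested sketch, not the one true way to
do it; DecryptingInputStream and LzInputStream are placeholders for whatever
GPG / LZ stream implementations you already use, e.g. BouncyCastle or
commons-compress):

import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;

import org.apache.beam.sdk.io.CompressedSource.DecompressingChannelFactory;

/**
 * Wraps each file's raw byte channel with GPG decryption followed by LZ
 * decompression, so the rest of the source only ever sees plain text.
 */
class GpgLzChannelFactory implements DecompressingChannelFactory {
  @Override
  public ReadableByteChannel createDecompressingChannel(ReadableByteChannel channel)
      throws IOException {
    InputStream raw = Channels.newInputStream(channel);
    // Placeholder stream wrappers; substitute your own GPG and LZ
    // implementations here.
    InputStream decrypted = new DecryptingInputStream(raw);
    InputStream decompressed = new LzInputStream(decrypted);
    return Channels.newChannel(decompressed);
  }
}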

Apache Beam supports a concept of dynamic work rebalancing (
https://beam.apache.org/blog/2016/05/18/splitAtFraction-method.html) which
allows a runner to use its fleet of machines more efficiently. For
file-based sources, this relies on being able to partition each input file
into smaller pieces for parallel processing. Unless that
compression/encryption method supports efficient seeking within the
uncompressed data, the smallest granularity of work rebalancing you will
have is at the whole file level (since decompressing/decrypting from the
beginning of the file to read an arbitrary offset is usually very
inefficient).
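
For completeness, wiring such a source into a pipeline would look roughly
like this (again only a sketch; MyLogLineSource stands in for a hypothetical
FileBasedSource<String> you would write for your log format). Because the
factory has to decrypt/decompress from the start of each file, every file is
read as a single, unsplittable unit of work:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.CompressedSource;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadEncryptedLogs {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // MyLogLineSource is a hypothetical FileBasedSource<String> that emits one
    // record per line from the already decrypted/decompressed byte stream.
    CompressedSource<String> source =
        CompressedSource.from(new MyLogLineSource("gs://my-bucket/access-logs/*"))
            .withDecompression(new GpgLzChannelFactory());

    // Each file is decrypted/decompressed as a whole, so the finest unit of
    // dynamic work rebalancing here is one file.
    PCollection<String> lines = p.apply(Read.from(source));

    p.run();
  }
}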

On Tue, Jun 20, 2017 at 6:00 AM, Sachin Shetty <sachin.she...@gmail.com>
wrote:

> Hi,
>
> I am trying to process some of our access logs using beam. Log files are
> archived to a GCS bucket, but they are lz compressed and encrypted using
> gpg.
>
> Any idea how I could load the files into a pipeline without
> decrypting, decompressing, and staging the file before feeding it to a Beam
> pipeline?
>
> I see that I could write a custom coder, but I could not figure out the
> specifics.
>
> Thanks
> Sachin
>
