Take a look at CompressedSource:
https://github.com/apache/beam/blob/master/sdks/java/core/src/main/java/org/apache/beam/sdk/io/CompressedSource.java

I feel as though you could follow the same pattern to decompress/decrypt the data as a wrapper.
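To sketch what that wrapper might look like: CompressedSource accepts a custom CompressedSource.DecompressingChannelFactory, so you can stack decryption and decompression over the raw file channel. This is only a rough sketch; the two open*Stream helpers below are hypothetical placeholders that you would wire to your PGP library (e.g. Bouncy Castle) and your LZ codec, since neither ships with the Beam SDK.

import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import org.apache.beam.sdk.io.CompressedSource;

/**
 * Layers GPG decryption and LZ decompression over the raw file channel,
 * following the same pattern as CompressedSource's built-in factories
 * for gzip and bzip2.
 */
class GpgLzChannelFactory implements CompressedSource.DecompressingChannelFactory {
  @Override
  public ReadableByteChannel createDecompressingChannel(ReadableByteChannel channel)
      throws IOException {
    // Adapt the channel to a stream so ordinary stream wrappers can be stacked on it.
    InputStream raw = Channels.newInputStream(channel);
    // Hypothetical helpers: wire these to your PGP library and LZ codec.
    // Neither is provided by the Beam SDK.
    InputStream decrypted = openGpgDecryptingStream(raw);
    InputStream decompressed = openLzDecompressingStream(decrypted);
    return Channels.newChannel(decompressed);
  }

  private static InputStream openGpgDecryptingStream(InputStream in) throws IOException {
    throw new UnsupportedOperationException("plug in PGP decryption here");
  }

  private static InputStream openLzDecompressingStream(InputStream in) throws IOException {
    throw new UnsupportedOperationException("plug in LZ decompression here");
  }
}

Because there is no way to seek efficiently into the encrypted stream, each file read through such a factory ends up as a single unsplittable unit of work, for the reason described below.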
Apache Beam supports a concept of dynamic work rebalancing (https://beam.apache.org/blog/2016/05/18/splitAtFraction-method.html), which allows a runner to utilize its fleet of machines more efficiently. For file-based sources, this relies on being able to partition each input file into smaller pieces for parallel processing. Unless the compression/encryption method supports efficient seeking within the uncompressed data, the smallest granularity of work rebalancing you will have is the whole file, since decompressing/decrypting from the beginning of the file just to reach an arbitrary offset is usually very inefficient.

On Tue, Jun 20, 2017 at 6:00 AM, Sachin Shetty <sachin.she...@gmail.com> wrote:
> Hi,
>
> I am trying to process some of our access logs using Beam. Log files are
> archived to a GCS bucket, but they are lz compressed and encrypted using
> gpg.
>
> Any idea how I could load the files into a pipeline without decrypting,
> decompressing, and staging them before feeding them to a Beam pipeline?
>
> I see that I could write a custom coder, but I could not figure out the
> specifics.
>
> Thanks
> Sachin
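For completeness, wiring the factory into a pipeline might look roughly like this. LogLineSource stands in for a FileBasedSource<String> subclass you would write for the raw log format; it is not a Beam class, just a placeholder for this sketch.

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.CompressedSource;
import org.apache.beam.sdk.io.Read;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.values.PCollection;

public class ReadEncryptedLogs {
  public static void main(String[] args) {
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

    // LogLineSource is a hypothetical FileBasedSource<String> subclass for the
    // raw log format; it is not part of the Beam SDK.
    CompressedSource<String> source =
        CompressedSource.from(new LogLineSource("gs://my-bucket/access-logs/*.lz.gpg"))
            .withDecompression(new GpgLzChannelFactory());

    PCollection<String> lines = p.apply(Read.from(source));
    // ... downstream transforms over the decrypted, decompressed log lines ...

    p.run();
  }
}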