I could be wrong, but I believe that if your large file is being read by a single DoFn, the file is processed atomically inside that DoFn, and the runner cannot parallelize that work any further.
One purpose-built way around that constraint is Splittable DoFn [1][2], which allows each split to read a portion of the file. I don't know, however, how this might (or might not) work with compression. (A rough sketch of the idea follows after the quoted message below.)

[1] https://beam.apache.org/blog/splittable-do-fn-is-available/
[2] https://beam.apache.org/documentation/programming-guide/#splittable-dofns

Thanks,
Evan

On Fri, Apr 23, 2021 at 07:34 Simon Gauld <[email protected]> wrote:

> Hello,
>
> I am trying to apply a transformation to each row in a reasonably large
> (1b row) gzip-compressed CSV.
>
> The first operation is to assign a sequence number, in this case 1,2,3..
>
> The second operation is the actual transformation.
>
> I would like to apply the sequence number *as* each row is read from the
> compressed source and then hand off the 'real' transformation work in
> parallel, using Dataflow to autoscale the workers for the transformation.
>
> I don't seem to be able to scale *until* all rows have been read; the
> pipeline appears to be blocked until decompression of the entire file is
> completed. At that point Dataflow autoscaling works as expected: it scales
> up and throughput is then high. The issue is that the decompression
> appears to block.
>
> My question: in Beam, is it possible to stream records from a compressed
> source without blocking the pipeline?
>
> Thank you
>
> .s
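For what it's worth, here is a minimal Python sketch of that idea. The class names (FileOffsetProvider, ReadLinesInRange) and the (path, size) element shape are purely illustrative, not an existing Beam IO, and per the caveat above it only applies to an uncompressed file: a gzip stream cannot be seeked into, so it cannot be split into byte ranges like this. Each restriction is a byte range of the file, and the runner is free to split that range across workers.

    import apache_beam as beam
    from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker
    from apache_beam.transforms.core import RestrictionProvider


    class FileOffsetProvider(RestrictionProvider):
        """Restrictions are byte ranges of the file; the runner may split them."""

        def initial_restriction(self, element):
            _, size_in_bytes = element          # element is a (path, size) tuple
            return OffsetRange(0, size_in_bytes)

        def create_tracker(self, restriction):
            return OffsetRestrictionTracker(restriction)

        def restriction_size(self, element, restriction):
            return restriction.size()


    class ReadLinesInRange(beam.DoFn):
        """Emits the lines of an uncompressed local text file that start
        inside the claimed byte range."""

        def process(
            self,
            element,
            tracker=beam.DoFn.RestrictionParam(FileOffsetProvider())):
            path, _ = element
            rng = tracker.current_restriction()
            with open(path, 'rb') as f:
                f.seek(rng.start)
                if rng.start > 0:
                    f.readline()   # the previous range owns the line we landed inside
                pos = f.tell()
                while tracker.try_claim(pos):
                    line = f.readline()
                    if not line:
                        break
                    yield line.decode('utf-8').rstrip('\n')
                    pos = f.tell()

You would feed it (path, size) elements, e.g. beam.Create([(path, os.path.getsize(path))]) followed by beam.ParDo(ReadLinesInRange()), and the runner can then parallelize the read by splitting the byte range.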
