I could be wrong, but I believe that if your large file is being read by a single DoFn, the file is processed atomically inside that DoFn, and the runner cannot parallelize that work any further.
One purpose-built way around that constraint is Splittable DoFn [1][2], which allows each split to read a portion of the file. I don't know, however, how this might (or might not) work with compression. (A rough sketch of the idea follows after the quoted message below.)

[1] https://beam.apache.org/blog/splittable-do-fn-is-available/
[2] https://beam.apache.org/documentation/programming-guide/#splittable-dofns

Thanks,
Evan

On Fri, Apr 23, 2021 at 07:34 Simon Gauld <[email protected]> wrote:

> Hello,
>
> I am trying to apply a transformation to each row in a reasonably large
> (1b row) gzip-compressed CSV.
>
> The first operation is to assign a sequence number, in this case 1,2,3..
>
> The second operation is the actual transformation.
>
> I would like to apply the sequence number *as* each row is read from the
> compressed source and then hand off the 'real' transformation work in
> parallel, using Dataflow to autoscale the workers for the transformation.
>
> I don't seem to be able to scale *until* all rows have been read; the
> pipeline appears to be blocked until decompression of the entire file is
> completed. At that point Dataflow autoscaling works as expected: it scales
> up and throughput is then high. The issue is that the decompression
> appears to block.
>
> My question: in Beam, is it possible to stream records from a compressed
> source without blocking the pipeline?
>
> Thank you
>
> .s
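For what it's worth, here is a minimal Python sketch of that idea. The class names (FileOffsetProvider, ReadLinesInRange) and the (path, size) element shape are purely illustrative, not an existing Beam IO, and per the caveat above it only applies to an uncompressed file: a gzip stream cannot be seeked into, so it cannot be split into byte ranges like this. Each restriction is a byte range of the file, and the runner is free to split that range across workers.

    import apache_beam as beam
    from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker
    from apache_beam.transforms.core import RestrictionProvider


    class FileOffsetProvider(RestrictionProvider):
        """Restrictions are byte ranges of the file; the runner may split them."""

        def initial_restriction(self, element):
            _, size_in_bytes = element          # element is a (path, size) tuple
            return OffsetRange(0, size_in_bytes)

        def create_tracker(self, restriction):
            return OffsetRestrictionTracker(restriction)

        def restriction_size(self, element, restriction):
            return restriction.size()


    class ReadLinesInRange(beam.DoFn):
        """Emits the lines of an uncompressed local text file that start
        inside the claimed byte range."""

        def process(
            self,
            element,
            tracker=beam.DoFn.RestrictionParam(FileOffsetProvider())):
            path, _ = element
            rng = tracker.current_restriction()
            with open(path, 'rb') as f:
                f.seek(rng.start)
                if rng.start > 0:
                    f.readline()   # the previous range owns the line we landed inside
                pos = f.tell()
                while tracker.try_claim(pos):
                    line = f.readline()
                    if not line:
                        break
                    yield line.decode('utf-8').rstrip('\n')
                    pos = f.tell()

You would feed it (path, size) elements, e.g. beam.Create([(path, os.path.getsize(path))]) followed by beam.ParDo(ReadLinesInRange()), and the runner can then parallelize the read by splitting the byte range.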
