Re: Dataflow isn't parallelizing

Eugene Kirpichov Fri, 11 Sep 2020 09:56:18 -0700

Hi,

Most likely this is because of fusion. See
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
for details. To prevent it, insert a Reshuffle.viaRandomKey(), most likely right
after the first step.
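
To illustrate why this helps, here is a rough plain-Java sketch of the idea
behind a random-key reshuffle. It is not Beam code and not how Dataflow is
implemented. The class and method names (ReshuffleSketch, groupByRandomKey,
processInParallel) are made up for this example. Each element gets a random
key, elements are grouped by key, and each group then becomes a unit of work
that can be handed to a different worker instead of all elements flowing
through one fused step in sequence.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical illustration only; none of these names come from Beam.
class ReshuffleSketch {

    // Pair every element with a random key in [0, numKeys) and group by that key.
    static <T> Map<Integer, List<T>> groupByRandomKey(List<T> elements, int numKeys, Random rng) {
        Map<Integer, List<T>> groups = new HashMap<>();
        for (T element : elements) {
            int key = rng.nextInt(numKeys);
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(element);
        }
        return groups;
    }

    // Treat each group as an independent task and run the tasks on a thread pool.
    static <T, R> List<R> processInParallel(Map<Integer, List<T>> groups,
                                            Function<T, R> fn, int workers) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<R>>> futures = new ArrayList<>();
            for (List<T> group : groups.values()) {
                futures.add(pool.submit(() -> {
                    List<R> out = new ArrayList<>();
                    for (T element : group) {
                        out.add(fn.apply(element));
                    }
                    return out;
                }));
            }
            // Collect results; the keys are dropped, only the values remain.
            List<R> results = new ArrayList<>();
            for (Future<List<R>> future : futures) {
                results.addAll(future.get());
            }
            return results;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        // Placeholder file names standing in for the inputs of the slow step.
        List<String> files = List.of("a.mp3", "b.mp3", "c.mp3", "d.mp3");
        Map<Integer, List<String>> groups = groupByRandomKey(files, 3, new Random());
        // The lambda stands in for the expensive "transcribe mp3" step.
        List<String> done = processInParallel(groups, f -> "transcribed:" + f, 3);
        System.out.println(done.size() + " files processed");
    }
}
```

In a real pipeline the runner does the grouping and distribution for you;
the only change on your side is applying the reshuffle between the step that
produces the inputs and the slow step.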


On Fri, Sep 11, 2020 at 9:41 AM Alan Krumholz <[email protected]>
wrote:

> Hi DataFlow team,
> I have a simple pipeline that I'm trying to speed up using DataFlow:
>
> [image: image.png]
>
> As you can see the bottleneck is the "transcribe mp3" step. I was hoping
> DataFlow would be able to run many of these in parallel to speed up the
> total execution time.
>
> However, it doesn't seem to do that, and instead keeps executing all the
> independent inputs sequentially.
> Even when I tried to force it to start with many workers, it rapidly shut
> down most of them, kept only one alive, and never seemed to
> parallelize this step :(
>
> Any advice on what else to try to make it do this?
>
> Thanks so much!
>


-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
