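Hi,

Most likely this is because of fusion; see https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization. Dataflow is probably fusing "transcribe mp3" with the step before it, so the inputs are never redistributed across workers. You need to insert a Reshuffle.viaRandomKey() to break the fusion, most likely right after the first step.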
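To make the Reshuffle.viaRandomKey() suggestion concrete, here is a minimal sketch in the Beam Java SDK. It is only an illustration under assumptions: the step names, the hard-coded file list, and the transcribe() helper are hypothetical stand-ins, since the actual pipeline code is not shown in this thread. It also assumes the Beam SDK and a runner are on the classpath.

```java
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Create;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

class ReshuffleSketch {
  // Hypothetical placeholder for the expensive "transcribe mp3" step.
  static String transcribe(String mp3Path) {
    return "transcript of " + mp3Path;
  }

  // Builds: first step -> Reshuffle -> expensive step.
  static PCollection<String> buildPipeline(Pipeline p) {
    return p
        // Hypothetical first step: a small, fixed list of inputs.
        .apply("ListFiles", Create.of(Arrays.asList("a.mp3", "b.mp3", "c.mp3")))
        // Breaks fusion between the first step and the expensive one, so the
        // runner can redistribute elements across workers.
        .apply("BreakFusion", Reshuffle.<String>viaRandomKey())
        .apply(
            "TranscribeMp3",
            MapElements.into(TypeDescriptors.strings())
                .via((String path) -> transcribe(path)));
  }

  public static void main(String[] args) {
    // Runner and worker settings come from the command line,
    // e.g. --runner=DataflowRunner.
    Pipeline p = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());
    buildPipeline(p);
    p.run().waitUntilFinish();
  }
}
```

The only part that matters is where the Reshuffle sits: between the step that produces the elements and the slow step that consumes them. Everything else here is scaffolding.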
On Fri, Sep 11, 2020 at 9:41 AM Alan Krumholz <[email protected]> wrote:

> Hi DataFlow team,
> I have a simple pipeline that I'm trying to speed up using DataFlow:
>
> [image: image.png]
>
> As you can see the bottleneck is the "transcribe mp3" step. I was hoping
> DataFlow would be able to run many of these in parallel to speed up the
> total execution time.
>
> However it seems it doesn't do that... and instead keeps executing it all
> independent inputs sequentially....
> Even when I tried to force it to start with many workers it rapidly shuts
> down most of them and only keeps one alive and doesn't ever seem to
> parallelize this step :(
>
> Any advice on what else to try to make it do this?
>
> Thanks so much!

-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
