Re: Dataflow isn't parallelizing
This seems to work! Thanks so much Eugene and Luke!

On Fri, Sep 11, 2020 at 11:33 AM Luke Cwik wrote:
> Inserting the Reshuffle is the easiest answer to test that parallelization
> starts happening.
>
> If the performance is good but you're materializing too much data at the
> shuffle boundary you'll want to convert your high fanout function ("Read
> from Snowflake") into a splittable DoFn.
Re: Dataflow isn't parallelizing
Inserting the Reshuffle is the easiest answer to test that parallelization starts happening.

If the performance is good but you're materializing too much data at the shuffle boundary, you'll want to convert your high fanout function ("Read from Snowflake") into a splittable DoFn.

On Fri, Sep 11, 2020 at 9:56 AM Eugene Kirpichov wrote:
> Hi,
>
> Most likely this is because of fusion - see
> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
> . You need to insert a Reshuffle.viaRandomKey(), most likely after the
> first step.
Re: Dataflow isn't parallelizing
Hi,

Most likely this is because of fusion - see
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization .
You need to insert a Reshuffle.viaRandomKey(), most likely after the first step.

On Fri, Sep 11, 2020 at 9:41 AM Alan Krumholz wrote:
> Hi DataFlow team,
> I have a simple pipeline that I'm trying to speed up using DataFlow:
> [...]
> As you can see the bottleneck is the "transcribe mp3" step. I was hoping
> DataFlow would be able to run many of these in parallel to speed up the
> total execution time.
> [...]

--
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov
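To make the fusion point concrete, here is a toy model in plain Python. It is not Beam code and does not reflect how Dataflow actually bundles elements; it only counts how many independently schedulable work units exist with and without a redistribution step between a high-fanout read and a slow per-element step. The names `read_source`, `transcribe`, the clip paths, and the fan-out of 8 are all made up for illustration. In a real pipeline the redistribution would be the `Reshuffle.viaRandomKey()` transform mentioned above (the Beam Python SDK has a corresponding `beam.Reshuffle()` transform).

```python
import random


def read_source(query):
    """Hypothetical high-fanout first step: one input yields many outputs."""
    return [f"{query}/clip_{i}.mp3" for i in range(8)]


def transcribe(path):
    """Hypothetical stand-in for the slow per-element step."""
    return f"transcript of {path}"


def fused_work_units(inputs):
    # Fused: read and transcribe run back to back in a single stage, so
    # there is only one schedulable unit per *source input*, no matter how
    # many elements the read step fans out to.
    return [[transcribe(p) for p in read_source(q)] for q in inputs]


def reshuffled_work_units(inputs, seed=0):
    # Redistributed: the read step's outputs are materialized first, then
    # handed out in arbitrary order. In this toy model every element
    # becomes its own schedulable unit for the transcribe step.
    paths = [p for q in inputs for p in read_source(q)]
    random.Random(seed).shuffle(paths)
    return [[transcribe(p)] for p in paths]


if __name__ == "__main__":
    fused = fused_work_units(["snowflake_query"])
    reshuffled = reshuffled_work_units(["snowflake_query"])
    print(len(fused))       # 1 unit: a single worker does all 8 transcriptions
    print(len(reshuffled))  # 8 units: up to 8 workers can share the work
```

With a single source input, the fused version leaves only one unit of work, which would be consistent with the autoscaler shutting down all but one worker. The cost of the redistributed version is that every output of the read step is materialized at the boundary, which is the trade-off Luke's reply points at.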
Dataflow isn't parallelizing
Hi DataFlow team,
I have a simple pipeline that I'm trying to speed up using DataFlow:

[image: image.png]

As you can see, the bottleneck is the "transcribe mp3" step. I was hoping DataFlow would be able to run many of these in parallel to speed up the total execution time.

However, it seems it doesn't do that and instead keeps executing all the independent inputs sequentially. Even when I tried to force it to start with many workers, it rapidly shuts down most of them, keeps only one alive, and never seems to parallelize this step :(

Any advice on what else to try to make it do this?

Thanks so much!