This seems to work!
Thanks so much Eugene and Luke!

On Fri, Sep 11, 2020 at 11:33 AM Luke Cwik <[email protected]> wrote:

> Inserting the Reshuffle is the easiest answer to test that parallelization
> starts happening.
>
> If the performance is good but you're materializing too much data at the
> shuffle boundary you'll want to convert your high fanout function ("Read
> from Snowflake") into a splittable DoFn.
>
> On Fri, Sep 11, 2020 at 9:56 AM Eugene Kirpichov <[email protected]>
> wrote:
>
>> Hi,
>>
>> Most likely this is because of fusion - see
>> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
>> You need to insert a Reshuffle.viaRandomKey(), most likely after the
>> first step.
>>
>> On Fri, Sep 11, 2020 at 9:41 AM Alan Krumholz <[email protected]>
>> wrote:
>>
>>> Hi DataFlow team,
>>> I have a simple pipeline that I'm trying to speed up using DataFlow:
>>>
>>> [image: image.png]
>>>
>>> As you can see the bottleneck is the "transcribe mp3" step. I was hoping
>>> DataFlow would be able to run many of these in parallel to speed up the
>>> total execution time.
>>>
>>> However it seems it doesn't do that... and instead keeps executing
>>> all the independent inputs sequentially...
>>> Even when I tried to force it to start with many workers it rapidly
>>> shuts down most of them and only keeps one alive and doesn't ever seem to
>>> parallelize this step :(
>>>
>>> Any advice on what else to try to make it do this?
>>>
>>> Thanks so much!
>>>
>>
>>
>> --
>> Eugene Kirpichov
>> http://www.linkedin.com/in/eugenekirpichov
>>
>
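
For anyone finding this thread later, here is a rough toy model of why the reshuffle helped. It is plain Python, not Beam or Dataflow code, and every name in it (fused_assignment, reshuffled_assignment, the file names, the worker count) is made up for illustration. It assumes that with fusion all outputs of the high-fanout step stay on the worker that handled the single upstream element, and that a reshuffle via random key lets them be grouped and handed to different workers.

```python
import random
from collections import defaultdict


def fused_assignment(seed_worker, fanout):
    """Model fusion: every element produced by the high-fanout step stays
    on the worker that processed the single seed element."""
    return {seed_worker: [f"file_{i}.mp3" for i in range(fanout)]}


def reshuffled_assignment(elements, num_workers, rng):
    """Model a reshuffle via random key: tag each element with a random
    key, then group by key so each group can run on a different worker."""
    by_worker = defaultdict(list)
    for element in elements:
        # Assumption for this sketch: one key per worker.
        key = rng.randrange(num_workers)
        by_worker[key].append(element)
    return dict(by_worker)


fused = fused_assignment(seed_worker=0, fanout=100)
spread = reshuffled_assignment(fused[0], num_workers=4, rng=random.Random(0))

print("workers busy when fused:     ", len(fused))
print("workers busy after reshuffle:", len(spread))
```

In the real pipeline the equivalent change is the one Eugene describes: put the Reshuffle between the step that produces the file list and the "transcribe mp3" step.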
