Re: Dataflow isn't parallelizing

Alan Krumholz Fri, 11 Sep 2020 12:08:18 -0700

This seems to work!


Thanks so much Eugene and Luke!

On Fri, Sep 11, 2020 at 11:33 AM Luke Cwik <[email protected]> wrote:

> Inserting the Reshuffle is the easiest way to test whether parallelization
> starts happening.
>
> If the performance is good but you're materializing too much data at the
> shuffle boundary, you'll want to convert your high-fanout function ("Read
> from Snowflake") into a splittable DoFn.
>
> On Fri, Sep 11, 2020 at 9:56 AM Eugene Kirpichov <[email protected]>
> wrote:
>
>> Hi,
>>
>> Most likely this is because of fusion - see
>> https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization
>> . You need to insert a Reshuffle.viaRandomKey(), most likely after the
>> first step.
>>
>> On Fri, Sep 11, 2020 at 9:41 AM Alan Krumholz <[email protected]>
>> wrote:
>>
>>> Hi Dataflow team,
>>> I have a simple pipeline that I'm trying to speed up using Dataflow:
>>>
>>> [image: image.png]
>>>
>>> As you can see, the bottleneck is the "transcribe mp3" step. I was hoping
>>> Dataflow would be able to run many of these in parallel to speed up the
>>> total execution time.
>>>
>>> However, it doesn't seem to do that, and instead keeps processing all the
>>> independent inputs sequentially.
>>> Even when I tried to force it to start with many workers, it rapidly
>>> shuts down most of them, keeps only one alive, and never seems to
>>> parallelize this step :(
>>>
>>> Any advice on what else to try to make it do this?
>>>
>>> Thanks so much!
>>>
>>
>>
>> --
>> Eugene Kirpichov
>> http://www.linkedin.com/in/eugenekirpichov
>>
>
