Re: Dataflow isn't parallelizing

2020-09-11 Thread Alan Krumholz
This seems to work!


Thanks so much, Eugene and Luke!


Re: Dataflow isn't parallelizing

2020-09-11 Thread Luke Cwik
Inserting the Reshuffle is the easiest way to test whether parallelization
starts happening.

If the performance is then good but you're materializing too much data at
the shuffle boundary, you'll want to convert your high-fanout function (the
"Read from Snowflake" step?) into a splittable DoFn.
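
For reference, a minimal splittable DoFn sketch in the Python SDK, assuming
a source whose total row count is known up front (count_rows and fetch_row
are hypothetical helpers, not Beam APIs):

import apache_beam as beam
from apache_beam.io.restriction_trackers import (
    OffsetRange, OffsetRestrictionTracker)
from apache_beam.transforms.core import RestrictionProvider

class RowRangeProvider(RestrictionProvider):
    # One restriction initially covers every row; the runner may split it
    # and hand the pieces to different workers.
    def initial_restriction(self, query):
        return OffsetRange(0, count_rows(query))  # hypothetical helper

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, query, restriction):
        return restriction.size()

class ReadRows(beam.DoFn):
    def process(self, query,
                tracker=beam.DoFn.RestrictionParam(RowRangeProvider())):
        pos = tracker.current_restriction().start
        while tracker.try_claim(pos):
            yield fetch_row(query, pos)  # hypothetical helper
            pos += 1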


Re: Dataflow isn't parallelizing

2020-09-11 Thread Eugene Kirpichov
Hi,

Most likely this is because of fusion (see
https://cloud.google.com/dataflow/docs/guides/deploying-a-pipeline#fusion-optimization).
You need to insert a Reshuffle.viaRandomKey(), most likely after the
first step.
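
Reshuffle.viaRandomKey() is the Java spelling; in the Python SDK the same
fusion break is beam.Reshuffle(). A minimal sketch of where it slots in,
with a hypothetical URI listing as the first step and a hypothetical
transcribe function standing in for the real work:

import apache_beam as beam

def transcribe(uri):
    # hypothetical: run ffmpeg / the actual transcription against `uri`
    return uri

with beam.Pipeline() as p:
    (p
     | 'list mp3s' >> beam.io.ReadFromText('gs://bucket/mp3-uris.txt')
     # Reshuffle materializes elements at a shuffle boundary, breaking
     # fusion so the expensive step below can be balanced across workers.
     | 'break fusion' >> beam.Reshuffle()
     | 'transcribe mp3' >> beam.Map(transcribe))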


-- 
Eugene Kirpichov
http://www.linkedin.com/in/eugenekirpichov


Dataflow isn't parallelizing

2020-09-11 Thread Alan Krumholz
Hi Dataflow team,
I have a simple pipeline that I'm trying to speed up using Dataflow:

[image: image.png]

As you can see, the bottleneck is the "transcribe mp3" step. I was hoping
Dataflow would be able to run many of these in parallel to speed up the
total execution time.

However, it seems it doesn't do that and instead keeps executing all
independent inputs sequentially.
Even when I try to force it to start with many workers, it rapidly shuts
down most of them, keeps only one alive, and never seems to parallelize
this step :(

Any advice on what else to try to get this step to parallelize?

Thanks so much!


Re: Installing ffmpeg on a python dataflow job

2020-09-11 Thread Alan Krumholz
Hi Luke,
I copied the setup.py file from the docs without modifying it much, and
now it works.

Thank you

On Thu, Sep 10, 2020 at 11:12 AM Luke Cwik  wrote:

> Did the ffmpeg install change $PATH? (The change may not be visible to
> the current process.)
> Have you tried the full path to the executable?
>
> On Wed, Sep 9, 2020 at 1:48 PM Alan Krumholz 
> wrote:
>
>> Hi Dataflow team,
>>
>> We are trying to use ffmpeg to process some video data using Dataflow.
>> In order to do this, we need the worker nodes to have ffmpeg installed.
>>
>> After reading the Beam docs, I created a setup.py file for my job like
>> this:
>>
>> #!/usr/bin/python
>> import subprocess
>> from distutils.command.build import build as _build
>> import setuptools
>>
>> class build(_build):
>>     sub_commands = _build.sub_commands + [('CustomCommands', None)]
>>
>> class CustomCommands(setuptools.Command):
>>     def initialize_options(self):
>>         pass
>>
>>     def finalize_options(self):
>>         pass
>>
>>     def RunCustomCommand(self, command_list):
>>         p = subprocess.Popen(
>>             command_list,
>>             stdin=subprocess.PIPE,
>>             stdout=subprocess.PIPE,
>>             stderr=subprocess.STDOUT)
>>         stdout_data, _ = p.communicate()
>>         if p.returncode != 0:
>>             raise RuntimeError(
>>                 'Command %s failed: exit code: %s' % (
>>                     command_list, p.returncode))
>>
>>     def run(self):
>>         for command in CUSTOM_COMMANDS:
>>             self.RunCustomCommand(command)
>>
>> CUSTOM_COMMANDS = [
>>     ['apt-get', 'update'],
>>     ['apt-get', 'install', '-y', 'ffmpeg']]
>>
>> REQUIRED_PACKAGES = [
>>     'boto3==1.11.17',
>>     'ffmpeg-python==0.2.0',
>>     'google-cloud-storage==1.31.0']
>>
>> setuptools.setup(
>>     name='DataflowJob',
>>     version='0.1',
>>     install_requires=REQUIRED_PACKAGES,
>>     packages=setuptools.find_packages(),
>>     mdclass={
>>         'build': build,
>>         'CustomCommands': CustomCommands})
>>
>> However, when I run the job I still get an error saying that ffmpeg is
>> not installed: "No such file or directory: 'ffmpeg'"
>>
>> Any clue what I am doing wrong?
>>
>> Thanks so much!
>>
>>
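
For anyone who hits the same error: the detail that most likely changed
when copying setup.py from the docs is the keyword on the final call.
setuptools reads custom command classes from `cmdclass`; an unknown keyword
like `mdclass` is ignored (with a warning), so CustomCommands and its
apt-get install of ffmpeg never run. A minimal sketch of the corrected
call, reusing the names from the snippet above:

setuptools.setup(
    name='DataflowJob',
    version='0.1',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    # `cmdclass` (not `mdclass`) is the keyword setuptools actually reads
    # custom command classes from.
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands})

Luke's $PATH question can also be checked from worker code with
shutil.which('ffmpeg'), which returns the resolved path or None.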