Hi Robert! I read the images from GCS using TensorFlow's "FileIO" module. I am starting to realize that maybe the bottleneck is the machine type; I'll use some machines that have better CPUs to process the images.
Regards,
Cristian

On Thu, Jun 21, 2018 at 10:52 AM Robert Bradshaw <[email protected]> wrote:

> I would write these as three separate DoFns; they will get fused together
> to minimize IO.
>
> 400 workers may not be overkill, depending on how many images you have. Is
> Dataflow not scaling up and sharing the work? Where is your list of images
> coming from?
>
> On Thu, Jun 21, 2018 at 8:49 AM Cristian Garcia <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am running Beam with the DataflowRunner and want to do 3 tasks:
>>
>> 1. Read an image from GCS
>> 2. Process the image (data augmentation)
>> 3. Serialize the image to a string
>>
>> I could do all this in a single DoFn, but I could also split it into
>> these 3 stages. I don't know what would be better given the Beam model.
>> Here are some thoughts:
>>
>> - Doing it in a single DoFn wastes concurrency, e.g. one stage could be
>> reading an image while another does the processing.
>> - Doing it in multiple DoFns might mean sending the images through
>> the network, increasing latency.
>>
>> Sorry if these questions are very basic. I am trying to get my head around
>> this. The pipeline I currently have is processing about 15 imgs/sec, which
>> seems really slow; Dataflow suggests that I increase some quotas to enable
>> around 400 workers (is this overkill?)
>>
>> Regards,
>> Cristian
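For anyone following along: the three stages Robert suggests splitting into separate DoFns can be sketched without a real cluster. Below is a minimal stdlib-only sketch of the idea; `read_image`, `augment`, and `serialize` are hypothetical stand-ins for the real GCS read, data augmentation, and serialization steps, and the comments note where the actual Beam constructs would go. Because consecutive DoFns get fused by the Dataflow runner, chaining them behaves like the single generator pipeline shown here: each element flows through all three steps on one worker, with no intermediate materialization or network hop between stages.

```python
import base64

# Hypothetical stand-ins for the three stages discussed in the thread.
# In the real pipeline each would be a separate beam.DoFn (or a function
# passed to beam.Map), and the runner would fuse the chained stages.

def read_image(path):
    # Stand-in for reading raw bytes from GCS (e.g. via TensorFlow's
    # file IO utilities in the real pipeline).
    return b"raw-bytes-for:" + path.encode("utf-8")

def augment(image_bytes):
    # Stand-in for data augmentation (flips, crops, etc.); here it
    # just reverses the bytes so the transformation is observable.
    return image_bytes[::-1]

def serialize(image_bytes):
    # Stand-in for serializing the processed image to a string record.
    return base64.b64encode(image_bytes).decode("ascii")

def pipeline(paths):
    # Fused view of the three stages: each path flows through
    # read -> augment -> serialize with no intermediate collection,
    # which is what fusion gives you on Dataflow.
    for p in paths:
        yield serialize(augment(read_image(p)))

records = list(pipeline(["gs://bucket/img1.png"]))
```

The design point is that splitting the logic into three transforms costs nothing at runtime once they are fused, while keeping each step independently testable and reusable.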
