Python: Single vs Multiple DoFns for Image Processing

Cristian Garcia Thu, 21 Jun 2018 08:49:15 -0700

Hi,

I am running Beam with the DataflowRunner and want to do 3 tasks:


   1. Read an image from GCS
   2. Process the image (data augmentation)
   3. Serialize the image to a string

I could do all this in a single DoFn, but I could also split it into these
3 stages. I don't know what would be better given the Beam model. Here are
some thoughts:

   - Doing it in a single DoFn might waste concurrency: with separate
   stages, one could be reading an image while another does the processing.
   - Doing it in multiple DoFns might mean sending the images over the
   network, increasing latency.
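
To make the two options concrete, here is a rough sketch of what I mean. It is
not my real code: the function names, the no-op augmentation, the base64
encoding, and the `open_fn` parameter are placeholders made up for this email.
The only Beam calls assumed are `beam.Map` and `FileSystems.open`.

```python
import base64


def read_image(path, open_fn=None):
    """Step 1: read raw image bytes from a path such as gs://bucket/img.jpg."""
    if open_fn is None:
        # Imported lazily so the other steps can be tried without Beam installed.
        from apache_beam.io.filesystems import FileSystems
        open_fn = FileSystems.open
    with open_fn(path) as f:
        return f.read()


def augment(image_bytes):
    """Step 2: placeholder for data augmentation (hypothetical: returns a copy)."""
    return bytes(image_bytes)


def serialize(image_bytes):
    """Step 3: serialize the image to a string (assumed here: base64 text)."""
    return base64.b64encode(image_bytes).decode("ascii")


def process_all(path, open_fn=None):
    """Option A: all three steps in one function, applied as a single step."""
    return serialize(augment(read_image(path, open_fn)))


def build(paths, split=True):
    """Wire either layout onto a PCollection of image paths."""
    import apache_beam as beam

    if split:
        # Option B: three separate steps.
        return (
            paths
            | "Read" >> beam.Map(read_image)
            | "Augment" >> beam.Map(augment)
            | "Serialize" >> beam.Map(serialize)
        )
    # Option A: one combined step.
    return paths | "ReadAugmentSerialize" >> beam.Map(process_all)
```

Both layouts should produce the same output for a given path; my question is
only about which one the runner executes more efficiently.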

Sorry if these questions are very basic. I am trying to get my head around
this. The pipeline I currently have processes about 15 images/sec, which
seems really slow. Dataflow suggests that I increase some quotas to enable
around 400 workers (is this overkill?).

Regards,
Cristian
