Hi Robert! I read the images from GCS using TensorFlow's "FileIO" module. I am starting to realize that maybe the bottleneck is the machine type; I'll use some machines that have better CPUs to process the images.
Regards,
Cristian

On Thu, Jun 21, 2018 at 10:52 AM Robert Bradshaw <[email protected]> wrote:

> I would write these as three separate DoFns; they will get fused together
> to minimize IO.
>
> 400 workers may not be overkill, depending on how many images you have. Is
> Dataflow not scaling up and sharing the work? Where is your list of images
> coming from?
>
> On Thu, Jun 21, 2018 at 8:49 AM Cristian Garcia <[email protected]>
> wrote:
>
>> Hi,
>>
>> I am running Beam with the DataflowRunner and want to do 3 tasks:
>>
>> 1. Read an image from GCS
>> 2. Process the image (data augmentation)
>> 3. Serialize the image to a string
>>
>> I could do all this in a single DoFn, but I could also split it into
>> these 3 stages. I don't know what would be better given the Beam model.
>> Here are some thoughts:
>>
>> - Doing it in a single DoFn wastes concurrency, e.g. one stage could be
>> reading an image while another does the processing.
>> - Doing it in multiple DoFns might mean sending the images through
>> the network, increasing latency.
>>
>> Sorry if these questions are very basic. I am trying to get my head around
>> this. The pipeline I currently have is processing about 15 imgs/sec, which
>> seems really slow; Dataflow suggests that I increase some quotas to enable
>> around 400 workers (is this overkill?)
>>
>> Regards,
>> Cristian
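For anyone following along: the three stages Robert suggests splitting into separate DoFns can be sketched without a real cluster. Below is a minimal stdlib-only sketch of the idea; `read_image`, `augment`, and `serialize` are hypothetical stand-ins for the real GCS read, data augmentation, and serialization steps, and the comments note where the actual Beam constructs would go. Because consecutive DoFns get fused by the Dataflow runner, chaining them behaves like the single generator pipeline shown here: each element flows through all three steps on one worker, with no intermediate materialization or network hop between stages.

```python
import base64

# Hypothetical stand-ins for the three stages discussed in the thread.
# In the real pipeline each would be a separate beam.DoFn (or a function
# passed to beam.Map), and the runner would fuse the chained stages.

def read_image(path):
    # Stand-in for reading raw bytes from GCS (e.g. via TensorFlow's
    # file IO utilities in the real pipeline).
    return b"raw-bytes-for:" + path.encode("utf-8")

def augment(image_bytes):
    # Stand-in for data augmentation (flips, crops, etc.); here it
    # just reverses the bytes so the transformation is observable.
    return image_bytes[::-1]

def serialize(image_bytes):
    # Stand-in for serializing the processed image to a string record.
    return base64.b64encode(image_bytes).decode("ascii")

def pipeline(paths):
    # Fused view of the three stages: each path flows through
    # read -> augment -> serialize with no intermediate collection,
    # which is what fusion gives you on Dataflow.
    for p in paths:
        yield serialize(augment(read_image(p)))

records = list(pipeline(["gs://bucket/img1.png"]))
```

The design point is that splitting the logic into three transforms costs nothing at runtime once they are fused, while keeping each step independently testable and reusable.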
