Hi, I am running Beam with the DataflowRunner and want to do 3 tasks:
1. Read an image from GCS
2. Process the image (data augmentation)
3. Serialize the image to a string

I could do all of this in a single DoFn, or I could split it into three DoFns, one per step. I don't know which would be better given the Beam model. Here are some thoughts:

- Doing it in a single DoFn wastes concurrency, e.g. with separate stages, one could be reading an image while another does the processing.
- Doing it in multiple DoFns might mean sending the images through the network, increasing latency.

Sorry if these questions are very basic. I am trying to get my head around this.

The pipeline I currently have processes about 15 images/sec, which seems really slow. Dataflow suggests that I increase some quotas to enable around 400 workers (is this overkill?).

Regards,
Cristian
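
For concreteness, the two layouts being compared could be sketched roughly as below. This is only an illustration under stated assumptions, not the actual pipeline: the function names (`read_image`, `augment`, `serialize`, `process_all`, `run`) are hypothetical, the augmentation is a stand-in that just reverses the bytes, and "serialize to a string" is assumed to mean base64 encoding. `beam.Map` is used for brevity; it wraps each function in a DoFn.

```python
import base64
import io


def read_image(path, open_fn=None):
    """Step 1: return the raw bytes stored at `path` (e.g. a gs:// URI)."""
    if open_fn is None:
        # Imported lazily so the plain functions can be tried without Beam.
        from apache_beam.io.filesystems import FileSystems
        open_fn = FileSystems.open
    with open_fn(path) as f:
        return f.read()


def augment(data):
    """Step 2: placeholder for real data augmentation (assumption: reverses bytes)."""
    return data[::-1]


def serialize(data):
    """Step 3: encode the bytes as an ASCII string (assumption: base64)."""
    return base64.b64encode(data).decode("ascii")


def process_all(path, open_fn=None):
    """Single-step variant: all three steps inside one function."""
    return serialize(augment(read_image(path, open_fn)))


def run(paths, split=True):
    """Build and run either layout; `paths` is a list of image URIs."""
    import apache_beam as beam

    with beam.Pipeline() as p:
        uris = p | beam.Create(paths)
        if split:
            # Layout A: one transform per step.
            out = (
                uris
                | "Read" >> beam.Map(read_image)
                | "Augment" >> beam.Map(augment)
                | "Serialize" >> beam.Map(serialize)
            )
        else:
            # Layout B: everything in a single transform.
            out = uris | "ProcessAll" >> beam.Map(process_all)
        out | "Print" >> beam.Map(print)
```

Calling `run(["gs://my-bucket/img.jpg"], split=True)` or `split=False` (the bucket name is made up) would exercise one layout or the other; both produce the same output per element, so the difference in question is purely how the runner schedules the steps.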
