Fellow beamers, I am facing a problem with the architecture of an ML pipeline running in batch mode.
I have a set of images that I need to split across several TFRecord files in order to train a model. The problem is that the number of images varies from run to run, so I have no idea of the number of shards upfront. The location where the TFRecords are saved is also decided inside the pipeline.

1) Is it possible to tell beam.io.tfrecordio.WriteToTFRecord() how many images should go into each TFRecord file? A sketch of what I am trying is below.

2) A workaround I found is to have a Cloud Function do the splitting, which takes about 50 seconds, and then launch one Dataflow job per chunk via a template (see the second sketch). This leaves me with around 40 jobs running simultaneously. Is this a problem? Is it bad practice?
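To make question 1 concrete, here is a rough sketch of the kind of thing I am trying. Everything in it (IMAGES_PER_SHARD, make_tf_example, the glob and prefix arguments) is a placeholder. It only works because it counts the files before the pipeline is built; as far as I can tell, WriteToTFRecord only accepts num_shards (the number of output files) at construction time, not a records-per-file limit, which is exactly my problem when the count is only known inside the pipeline.

import math

import apache_beam as beam
import tensorflow as tf
from apache_beam.io.filesystems import FileSystems
from apache_beam.options.pipeline_options import PipelineOptions

IMAGES_PER_SHARD = 1000  # placeholder: desired images per TFRecord file


def make_tf_example(path):
    # Placeholder encoding: stores the raw image bytes and the source path.
    with tf.io.gfile.GFile(path, 'rb') as f:
        data = f.read()
    example = tf.train.Example(features=tf.train.Features(feature={
        'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[data])),
        'path': tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[path.encode('utf-8')])),
    }))
    return example.SerializeToString()


def run(input_glob, output_prefix):
    # Count the images before building the pipeline, because num_shards
    # has to be known at pipeline construction time.
    paths = [m.path for m in FileSystems.match([input_glob])[0].metadata_list]
    num_shards = max(1, math.ceil(len(paths) / IMAGES_PER_SHARD))

    with beam.Pipeline(options=PipelineOptions()) as p:
        (p
         | 'ListImages' >> beam.Create(paths)
         | 'ToTFExample' >> beam.Map(make_tf_example)
         | 'WriteTFRecords' >> beam.io.tfrecordio.WriteToTFRecord(
             output_prefix, num_shards=num_shards))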
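And here is roughly what the Cloud Function does for question 2: it launches one templated Dataflow job per chunk through the Dataflow templates launch API. The project, region, template path, and parameter names below are placeholders; in my pipeline the parameters are ValueProvider arguments.

from googleapiclient.discovery import build

PROJECT = 'my-project'                                 # placeholder
REGION = 'us-central1'                                 # placeholder
TEMPLATE = 'gs://my-bucket/templates/tfrecord-writer'  # placeholder


def launch_jobs(chunks):
    dataflow = build('dataflow', 'v1b3')
    for i, chunk in enumerate(chunks):
        body = {
            'jobName': 'tfrecord-chunk-%d' % i,
            'parameters': {
                # Placeholder template parameters: where the chunk manifest
                # lives and where the resulting TFRecords should be written.
                'input_manifest': chunk['manifest_path'],
                'output_prefix': chunk['output_prefix'],
            },
        }
        response = dataflow.projects().locations().templates().launch(
            projectId=PROJECT,
            location=REGION,
            gcsPath=TEMPLATE,
            body=body,
        ).execute()
        print('Launched job %s' % response.get('job', {}).get('id'))

With around 40 chunks this fires around 40 launch requests back to back, and all the jobs run at the same time, which is the part I am unsure about.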
Thank you very much.

André Rocha Silva
Data Engineer