Hi,

One idea could be to create a PCollection of the file names first and then process each image in a DoFn, as my guess is you probably don't want the PCollection itself to hold the sliced images.
# Boilerplate Python code for Beam
pcollection = p | beam.Create([None])
list_of_images = pcollection | beam.FlatMap(get_list_of_image_files)
processed_images = list_of_images | beam.Map(process_image_given_filename)

Does this fit your use case? The text sources work one record at a time, so they can parallelize by splitting a single source file.

Thanks
Sourabh

On Wed, Apr 12, 2017 at 1:06 PM Tom Pollard <[email protected]> wrote:
> I have a large collection of images on GCS and was interested in trying to
> use Dataflow/Beam to run analyses on these. It looks like the existing IOs
> are all oriented towards textual or structured data, and that there's no IO
> that makes the metadata on GCS storage objects available to a Beam
> pipeline. Is that the case, or am I missing something?
>
> Tom Pollard
> Senior Software Engineer
> FLASHPOINT
> e: [email protected]
> w: www.flashpoint-intel.com
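For what it's worth, the two callables in the snippet above are left undefined. Below is a minimal sketch of what they might look like; the function names follow the snippet, but the `images` directory, the `IMAGE_EXTENSIONS` tuple, and the "return the file size" analysis are all hypothetical stand-ins. In a real pipeline you would list `gs://` objects with a GCS client rather than `os.listdir`, and do your actual image analysis in the second function.

import os

# Hypothetical extension filter; adjust to the formats you store on GCS.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")

def get_list_of_image_files(_seed, image_dir="images"):
    """FlatMap callable: expand the single seed element (ignored) into
    one element per image filename under image_dir.

    image_dir is a hypothetical local directory standing in for a GCS
    bucket prefix."""
    for name in sorted(os.listdir(image_dir)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            yield os.path.join(image_dir, name)

def process_image_given_filename(path):
    """Map callable: placeholder per-image analysis; here it just
    returns the filename paired with its size in bytes."""
    return path, os.path.getsize(path)

Because the seed PCollection has exactly one element, the listing runs once; the per-image work in the Map step is what Dataflow fans out across workers.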
