Hi,

One idea could be to first create a PCollection of the file names, and then
process each image in a DoFn, since my guess is you probably don't want the
PCollection itself to contain sliced-up images.

# Beam (Python SDK) sketch: seed with a single element, fan out to one
# element per filename, then process each image independently.
pcollection = p | beam.Create([None])
list_of_images = pcollection | beam.FlatMap(get_list_of_image_files)
processed_images = list_of_images | beam.Map(process_image_given_filename)

Does this fit your use case? The built-in text sources are designed to work
one record at a time, which is what lets them parallelize by splitting a
single source file; for images you generally want one element per file
instead.
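To make the fan-out concrete, here is a plain-Python model (no Beam
dependency) of what the two transforms do to the data: FlatMap expands the
single seed element into one element per filename, and Map then processes
each filename independently. The function names and GCS paths are
placeholders standing in for your own listing and analysis code.

```python
# Plain-Python model of the two-stage pipeline above (no Beam required).
# FlatMap: one input element -> many output elements (here, filenames).
# Map: one element in -> one element out (here, a processed result).

def get_list_of_image_files(_seed):
    # Placeholder: in the real pipeline this would list objects on GCS,
    # e.g. via the GCS client library.
    return ["gs://bucket/a.png", "gs://bucket/b.png", "gs://bucket/c.png"]

def process_image_given_filename(filename):
    # Placeholder: read the image bytes and run the analysis here.
    return (filename, "processed")

# beam.Create([None]) -> a one-element PCollection
seed = [None]
# beam.FlatMap(get_list_of_image_files) -> one element per filename
list_of_images = [f for s in seed for f in get_list_of_image_files(s)]
# beam.Map(process_image_given_filename) -> one result per image
processed_images = [process_image_given_filename(f) for f in list_of_images]
```

Because each filename is its own element, the runner is free to distribute
the per-image work across workers, which is the parallelism you want here.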

Thanks
Sourabh

On Wed, Apr 12, 2017 at 1:06 PM Tom Pollard <[email protected]>
wrote:

> I have a large collection of images on GCS and was interested in trying to
> use Dataflow/BEAM to run analyses on these.  It looks like the existing IOs
> are all oriented towards textual data or structured data, and that there's
> no IO that makes the metadata on GCS storage objects available to a BEAM
> pipeline.  Is that the case, or am I missing something?
>
> *Tom Pollard*
> Senior Software Engineer
> *______________________________*
> *FLASHPOINT*
> e:  [email protected] <[email protected]>
> w: www.flashpoint-intel.com
>
