Caveat: you need to break fusion between list_of_images and
processed_images if you're using a runner that supports fusion, like
Dataflow.
See
https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
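
The reason this matters: without a break, the runner may fuse the listing step and the processing step into a single stage, so every image ends up handled by whichever worker produced the list. A minimal sketch of one way to break fusion, assuming a Beam Python SDK recent enough to ship beam.Reshuffle. The helpers list_image_files and process_image and the bucket paths are hypothetical stand-ins, not anything from the thread:

```python
def list_image_files(_):
    # Hypothetical stand-in: a real version would list object names in a bucket.
    return ["gs://example-bucket/a.png", "gs://example-bucket/b.png"]


def process_image(filename):
    # Hypothetical stand-in for the real per-image analysis.
    return (filename, len(filename))


def build_pipeline(pipeline):
    # Imported here so the two helpers above can be exercised without Beam installed.
    import apache_beam as beam

    list_of_images = (
        pipeline
        | "Seed" >> beam.Create([None])
        | "ListImages" >> beam.FlatMap(list_image_files)
    )
    # Reshuffle redistributes the filenames, so the runner cannot fuse the
    # listing step and the processing step into one stage.
    return (
        list_of_images
        | "BreakFusion" >> beam.Reshuffle()
        | "ProcessImages" >> beam.Map(process_image)
    )
```

To try it locally, something like `with beam.Pipeline() as p: build_pipeline(p)` should run it on the default direct runner.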

On Wed, Apr 12, 2017 at 1:23 PM Sourabh Bajaj wrote:

> Hi,
>
> One idea is to create a PCollection of the file names first and then
> process each image in a DoFn; my guess is that you don't want the
> PCollection to hold sliced images.
>
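> # Boilerplate Python code for Beam; get_list_of_image_files and
> # process_image_given_filename are functions you supply.
> import apache_beam as beam
>
> p = beam.Pipeline()
> pcollection = p | beam.Create([None])  # single seed element
> list_of_images = pcollection | beam.FlatMap(get_list_of_image_files)
> processed_images = list_of_images | beam.Map(process_image_given_filename)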
>
> Does this fit your use case? The text sources work on one record at a
> time, so they can parallelize by splitting a single source file.
>
> Thanks
> Sourabh
>
> On Wed, Apr 12, 2017 at 1:06 PM Tom Pollard wrote:
>
> I have a large collection of images on GCS and was interested in trying to
> use Dataflow/Beam to run analyses on these. It looks like the existing IOs
> are all oriented towards textual or structured data, and that there's
> no IO that makes the metadata on GCS storage objects available to a Beam
> pipeline. Is that the case, or am I missing something?
>
> *Tom Pollard*
> Senior Software Engineer
> *______________________________*
> *FLASHPOINT*
> w: www.flashpoint-intel.com
>
