Hi everyone,
I'm researching how to handle a use case in Beam that I imagine is common, but for which I haven't been able to find an agreed-upon best practice yet.

*Use case:* I'm processing a stream or batch of records with ids, and for each record I want to check whether I've ever seen its id before (beyond the scope of the job execution). In particular, I'm going to be using Google Dataflow, and I plan to store and look up ids in Google Datastore.

*Question:* Is it advisable to look up the record id in Datastore per element in a DoFn? I am most worried about latency, and I am wary of the recommendation in the documentation for ParDo <https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/ParDo.html>, which says I'd have to be careful when I write to Datastore:

> "...if a DoFn's <https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/DoFn.html> execution has external side-effects, such as performing updates to external HTTP services, then the DoFn's <https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/DoFn.html> code needs to take care to ensure that those updates are idempotent and that concurrent updates are acceptable."

I found a relevant question on StackOverflow <https://stackoverflow.com/questions/40049621/datastore-queries-in-dataflow-dofn-slow-down-pipeline-when-run-in-the-cloud> where a user is doing something very similar to what I had in mind, and another user responds:

> "For each partition of your PCollection the calls to Datastore are going to be single-threaded, hence incur a lot of latency."

Is this something I should be worried about, and if so, does anyone know of a better way? The same user's second suggestion is to read all the ids from Datastore and join with a CoGroupByKey, but I don't think that approach would support streaming mode.

I hope somebody here has experience with similar patterns, and I'd greatly appreciate any tips you could share!
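For context, one pattern I've been considering to reduce the per-element latency is to buffer ids inside the DoFn and issue one batched Datastore lookup per group (e.g. buffering in processElement and flushing in finishBundle), since a single Datastore lookup call can fetch multiple keys. Below is a minimal, Beam-independent sketch of just that buffering logic; the `batchLookup` function is a hypothetical stand-in for the actual Datastore client call, and the class/method names are my own invention, not Beam or Datastore API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Sketch of the buffering pattern a batching DoFn could use: collect ids
// as elements arrive and flush them in batches, amortizing the round-trip
// latency of each external call. The Datastore client is stubbed out via
// a pluggable function mapping a batch of ids -> the subset already seen.
public class BatchedIdLookup {
    private final int maxBatchSize;
    private final Function<List<String>, Set<String>> batchLookup; // hypothetical backend call
    private final List<String> buffer = new ArrayList<>();

    public BatchedIdLookup(int maxBatchSize,
                           Function<List<String>, Set<String>> batchLookup) {
        this.maxBatchSize = maxBatchSize;
        this.batchLookup = batchLookup;
    }

    // Called once per element (cf. DoFn.processElement): buffer the id,
    // and return lookup results only when a full batch has accumulated.
    public Set<String> add(String id) {
        buffer.add(id);
        if (buffer.size() >= maxBatchSize) {
            return flush();
        }
        return Set.of();
    }

    // Called when the bundle ends (cf. DoFn.finishBundle) to drain any
    // partially filled batch.
    public Set<String> flush() {
        if (buffer.isEmpty()) {
            return Set.of();
        }
        Set<String> seen = batchLookup.apply(new ArrayList<>(buffer));
        buffer.clear();
        return seen;
    }
}
```

In a real DoFn you would emit results via the output receiver rather than return values, and bundle sizes in streaming mode can be small, so the batching win there is less clear to me. Does anyone have experience with this trade-off?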
Regards, Lars
