Hi everyone,

I'm researching how to handle a particular use case in Beam that I imagine
is common, but that I haven't been able to find any agreed upon best way of
doing yet.

*Use case: *I'm processing a stream or batch of records with ids, and for
each record I want to check whether I've ever seen its id before (beyond
the scope of the job execution). In particular, I'm going to be using
Google Dataflow, and I plan to store and look up ids in Google Datastore.

*Question*: Is it advisable to look up the record id in Datastore per
element in a DoFn? I am most worried about latency, and I am wary of the
recommendation in the documentation for ParDo
<https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/ParDo.html>
that says I'd have to be careful when I write to Datastore:

> "..if a DoFn's
<https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/DoFn.html>
execution has external side-effects, such as performing updates to external
HTTP services, then the DoFn's
<https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/DoFn.html>
code needs to take care to ensure that those updates are idempotent and
that concurrent updates are acceptable."

I found a relevant question on StackOverflow
<https://stackoverflow.com/questions/40049621/datastore-queries-in-dataflow-dofn-slow-down-pipeline-when-run-in-the-cloud>
where a user is doing something very similar to what I had in mind, and
another user says that:

> "For each partition of your PCollection the calls to Datastore are going
to be single-threaded, hence incur a lot of latency."

Is this something I should be worried about, and if so, does anyone know of
a better way? The second suggestion of the same user is to read all the ids
from Datastore and use a CoGroupByKey, I don't think that apporach that
would support streaming mode.


I hope somebody here has experience with similar patterns, and I'd greatly
appreciate any tips you could share!

Regards, Lars

Reply via email to