Hi everyone,
I'm researching how to handle a use case in Beam that I imagine is common, but for which I haven't been able to find an agreed-upon best practice yet.

*Use case:* I'm processing a stream or batch of records with ids, and for each record I want to check whether I've ever seen its id before (beyond the scope of the job execution). In particular, I'm going to be using Google Dataflow, and I plan to store and look up ids in Google Datastore.

*Question:* Is it advisable to look up the record id in Datastore per element in a DoFn? I am most worried about latency, and I am wary of the recommendation in the documentation for ParDo <https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/ParDo.html>, which says I'd have to be careful when I write to Datastore:

> "...if a DoFn's <https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/DoFn.html> execution has external side-effects, such as performing updates to external HTTP services, then the DoFn's <https://beam.apache.org/documentation/sdks/javadoc/2.0.0/org/apache/beam/sdk/transforms/DoFn.html> code needs to take care to ensure that those updates are idempotent and that concurrent updates are acceptable."

I found a relevant question on StackOverflow <https://stackoverflow.com/questions/40049621/datastore-queries-in-dataflow-dofn-slow-down-pipeline-when-run-in-the-cloud> where a user is doing something very similar to what I had in mind, and another user responds:

> "For each partition of your PCollection the calls to Datastore are going to be single-threaded, hence incur a lot of latency."

Is this something I should be worried about, and if so, does anyone know of a better way? The same user's second suggestion is to read all the ids from Datastore and join with a CoGroupByKey, but I don't think that approach would support streaming mode.

I hope somebody here has experience with similar patterns, and I'd greatly appreciate any tips you could share!
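For context, one pattern I've been considering to reduce the per-element latency is to buffer ids inside the DoFn and issue one batched Datastore lookup per group (e.g. buffering in processElement and flushing in finishBundle), since a single Datastore lookup call can fetch multiple keys. Below is a minimal, Beam-independent sketch of just that buffering logic; the `batchLookup` function is a hypothetical stand-in for the actual Datastore client call, and the class/method names are my own invention, not Beam or Datastore API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Sketch of the buffering pattern a batching DoFn could use: collect ids
// as elements arrive and flush them in batches, amortizing the round-trip
// latency of each external call. The Datastore client is stubbed out via
// a pluggable function mapping a batch of ids -> the subset already seen.
public class BatchedIdLookup {
    private final int maxBatchSize;
    private final Function<List<String>, Set<String>> batchLookup; // hypothetical backend call
    private final List<String> buffer = new ArrayList<>();

    public BatchedIdLookup(int maxBatchSize,
                           Function<List<String>, Set<String>> batchLookup) {
        this.maxBatchSize = maxBatchSize;
        this.batchLookup = batchLookup;
    }

    // Called once per element (cf. DoFn.processElement): buffer the id,
    // and return lookup results only when a full batch has accumulated.
    public Set<String> add(String id) {
        buffer.add(id);
        if (buffer.size() >= maxBatchSize) {
            return flush();
        }
        return Set.of();
    }

    // Called when the bundle ends (cf. DoFn.finishBundle) to drain any
    // partially filled batch.
    public Set<String> flush() {
        if (buffer.isEmpty()) {
            return Set.of();
        }
        Set<String> seen = batchLookup.apply(new ArrayList<>(buffer));
        buffer.clear();
        return seen;
    }
}
```

In a real DoFn you would emit results via the output receiver rather than return values, and bundle sizes in streaming mode can be small, so the batching win there is less clear to me. Does anyone have experience with this trade-off?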
Regards, Lars
