Use a stateful DoFn and buffer the elements in a bag state. You'll want to
use a key that contains enough data to satisfy the join condition you are
trying to match. For example, if you're trying to match on a customerId,
then you would do something like:
element 1 -> ParDo(extract customer id) -> KV<customer id, element 1> ->
stateful ParDo(buffer element 1 in bag state)
...
element 5 -> ParDo(extract customer id) -> KV<customer id, element 5> ->
stateful ParDo(output all elements in bag)

If you are matching on customerId and eventId, then you would use a
composite key (customerId, eventId).

You can always use a single global key, but you will lose all parallelism
during processing (for small pipelines this likely won't matter).
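Here's a minimal sketch of that buffering pattern in plain Python, using a
dict of lists as a stand-in for Beam's per-key BagState (the element shape,
the customerId field, and the emit-after-5 threshold are illustrative
assumptions, not real Beam API; in an actual pipeline you'd declare a
BagStateSpec inside a stateful DoFn):

```python
from collections import defaultdict

# Stand-in for Beam's per-key bag state: one buffer (bag) per key.
bags = defaultdict(list)

def extract_key(element):
    # ParDo(extract customer id): element -> KV<customer id, element>
    return element["customerId"], element

def stateful_pardo(key, value, emit_after=5):
    # Buffer the element in this key's bag; once enough elements have
    # arrived for the key, output everything in the bag and clear it.
    bag = bags[key]
    bag.append(value)
    if len(bag) >= emit_after:
        out = list(bag)
        bag.clear()
        return out
    return []

# Drive five elements through, mirroring the element 1 .. element 5 flow.
elements = [{"customerId": "c1", "value": i} for i in range(5)]
for e in elements:
    k, v = extract_key(e)
    matched = stateful_pardo(k, v)
    if matched:
        print(f"key={k} emits {len(matched)} buffered elements")
```

For the composite-key case, extract_key would just return a tuple such as
(element["customerId"], element["eventId"]) instead of the single field.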

On Fri, Jun 26, 2020 at 7:29 AM Praveen K Viswanathan <
harish.prav...@gmail.com> wrote:

> Hi All - I have a DoFn which generates data (KV pair) for each element
> that it is processing. It also has to read from that KV for other elements
> based on a key which means, the KV has to retain all the data that's
> getting added to it while processing every element. I was thinking
> about the "slow-caching side input pattern" but it is more of caching
> outside the DoFn and then using it inside. It doesn't update the cache
> inside a DoFn. Please share if anyone has thoughts on how to approach this
> case.
>
> Element 1 > Add a record to a KV > ..... Element 5 > Use the value from
> the KV if there is a match on the key
>
> --
> Thanks,
> Praveen K Viswanathan
>
