Hi Shivam,

When you say "merge the PCollections", do you mean Flatten, or some kind of
join? CoGroupByKey[1] would be a good choice if you need to join by key.
You could then implement application logic that keeps one of the two
records, provided you can tell an element of CollectionA from an element of
CollectionB by examining the elements alone.

If there is no natural way to tell which element to keep by examining the
elements themselves, you could nest the data in a further KV. For example,
if CollectionA holds data like KV<k1, v1> and CollectionB holds KV<k1, v2>,
you could transform these into something like KV<k1, KV<"COLLECTION_A",
v1>> and KV<k1, KV<"COLLECTION_B", v2>>. Then, when you CoGroupByKey, these
elements would be grouped because both have key k1, and the source
PCollection could be identified from the key of the inner KV.
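
To make the tagging idea concrete, here is a rough sketch of the logic in
plain Java, not Beam code. It uses java.util.Map.Entry as a stand-in for
Beam's KV and in-memory lists as stand-ins for the PCollections, so it runs
without the Beam dependency. The class and method names (PreferCollectionA,
tag, pick, merge) are made up for illustration. In a real pipeline the
tagging and picking steps would presumably live in a MapElements or ParDo,
and the grouping would be done by Beam itself.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: models "keep A, drop B on duplicate keys" outside Beam.
final class PreferCollectionA {
    static final String TAG_A = "COLLECTION_A";
    static final String TAG_B = "COLLECTION_B";

    /** Turns each (k, v) into (k, (tag, v)) so the origin survives grouping. */
    static List<Map.Entry<String, Map.Entry<String, byte[]>>> tag(
            String tag, List<Map.Entry<String, byte[]>> input) {
        List<Map.Entry<String, Map.Entry<String, byte[]>>> out = new ArrayList<>();
        for (Map.Entry<String, byte[]> e : input) {
            Map.Entry<String, byte[]> inner = new SimpleImmutableEntry<>(tag, e.getValue());
            out.add(new SimpleImmutableEntry<>(e.getKey(), inner));
        }
        return out;
    }

    /** Returns the COLLECTION_A value if the group has one, else the first other value. */
    static byte[] pick(List<Map.Entry<String, byte[]>> group) {
        byte[] fallback = null;
        for (Map.Entry<String, byte[]> tagged : group) {
            if (TAG_A.equals(tagged.getKey())) {
                return tagged.getValue();
            }
            if (fallback == null) {
                fallback = tagged.getValue();
            }
        }
        return fallback;
    }

    /** Tags both inputs, groups by the outer key, and keeps one value per key. */
    static Map<String, byte[]> merge(
            List<Map.Entry<String, byte[]>> a, List<Map.Entry<String, byte[]>> b) {
        List<Map.Entry<String, Map.Entry<String, byte[]>>> all = new ArrayList<>(tag(TAG_A, a));
        all.addAll(tag(TAG_B, b));

        // Stand-in for the grouping step: collect the tagged values per outer key.
        Map<String, List<Map.Entry<String, byte[]>>> grouped = new LinkedHashMap<>();
        for (Map.Entry<String, Map.Entry<String, byte[]>> e : all) {
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());
        }

        Map<String, byte[]> result = new LinkedHashMap<>();
        for (Map.Entry<String, List<Map.Entry<String, byte[]>>> g : grouped.entrySet()) {
            result.put(g.getKey(), pick(g.getValue()));
        }
        return result;
    }
}
```

Calling PreferCollectionA.merge(a, b) with a key present in both lists keeps
the value from a; keys that only appear in b keep their value from b. Note
that if a key occurs more than once within CollectionA itself, this sketch
just keeps the first occurrence it sees, which may or may not be what you
want.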

Thanks,
Evan

[1]
https://beam.apache.org/documentation/transforms/java/aggregation/cogroupbykey/

On Wed, Aug 10, 2022 at 3:25 PM Shivam Singhal <[email protected]>
wrote:

> I have two PCollections, CollectionA & CollectionB of type KV<String,
> Byte[]>.
>
>
> I would like to merge them into one PCollection but CollectionA &
> CollectionB might have some elements with the same key. In those repeated
> cases, I would like to keep the element from CollectionA & drop the
> repeated element from CollectionB.
>
> Does anyone know a simple method to do this?
>
> Thanks,
> Shivam Singhal
>
