Your side-input approach sounds reasonable. I don't immediately see a
simpler solution, since the join output has to depend on a runtime property
of the input PCollection elements (the set of keys present in p2).

It may also be worth checking whether you need the library implementation
of leftOuterJoin at all, or whether re-implementing it would be simpler.
It's just a ParDo on top of CoGBK [1], and you will need to change that
ParDo portion anyway.
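
For illustration, here is a rough sketch of the per-key logic such a ParDo
might run, written as plain Java with no Beam dependency so it stands on its
own. The class and method names (LeftJoinNullFill, emptyRow, joinOneKey) are
hypothetical, and rightKeys is assumed to be the set of p2 map keys that you
would supply as a side input. In a real pipeline the two Iterables would come
from the CoGbkResult for each key; this is not the library's code.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: left-outer-join logic for the values grouped under one key. */
final class LeftJoinNullFill {
  private LeftJoinNullFill() {}

  /** Builds the stand-in for a missing right-side value: every known p2 key mapped to "". */
  static Map<String, String> emptyRow(Set<String> rightKeys) {
    Map<String, String> row = new LinkedHashMap<>();
    for (String key : rightKeys) {
      row.put(key, "");
    }
    return row;
  }

  /**
   * Pairs each left value with every right value for the same key. If there is no
   * right value, pairs the left value with an empty row built from rightKeys
   * (assumed to arrive via a side input).
   */
  static List<Map.Entry<Map<String, String>, Map<String, String>>> joinOneKey(
      Iterable<Map<String, String>> lefts,
      Iterable<Map<String, String>> rights,
      Set<String> rightKeys) {
    List<Map.Entry<Map<String, String>, Map<String, String>>> out = new ArrayList<>();
    for (Map<String, String> left : lefts) {
      boolean matched = false;
      for (Map<String, String> right : rights) {
        matched = true;
        out.add(new SimpleImmutableEntry<>(left, right));
      }
      if (!matched) {
        // No right-side row for this key: emit the schema-shaped placeholder.
        out.add(new SimpleImmutableEntry<>(left, emptyRow(rightKeys)));
      }
    }
    return out;
  }
}
```

Because the placeholder is built inside the join step from the side input, a
separate fix-up step after the join should not be needed.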

[1]
https://github.com/apache/beam/blob/e2583f5e73de50f8af128ecaa331a2e1046d2b08/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java#L101

On Thu, Dec 6, 2018 at 3:32 AM Joe Cullen <[email protected]>
wrote:

> Hi all,
>
> I am using a left join to join two collections:
>
> PCollection<KV<String, Map<String, String>>> p1 = ...
> PCollection<KV<String, Map<String, String>>> p2 = ...
>
> Join.leftOuterJoin(p1, p2, Collections.emptyMap())
>
> My question is: how do we provide a null value which matches the schema of
> the p2 PCollection if we don't know the schema?
>
> If the Map in p1 consists of some keys including "a", and the Map in p2
> contains key "a" as well as some other keys (which are not in p1), what is
> the best way to ensure the null value contains all the keys from the Map in
> p2 (with empty String values for those keys)? My current idea is to pull
> the keys from p2 before the join and use as a side input in a step after
> the join, but I'm not sure if there is a simpler option.
>
> Thanks,
> Joe
>
