Hi Anton,

Thanks for the suggestions. I'll try both.

Thanks!
Joe

On Thu, 6 Dec 2018, 17:55 Anton Kedin <[email protected]> wrote:

> Your approach with side-inputs sounds reasonable. I don't immediately see
> a simpler solution for this, as you need to depend on the runtime property
> of the input PCollection elements to produce the outputs of the join.
>
> I would probably also look into whether you actually need the library
> implementation of leftOuterJoin or if it will be simpler for you to
> re-implement it. It's just a ParDo on top of CoGBK [1], and you will need
> to change that ParDo portion anyway.
>
> [1]
> https://github.com/apache/beam/blob/e2583f5e73de50f8af128ecaa331a2e1046d2b08/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java#L101
>
> On Thu, Dec 6, 2018 at 3:32 AM Joe Cullen <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I am using a left join to join two collections:
>>
>> PCollection<KV<String, Map<String, String>>> p1 = ...
>> PCollection<KV<String, Map<String, String>>> p2 = ...
>>
>> Join.leftOuterJoin(p1, p2, Collections.emptyMap())
>>
>> My question is: how do we provide a null value which matches the schema
>> of the p2 PCollection if we don't know the schema?
>>
>> If the Map in p1 consists of some keys including "a", and the Map in p2
>> contains key "a" as well as some other keys (which are not in p1), what is
>> the best way to ensure the null value contains all the keys from the Map in
>> p2 (with empty String values for those keys)? My current idea is to pull
>> the keys from p2 before the join and use as a side input in a step after
>> the join, but I'm not sure if there is a simpler option.
>>
>> Thanks,
>> Joe
>>
>
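
For illustration, the per-key logic discussed in the thread could look
roughly like the sketch below. This is plain Java, not Beam code, and the
class and method names (LeftJoinDefaults, collectKeys, emptyRow, joinOneKey)
are hypothetical. It assumes that in a real pipeline the set of p2 keys
would arrive as a side input, and that the left and right values for one
join key would come from the CoGroupByKey result inside the ParDo.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class LeftJoinDefaults {

  // Collects every key seen in the right-hand (p2) maps. In a pipeline this
  // is the information the side input would carry (assumption).
  static Set<String> collectKeys(Iterable<Map<String, String>> rightMaps) {
    Set<String> keys = new LinkedHashSet<>();
    for (Map<String, String> m : rightMaps) {
      keys.addAll(m.keySet());
    }
    return keys;
  }

  // Builds the "null value": every right-hand key mapped to an empty string.
  static Map<String, String> emptyRow(Set<String> rightKeys) {
    Map<String, String> row = new LinkedHashMap<>();
    for (String k : rightKeys) {
      row.put(k, "");
    }
    return row;
  }

  // Left-outer-join logic for a single join key: each left value is paired
  // with every right value, or with the empty row when there is no match.
  static List<Map.Entry<Map<String, String>, Map<String, String>>> joinOneKey(
      Iterable<Map<String, String>> lefts,
      Iterable<Map<String, String>> rights,
      Set<String> rightKeys) {
    List<Map.Entry<Map<String, String>, Map<String, String>>> out = new ArrayList<>();
    Map<String, String> nullValue = emptyRow(rightKeys);
    for (Map<String, String> left : lefts) {
      boolean matched = false;
      for (Map<String, String> right : rights) {
        matched = true;
        out.add(new SimpleImmutableEntry<>(left, right));
      }
      if (!matched) {
        out.add(new SimpleImmutableEntry<>(left, nullValue));
      }
    }
    return out;
  }
}
```

The point of the sketch is that the default value is built at processing
time from the observed p2 keys, which a fixed Collections.emptyMap() passed
at pipeline construction time cannot do.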
