Your approach with side inputs sounds reasonable. I don't immediately see a simpler solution, since the outputs of the join depend on a runtime property of the elements in the input PCollection.
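For what it's worth, the per-element logic of that post-join step could look roughly like the sketch below. It is plain Java with no Beam dependency, and the names (NullValueFill, collectKeys, fillMissing) are made up for illustration. I'm assuming the key set would be computed from p2 and handed to a ParDo after the join as a side input, with fillMissing applied to the right-hand value there.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

// Hypothetical helper: not part of Beam or the join library.
class NullValueFill {

    /** Collects every key that appears in any right-hand (p2) map. */
    static Set<String> collectKeys(List<Map<String, String>> rightMaps) {
        Set<String> keys = new TreeSet<>();
        for (Map<String, String> m : rightMaps) {
            keys.addAll(m.keySet());
        }
        return keys;
    }

    /**
     * Returns a copy of the right-hand value in which every key from allKeys
     * is present. Keys missing from the input get an empty String, so the
     * empty map used as the join's null value becomes a full "null row".
     */
    static Map<String, String> fillMissing(Map<String, String> right, Set<String> allKeys) {
        Map<String, String> filled = new TreeMap<>();
        for (String k : allKeys) {
            filled.put(k, "");
        }
        filled.putAll(right); // real values win over the "" placeholders
        return filled;
    }

    public static void main(String[] args) {
        Set<String> keys = collectKeys(List.of(
                Map.of("a", "1", "b", "2"),
                Map.of("a", "3", "c", "4")));
        // An unmatched left element carries the empty map after the join.
        System.out.println(fillMissing(Map.of(), keys));
    }
}
```

Whether the key set is small enough to fit in a side input depends on your data, so treat this as a starting point only.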
I would probably also look into whether you actually need the library implementation of leftOuterJoin or if it will be simpler for you to re-implement it. It's just a ParDo on top of CoGBK [1], and you will need to change that ParDo portion anyway.

[1] https://github.com/apache/beam/blob/e2583f5e73de50f8af128ecaa331a2e1046d2b08/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java#L101

On Thu, Dec 6, 2018 at 3:32 AM Joe Cullen <[email protected]> wrote:

> Hi all,
>
> I am using a left join to join two collections:
>
> PCollection<KV<String, Map<String, String>>> p1 = ...
> PCollection<KV<String, Map<String, String>>> p2 = ...
>
> Join.leftOuterJoin(p1, p2, Collections.emptyMap())
>
> My question is: how do we provide a null value which matches the schema of
> the p2 PCollection if we don't know the schema?
>
> If the Map in p1 consists of some keys including "a", and the Map in p2
> contains key "a" as well as some other keys (which are not in p1), what is
> the best way to ensure the null value contains all the keys from the Map in
> p2 (with empty String values for those keys)? My current idea is to pull
> the keys from p2 before the join and use as a side input in a step after
> the join, but I'm not sure if there is a simpler option.
>
> Thanks,
> Joe
