Hi Anton,

Thanks for the suggestions, I'll try both.
Thanks!
Joe

On Thu, 6 Dec 2018, 17:55 Anton Kedin <[email protected]> wrote:

> Your approach with side-inputs sounds reasonable. I don't immediately see
> a simpler solution for this, as you need to depend on the runtime property
> of the input PCollection elements to produce the outputs of the join.
>
> I would probably also look into whether you actually need the library
> implementation of leftOuterJoin or if it will be simpler for you to
> re-implement it. It's just a ParDo on top of CoGBK [1], and you will need
> to change that ParDo portion anyway.
>
> [1]
> https://github.com/apache/beam/blob/e2583f5e73de50f8af128ecaa331a2e1046d2b08/sdks/java/extensions/join-library/src/main/java/org/apache/beam/sdk/extensions/joinlibrary/Join.java#L101
>
> On Thu, Dec 6, 2018 at 3:32 AM Joe Cullen <[email protected]>
> wrote:
>
>> Hi all,
>>
>> I am using a left join to join two collections:
>>
>> PCollection<KV<String, Map<String, String>>> p1 = ...
>> PCollection<KV<String, Map<String, String>>> p2 = ...
>>
>> Join.leftOuterJoin(p1, p2, Collections.emptyMap())
>>
>> My question is: how do we provide a null value which matches the schema
>> of the p2 PCollection if we don't know the schema?
>>
>> If the Map in p1 consists of some keys including "a", and the Map in p2
>> contains key "a" as well as some other keys (which are not in p1), what is
>> the best way to ensure the null value contains all the keys from the Map in
>> p2 (with empty String values for those keys)? My current idea is to pull
>> the keys from p2 before the join and use as a side input in a step after
>> the join, but I'm not sure if there is a simpler option.
>>
>> Thanks,
>> Joe
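
For reference, the core of the side-input idea discussed above can be sketched in plain Java, outside of Beam. This is only a rough illustration under assumptions: `NullFill`, `collectKeys`, and `fillMissing` are hypothetical names, not part of Beam or the join library, and ordinary collections stand in for the p2 PCollection and the side input. In a real pipeline, the key set would presumably be computed from p2 and passed as a side input to a ParDo that runs after the join.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not part of Beam or its join library. */
final class NullFill {
  private NullFill() {}

  /** Union of all keys seen in the right-hand (p2) maps; this is what the side input would carry. */
  static Set<String> collectKeys(List<Map<String, String>> rightMaps) {
    Set<String> keys = new TreeSet<>();
    for (Map<String, String> m : rightMaps) {
      keys.addAll(m.keySet());
    }
    return keys;
  }

  /**
   * Returns a copy of the joined right-hand value in which every key from allKeys is present.
   * Keys missing from the input map get an empty String; existing values are kept.
   */
  static Map<String, String> fillMissing(Map<String, String> right, Set<String> allKeys) {
    Map<String, String> filled = new LinkedHashMap<>();
    for (String k : allKeys) {
      filled.put(k, "");
    }
    filled.putAll(right);
    return filled;
  }
}
```

With this shape, the join can keep using `Collections.emptyMap()` as the null value, and the step after the join calls something like `fillMissing` on the right-hand side of each result, so unmatched rows end up with all p2 keys mapped to empty Strings.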
