If I got that right, then I think o.a.c.lib.Set does what you want. LMK.

On Wed, Feb 18, 2015 at 2:53 PM, Josh Wills <[email protected]> wrote:
> Oh, I'm dumb-- you mean you want like a left-join like thing where you can
> find all values in collection A that aren't in collection B, etc., etc.?
>
> J
>
> On Wed, Feb 18, 2015 at 2:43 PM, Josh Wills <[email protected]> wrote:
>
>> Different from o.a.c.lib.Cartesian.cross(PCollection<U> left,
>> PCollection<T> right, int parallelism) in some way?
>>
>> J
>>
>> On Wed, Feb 18, 2015 at 2:41 PM, Bryan Baugher <[email protected]> wrote:
>>
>>> Maybe,
>>>
>>> PCollection<T>#join(PCollection<T>, JoinType) : PCollection<Pair<T, T>>
>>>
>>> You could make additional methods for the different join strategies or
>>> maybe an enum perhaps?
>>>
>>> On Wed Feb 18 2015 at 3:58:38 PM Josh Wills <[email protected]> wrote:
>>>
>>>> Hey Bryan,
>>>>
>>>> I like the idea of throwing exceptions when there are null values in
>>>> one of the collections in a join. Not sure if there are any other
>>>> implications of that I should think through first.
>>>>
>>>> On the convenience methods for PCollection joins, what do you have in
>>>> mind?
>>>>
>>>> J
>>>>
>>>> On Wed, Feb 18, 2015 at 12:35 PM, Bryan Baugher <[email protected]> wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> The other day I ran into the issue mentioned here[1] about joining
>>>>> data with null values. This took a while to figure out until I broke down
>>>>> and went to look at the docs to see if I was doing something obviously
>>>>> wrong. I used null values because I'm basically wanting to join two
>>>>> PCollections.
>>>>>
>>>>> Can Crunch either throw an exception or log errors if I do something
>>>>> like this? Similarly would it be possible to get convenience methods for
>>>>> doing joins on PCollections?
>>>>>
>>>>> [1] - http://crunch.apache.org/user-guide.html#joins
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
