Hmm, I'm trying to get the elements of set A that are not in set B. Set#comm(..) could work but seems like the wrong choice. I'm currently doing a left outer join and then filtering down to the results that have only a left-side value. Does that seem like the best choice, or are there more gems hidden in the Crunch library?
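
To make the semantics concrete, here is a minimal sketch of what I mean, written against plain Java lists rather than Crunch so it runs on its own. The names LeftOnlySketch, Joined, leftOuterJoin and leftOnly are made up for illustration and are not Crunch APIs. It also assumes A contains no nulls, since null is what marks "no match on the right".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LeftOnlySketch {

    /** One row of a left outer join keyed on the element itself; right is null when B has no match. */
    static final class Joined<T> {
        final T left;
        final T right;

        Joined(T left, T right) {
            this.left = left;
            this.right = right;
        }
    }

    /**
     * In-memory stand-in for a left outer join of A against B.
     * Simplification: duplicates in B are collapsed, so each element of A yields exactly one row.
     */
    static <T> List<Joined<T>> leftOuterJoin(List<T> a, List<T> b) {
        Set<T> inB = new HashSet<>(b);
        List<Joined<T>> rows = new ArrayList<>();
        for (T x : a) {
            rows.add(new Joined<>(x, inB.contains(x) ? x : null));
        }
        return rows;
    }

    /** Keep only the rows with no right-side value, i.e. the elements of A that are not in B. */
    static <T> List<T> leftOnly(List<T> a, List<T> b) {
        List<T> out = new ArrayList<>();
        for (Joined<T> row : leftOuterJoin(a, b)) {
            if (row.right == null) {
                out.add(row.left);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [a, c]
        System.out.println(leftOnly(List.of("a", "b", "c"), List.of("b", "d")));
    }
}
```

In the real pipeline the join is a distributed left outer join and the filter is a separate step over its output, but the shape is the same.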

On Wed Feb 18 2015 at 4:55:29 PM Josh Wills <[email protected]> wrote:

> If I got that right, then I think o.a.c.lib.Set does what you want. LMK.
>
> On Wed, Feb 18, 2015 at 2:53 PM, Josh Wills <[email protected]> wrote:
>
>> Oh, I'm dumb-- you mean you want like a left-join like thing where you
>> can find all values in collection A that aren't in collection B, etc., etc.?
>>
>> J
>>
>> On Wed, Feb 18, 2015 at 2:43 PM, Josh Wills <[email protected]> wrote:
>>
>>> Different from o.a.c.lib.Cartesian.cross(PCollection<U> left,
>>> PCollection<T> right, int parallelism) in some way?
>>>
>>> J
>>>
>>> On Wed, Feb 18, 2015 at 2:41 PM, Bryan Baugher <[email protected]> wrote:
>>>
>>>> Maybe,
>>>>
>>>> PCollection<T>#join(PCollection<T>, JoinType) : PCollection<Pair<T, T>>
>>>>
>>>> You could make additional methods for the different join strategies or
>>>> maybe an enum perhaps?
>>>>
>>>> On Wed Feb 18 2015 at 3:58:38 PM Josh Wills <[email protected]>
>>>> wrote:
>>>>
>>>>> Hey Bryan,
>>>>>
>>>>> I like the idea of throwing exceptions when there are null values in
>>>>> one of the collections in a join. Not sure if there are any other
>>>>> implications of that I should think through first.
>>>>>
>>>>> On the convenience methods for PCollection joins, what do you have in
>>>>> mind?
>>>>>
>>>>> J
>>>>>
>>>>> On Wed, Feb 18, 2015 at 12:35 PM, Bryan Baugher <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>>
>>>>>> The other day I ran into the issue mentioned here[1] about joining
>>>>>> data with null values. This took awhile to figure out until I broke down
>>>>>> and went to look at the docs to see if I was doing something obviously
>>>>>> wrong. I used null values because I'm basically wanting to join two
>>>>>> pcollections.
>>>>>>
>>>>>> Can crunch either throw an exception or log errors if I do something
>>>>>> like this? Similarly would it be possible to get convenience methods for
>>>>>> doing joins on PCollections?
>>>>>>
>>>>>> [1] - http://crunch.apache.org/user-guide.html#joins
>>>>>>
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
