Ahh yes reading the whole doc would help. Thanks!

On Wed Feb 18 2015 at 10:38:56 PM David Ortiz <[email protected]> wrote:
> You most definitely want Set.difference(setA, setB);
>
> -------- Original message --------
> From: Bryan Baugher
> Date: 02/18/2015 11:07 PM (GMT-05:00)
> To: [email protected]
> Subject: Re: Joins and null values
>
> Hmm, I'm trying to get the elements of set A which are not in set B.
> Set#comm(..) could work but seems like the wrong choice. I'm currently
> doing a left outer join and then filtering to the results with only left
> side values. Does that seem like the best choice or are there more gems
> hidden in the crunch library?
>
> On Wed Feb 18 2015 at 4:55:29 PM Josh Wills <[email protected]> wrote:
>
>> If I got that right, then I think o.a.c.lib.Set does what you want. LMK.
>>
>> On Wed, Feb 18, 2015 at 2:53 PM, Josh Wills <[email protected]> wrote:
>>
>>> Oh, I'm dumb-- you mean you want like a left-join like thing where you
>>> can find all values in collection A that aren't in collection B, etc., etc.?
>>>
>>> J
>>>
>>> On Wed, Feb 18, 2015 at 2:43 PM, Josh Wills <[email protected]> wrote:
>>>
>>>> Different from o.a.c.lib.Cartesian.cross(PCollection<U> left,
>>>> PCollection<T> right, int parallelism) in some way?
>>>>
>>>> J
>>>>
>>>> On Wed, Feb 18, 2015 at 2:41 PM, Bryan Baugher <[email protected]>
>>>> wrote:
>>>>
>>>>> Maybe,
>>>>>
>>>>> PCollection<T>#join(PCollection<T>, JoinType) : PCollection<Pair<T, T>>
>>>>>
>>>>> You could make additional methods for the different join strategies
>>>>> or maybe an enum perhaps?
>>>>>
>>>>> On Wed Feb 18 2015 at 3:58:38 PM Josh Wills <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hey Bryan,
>>>>>>
>>>>>> I like the idea of throwing exceptions when there are null values
>>>>>> in one of the collections in a join. Not sure if there are any other
>>>>>> implications of that I should think through first.
>>>>>>
>>>>>> On the convenience methods for PCollection joins, what do you have
>>>>>> in mind?
>>>>>>
>>>>>> J
>>>>>>
>>>>>> On Wed, Feb 18, 2015 at 12:35 PM, Bryan Baugher <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi everyone,
>>>>>>>
>>>>>>> The other day I ran into the issue mentioned here[1] about joining
>>>>>>> data with null values. This took awhile to figure out until I broke down
>>>>>>> and went to look at the docs to see if I was doing something obviously
>>>>>>> wrong. I used null values because I'm basically wanting to join two
>>>>>>> pcollections.
>>>>>>>
>>>>>>> Can crunch either throw an exception or log errors if I do
>>>>>>> something like this? Similarly would it be possible to get convenience
>>>>>>> methods for doing joins on PCollections?
>>>>>>>
>>>>>>> [1] - http://crunch.apache.org/user-guide.html#joins
>>>>>>>
>>>>>>
>>>>>> --
>>>>>> Director of Data Science
>>>>>> Cloudera <http://www.cloudera.com>
>>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
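
For readers following the thread: the two approaches discussed above (a left outer join filtered down to rows with no right-side match, versus a direct set difference) should give the same elements of A that are not in B. The sketch below illustrates that equivalence with plain java.util collections rather than Crunch. The class and method names (SetDifferenceSketch, leftOuterJoin, differenceViaJoin, difference) are hypothetical and are not part of the Crunch API, and the code assumes the elements themselves are never null, because null is used here to mark a missing right-hand match.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration only: in-memory collections standing in for PCollections.
public class SetDifferenceSketch {

    // Left outer join of A against B, keyed on the element itself.
    // Each distinct element of A is paired with itself when B contains it,
    // or with null when B does not (assumes A holds no null elements).
    static <T> List<Map.Entry<T, T>> leftOuterJoin(Collection<T> a, Collection<T> b) {
        Set<T> right = new HashSet<>(b);
        List<Map.Entry<T, T>> joined = new ArrayList<>();
        for (T left : new LinkedHashSet<>(a)) {
            joined.add(new SimpleEntry<>(left, right.contains(left) ? left : null));
        }
        return joined;
    }

    // The approach described in the thread: join, then keep only the rows
    // that have a left-side value and no right-side value.
    static <T> List<T> differenceViaJoin(Collection<T> a, Collection<T> b) {
        List<T> result = new ArrayList<>();
        for (Map.Entry<T, T> row : leftOuterJoin(a, b)) {
            if (row.getValue() == null) {
                result.add(row.getKey());
            }
        }
        return result;
    }

    // Direct set difference: distinct elements of A that are not in B,
    // in first-seen order.
    static <T> List<T> difference(Collection<T> a, Collection<T> b) {
        Set<T> result = new LinkedHashSet<>(a);
        result.removeAll(new HashSet<>(b));
        return new ArrayList<>(result);
    }

    public static void main(String[] args) {
        List<String> setA = Arrays.asList("a", "b", "c", "b");
        List<String> setB = Arrays.asList("b", "d");
        System.out.println(differenceViaJoin(setA, setB));
        System.out.println(difference(setA, setB));
    }
}
```

Both methods return the same list for the same inputs; the direct version just skips building the intermediate joined rows, which is the appeal of the Set.difference suggestion in the thread.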
