You most definitely want Set.difference(setA, setB).


-------- Original message --------
From: Bryan Baugher
Date: 02/18/2015 11:07 PM (GMT-05:00)
To: [email protected]
Subject: Re: Joins and null values

Hmm, I'm trying to get the elements of set A which are not in set B.
Set#comm(..) could work but seems like the wrong choice. I'm currently doing a
left outer join and then filtering down to the results that have only left-side
values. Does that seem like the best choice, or are there more gems hidden in the
Crunch library?
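
As a rough illustration of the "left outer join, then keep only the rows with no right-side match" approach described above, here is a minimal sketch on plain Java lists. It does not use Crunch at all; AntiJoinSketch, Row, leftOuterJoin and onlyInLeft are hypothetical names made up for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Plain-Java stand-in for the approach in the message; not Crunch code.
// Every name in this class is hypothetical.
final class AntiJoinSketch {

    // One joined row: the left value plus its match on the right, or null if none.
    static final class Row<T> {
        final T left;
        final T right;

        Row(T left, T right) {
            this.left = left;
            this.right = right;
        }
    }

    // Left outer join of two lists on element equality.
    static <T> List<Row<T>> leftOuterJoin(List<T> a, List<T> b) {
        Set<T> rightValues = new HashSet<>(b);
        List<Row<T>> rows = new ArrayList<>();
        for (T item : a) {
            rows.add(new Row<T>(item, rightValues.contains(item) ? item : null));
        }
        return rows;
    }

    // Keep only rows without a right-side match: the elements of A not in B.
    static <T> List<T> onlyInLeft(List<T> a, List<T> b) {
        List<T> result = new ArrayList<>();
        for (Row<T> row : leftOuterJoin(a, b)) {
            if (row.right == null) {
                result.add(row.left);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [a, c]
        System.out.println(onlyInLeft(Arrays.asList("a", "b", "c"), Arrays.asList("b", "d")));
    }
}
```

Because this sketch uses null as its "no match" marker, a null element in A would be reported as unmatched even if B also contains null, which is the same kind of ambiguity the thread is about.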

On Wed Feb 18 2015 at 4:55:29 PM Josh Wills wrote:
If I got that right, then I think o.a.c.lib.Set does what you want. LMK.

On Wed, Feb 18, 2015 at 2:53 PM, Josh Wills wrote:
Oh, I'm dumb-- you mean you want like a left-join like thing where you can find 
all values in collection A that aren't in collection B, etc., etc.?

J

On Wed, Feb 18, 2015 at 2:43 PM, Josh Wills wrote:
Different from o.a.c.lib.Cartesian.cross(PCollection<U> left, PCollection<T> 
right, int parallelism) in some way?

J

On Wed, Feb 18, 2015 at 2:41 PM, Bryan Baugher wrote:

Maybe,

PCollection<T>#join(PCollection<T>, JoinType) : PCollection<Pair<T, T>>

You could add separate methods for the different join strategies, or perhaps
take an enum?
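
To make the enum variant of this proposal concrete, here is a small sketch of what such a method could look like on plain Java lists. It is only an illustration of the proposed shape, not Crunch code: ElementJoinSketch, its JoinType enum and the join method are all hypothetical, and Map.Entry stands in for a pair type. Each input is treated as a set of distinct values, and a missing side is represented as null.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of an enum-driven join on element equality; not Crunch code.
final class ElementJoinSketch {

    // Hypothetical enum selecting the join strategy.
    enum JoinType { INNER, LEFT_OUTER, RIGHT_OUTER, FULL_OUTER }

    // Joins two lists on element equality. Duplicates are collapsed, and a
    // side with no match is returned as null in the resulting entry.
    static <T> List<Map.Entry<T, T>> join(List<T> left, List<T> right, JoinType type) {
        Set<T> leftSet = new LinkedHashSet<>(left);
        Set<T> rightSet = new LinkedHashSet<>(right);
        List<Map.Entry<T, T>> out = new ArrayList<>();

        for (T l : leftSet) {
            if (rightSet.contains(l)) {
                // Present on both sides: emitted for every join type.
                out.add(new SimpleImmutableEntry<T, T>(l, l));
            } else if (type == JoinType.LEFT_OUTER || type == JoinType.FULL_OUTER) {
                out.add(new SimpleImmutableEntry<T, T>(l, null));
            }
        }
        if (type == JoinType.RIGHT_OUTER || type == JoinType.FULL_OUTER) {
            for (T r : rightSet) {
                if (!leftSet.contains(r)) {
                    out.add(new SimpleImmutableEntry<T, T>(null, r));
                }
            }
        }
        return out;
    }
}
```

A single method with an enum keeps the surface small; separate methods per strategy would read more naturally at call sites but add more API to maintain.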

On Wed Feb 18 2015 at 3:58:38 PM Josh Wills wrote:
Hey Bryan,

I like the idea of throwing exceptions when there are null values in one of the 
collections in a join. Not sure if there are any other implications of that I 
should think through first.
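
A fail-fast check of the kind discussed here could look roughly like the sketch below. It works on a plain Java list and is only meant to show the idea; NullCheckSketch and requireNoNulls are hypothetical names, and where such a check would actually live inside Crunch is not something this thread settles.

```java
import java.util.List;

// Hypothetical fail-fast null check for join inputs; not Crunch code.
final class NullCheckSketch {

    // Throws if any element is null; `side` only labels the error message.
    static <T> List<T> requireNoNulls(List<T> values, String side) {
        for (int i = 0; i < values.size(); i++) {
            if (values.get(i) == null) {
                throw new IllegalArgumentException(
                        "null value at index " + i + " on the " + side + " side of the join");
            }
        }
        return values;
    }
}
```

Throwing makes the problem visible at the point where the bad record appears, instead of surfacing later as silently missing join output.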

On the convenience methods for PCollection joins, what do you have in mind?

J


On Wed, Feb 18, 2015 at 12:35 PM, Bryan Baugher wrote:
Hi everyone,

The other day I ran into the issue mentioned here[1] about joining data with
null values. It took a while to figure out, until I broke down and checked the
docs to see if I was doing something obviously wrong. I used null values
because I basically want to join two PCollections.

Can Crunch either throw an exception or log errors if I do something like this?
Similarly, would it be possible to get convenience methods for doing joins on
PCollections?

[1] - http://crunch.apache.org/user-guide.html#joins



--
Director of Data Science
Cloudera<http://www.cloudera.com>
Twitter: @josh_wills<http://twitter.com/josh_wills>


