If the intersection is really big, would join be better? Agreed on "null" vs. None - but how frequent is this in the current codebase?
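
For reference, a minimal sketch of the cogroup-plus-filter version Matei describes below. The helper name and the null payload are mine, and this is untested against the current API, so treat it as illustrative only:

import org.apache.spark.SparkContext._   // pair-RDD operations via implicits
import org.apache.spark.rdd.RDD

// Hypothetical helper: intersection via cogroup + filter.
// cogroup yields (key, (values from a, values from b)); a key is in the
// intersection iff both value sequences are non-empty. Unlike join, this
// never expands the cross product of duplicate elements per key.
def intersect(a: RDD[String], b: RDD[String]): RDD[String] =
  a.map(v => (v, null))
    .cogroup(b.map(v => (v, null)))
    .filter { case (_, (inA, inB)) => inA.nonEmpty && inB.nonEmpty }
    .keys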
> On Jan 23, 2014, at 10:38 PM, Matei Zaharia <[email protected]> wrote:
>
> You’d have to add a filter after the cogroup too. Cogroup gives you (key,
> (list of values in RDD 1, list in RDD 2)).
>
> Also one small thing, instead of setting the value to None, it may be cheaper
> to use null.
>
> Matei
>
>> On Jan 23, 2014, at 10:30 PM, Andrew Ash <[email protected]> wrote:
>>
>> You mean cogroup like this?
>>
>> A.map(v => (v,None)).cogroup(B.map(v => (v,None))).keys
>>
>> If so I might send a PR to start code review for getting this into master.
>>
>> Good to know about the strategy for sharding RDDs and for the core
>> operations.
>>
>> Thanks!
>> Andrew
>>
>>> On Thu, Jan 23, 2014 at 11:17 PM, Matei Zaharia <[email protected]> wrote:
>>>
>>> Using cogroup would probably be slightly more efficient than join because
>>> you don’t have to generate every pair of keys for elements that occur in
>>> each dataset multiple times.
>>>
>>> We haven’t tried to explicitly separate the API between “core” methods and
>>> others, but in practice, everything can be built on mapPartitions and
>>> cogroup for transformations, and SparkContext.runJob (an internal method)
>>> for actions. What really matters is the level at which the code sees
>>> dependencies in the DAGScheduler, which is done through the Dependency
>>> class. There are only two types of dependencies (narrow and shuffle), which
>>> correspond to those operations above. So in a sense there is this
>>> separation at the lowest level. But for the levels above, the goal was
>>> first and foremost to make the API as usable as possible, which meant
>>> giving people quick access to all the operations that might be useful, and
>>> dealing with how we’ll implement those later. Over time it will be possible
>>> to divide things like RDD.scala into multiple traits if they become
>>> unwieldy.
>>>
>>> Matei
>>>
>>>> On Jan 23, 2014, at 9:40 PM, Andrew Ash <[email protected]> wrote:
>>>>
>>>> And I think the followup to Ian's question:
>>>>
>>>> Is there a way to implement .intersect() in the core API that's more
>>>> efficient than the .join() method Evan suggested?
>>>>
>>>> Andrew
>>>>
>>>>> On Thu, Jan 23, 2014 at 10:26 PM, Ian O'Connell <[email protected]> wrote:
>>>>>
>>>>> Is there any separation in the API between functions that can be built
>>>>> solely on the existing exposed public API and ones which require access
>>>>> to internals?
>>>>>
>>>>> Just to maybe avoid bloat from composite functions like this that exist
>>>>> for user convenience?
>>>>>
>>>>> (A la something like Lua's aux API vs. core API?)
>>>>>
>>>>>> On Thu, Jan 23, 2014 at 8:33 PM, Matei Zaharia <[email protected]> wrote:
>>>>>>
>>>>>> I’d be happy to see this added to the core API.
>>>>>>
>>>>>> Matei
>>>>>>
>>>>>>> On Jan 23, 2014, at 5:39 PM, Andrew Ash <[email protected]> wrote:
>>>>>>>
>>>>>>> Ah right of course -- perils of typing code without running it!
>>>>>>>
>>>>>>> It feels like this is a pretty core operation that should be added to
>>>>>>> the main RDD API. Do other people not run into this often?
>>>>>>>
>>>>>>> When I'm validating a foreign key join in my cluster, I often check
>>>>>>> that the foreign keys land on valid values in the referenced table,
>>>>>>> and the way I do that is by checking what percentage of the references
>>>>>>> actually land.
>>>>>>>
>>>>>>>> On Thu, Jan 23, 2014 at 6:36 PM, Evan R. Sparks <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Yup (well, with _._1 at the end!)
>>>>>>>>
>>>>>>>>> On Thu, Jan 23, 2014 at 5:28 PM, Andrew Ash <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> You're thinking like this?
>>>>>>>>>
>>>>>>>>> A.map(v => (v,None)).join(B.map(v => (v,None))).map(_._2)
>>>>>>>>>
>>>>>>>>>> On Thu, Jan 23, 2014 at 6:26 PM, Evan R. Sparks <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> You could map each to an RDD[(String,None)] and do a join.
>>>>>>>>>>
>>>>>>>>>>> On Thu, Jan 23, 2014 at 5:18 PM, Andrew Ash <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Spark users,
>>>>>>>>>>>
>>>>>>>>>>> I recently wanted to calculate the set intersection of two RDDs of
>>>>>>>>>>> Strings. I couldn't find a .intersection() method in the
>>>>>>>>>>> autocomplete or in the Scala API docs, so I used a little set
>>>>>>>>>>> theory to end up with this:
>>>>>>>>>>>
>>>>>>>>>>> lazy val A = ...
>>>>>>>>>>> lazy val B = ...
>>>>>>>>>>> A.union(B).subtract(A.subtract(B)).subtract(B.subtract(A))
>>>>>>>>>>>
>>>>>>>>>>> Which feels very cumbersome.
>>>>>>>>>>>
>>>>>>>>>>> Does anyone have a more idiomatic way to calculate intersection?
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>> Andrew
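
For completeness, the join-based version with Evan's fix applied (_._1 rather than _._2) would look roughly like this. A self-contained sketch, untested; the local SparkContext setup is only for illustration:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._   // pair-RDD operations

val sc = new SparkContext("local", "intersect-example")

val A = sc.parallelize(Seq("a", "b", "c"))
val B = sc.parallelize(Seq("b", "c", "d"))

// join keeps only keys present in both sides; map(_._1) recovers the elements.
val intersection = A.map(v => (v, None)).join(B.map(v => (v, None))).map(_._1)

println(intersection.collect().mkString(", "))   // b, c

// Caveat: duplicate elements in A or B multiply through the join, so add
// .distinct() for true set semantics. This blow-up is exactly what the
// cogroup-based version avoids, per Matei's comment above.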
