And I think the follow-up to Ian's question is: is there a way to implement .intersect() in the core API that's more efficient than the .join() method Evan suggested?
Andrew


On Thu, Jan 23, 2014 at 10:26 PM, Ian O'Connell <[email protected]> wrote:

> Is there any separation in the API between functions that can be built
> solely on the existing exposed public API and ones which require access to
> internals?
>
> Just to maybe avoid bloat for composite functions like this that are for
> user convenience?
>
> (Ala something like lua's aux api vs core api?)
>
>
> On Thu, Jan 23, 2014 at 8:33 PM, Matei Zaharia <[email protected]> wrote:
>
>> I’d be happy to see this added to the core API.
>>
>> Matei
>>
>> On Jan 23, 2014, at 5:39 PM, Andrew Ash <[email protected]> wrote:
>>
>> Ah right of course -- perils of typing code without running it!
>>
>> It feels like this is a pretty core operation that should be added to the
>> main RDD API. Do other people not run into this often?
>>
>> When I'm validating a foreign key join in my cluster I often check to
>> make sure that the foreign keys land on valid values on the referenced
>> table, and the way I do that is checking to see what percentage of the
>> references actually land.
>>
>>
>> On Thu, Jan 23, 2014 at 6:36 PM, Evan R. Sparks <[email protected]> wrote:
>>
>>> Yup (well, with _._1 at the end!)
>>>
>>>
>>> On Thu, Jan 23, 2014 at 5:28 PM, Andrew Ash <[email protected]> wrote:
>>>
>>>> You're thinking like this?
>>>>
>>>> A.map(v => (v,None)).join(B.map(v => (v,None))).map(_._2)
>>>>
>>>>
>>>> On Thu, Jan 23, 2014 at 6:26 PM, Evan R. Sparks
>>>> <[email protected]> wrote:
>>>>
>>>>> You could map each to an RDD[(String,None)] and do a join.
>>>>>
>>>>>
>>>>> On Thu, Jan 23, 2014 at 5:18 PM, Andrew Ash <[email protected]> wrote:
>>>>>
>>>>>> Hi spark users,
>>>>>>
>>>>>> I recently wanted to calculate the set intersection of two RDDs of
>>>>>> Strings. I couldn't find a .intersection() method in the autocomplete or
>>>>>> in the Scala API docs, so used a little set theory to end up with this:
>>>>>>
>>>>>> lazy val A = ...
>>>>>> lazy val B = ...
>>>>>> A.union(B).subtract(A.subtract(B)).subtract(B.subtract(A))
>>>>>>
>>>>>> Which feels very cumbersome.
>>>>>>
>>>>>> Does anyone have a more idiomatic way to calculate intersection?
>>>>>>
>>>>>> Thanks!
>>>>>> Andrew
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>>
>
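
For reference, here is a minimal sketch of the join approach discussed in this thread, with Evan's `_._1` correction applied, next to a `cogroup`-based variant that might be cheaper when values repeat, since it only checks that both sides are non-empty for a key and never builds the per-key cross product a join produces. The names `IntersectSketch`, `intersectViaJoin` and `intersectViaCogroup` are hypothetical and exist only for this illustration; `()` is used as the placeholder value where the thread used `None`. It assumes a Spark version whose pair-RDD API provides `join`, `cogroup` and `keys`, and it has not been benchmarked.

```scala
import org.apache.spark.SparkContext._ // pair-RDD implicits (required on older Spark releases)
import org.apache.spark.rdd.RDD

// Hypothetical helper object, for illustration only.
object IntersectSketch {

  // Join-based intersection as suggested in the thread, keeping the key (_._1).
  // A join emits one row per matching pair, so both sides are de-duplicated
  // first to get set semantics.
  def intersectViaJoin(a: RDD[String], b: RDD[String]): RDD[String] =
    a.distinct().map(v => (v, ()))
      .join(b.distinct().map(v => (v, ())))
      .map(_._1)

  // cogroup-based variant: each key appears once in the result, together with
  // the values seen on each side; keep the keys present on both sides.
  def intersectViaCogroup(a: RDD[String], b: RDD[String]): RDD[String] =
    a.map(v => (v, ()))
      .cogroup(b.map(v => (v, ())))
      .filter { case (_, (left, right)) => left.nonEmpty && right.nonEmpty }
      .keys
}
```

With `A` and `B` as in the original question, `IntersectSketch.intersectViaCogroup(A, B)` would stand in for the union/subtract expression. Both variants return each common element once, whereas the union/subtract expression keeps duplicates.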
