Re: Filter data from one RDD based on data from another RDD

2015-02-25 Thread Himanish Kushary
Hello Imran, Thanks for your response. I noticed the "intersection" and "subtract" methods on an RDD. Do they work based on a hash of all the fields in an RDD record? - Himanish On Thu, Feb 19, 2015 at 6:11 PM, Imran Rashid wrote: > the more scalable alternative is to do a join (or a variant
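
As a rough illustration of the question above: to my understanding, "intersection" and "subtract" on an RDD compare whole elements (equality plus hash of the entire record) rather than a single field, so two records that share a key but differ in any other field are treated as different. The sketch below models that behaviour in plain Python, not Spark; the function names mirror the RDD methods, and the sample tuples are hypothetical.

```python
# Plain-Python model of whole-record comparison; this is NOT Spark code.
# Records are tuples, so equality and hashing cover every field.

def intersection(left, right):
    """Distinct records that occur in both inputs, compared as whole records."""
    return set(left) & set(right)

def subtract(left, right):
    """Records of `left` with no equal record in `right` (left duplicates kept)."""
    right_set = set(right)
    return [rec for rec in left if rec not in right_set]

# Hypothetical sample data: (key, field, value)
left = [("k1", "a", "1"), ("k2", "b", "2"), ("k2", "b", "2")]
right = [("k1", "x", "9"), ("k2", "b", "2")]

common = intersection(left, right)   # only the fully identical record matches
remaining = subtract(left, right)    # "k1" survives: same key, different fields
```

If matching on one field only is what is needed, the records would have to be keyed on that field first, which is what the join-based suggestion in this thread does.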

Re: Filter data from one RDD based on data from another RDD

2015-02-19 Thread Imran Rashid
The more scalable alternative is to do a join (or a variant such as cogroup, leftOuterJoin, subtractByKey, etc., found in PairRDDFunctions). The downside is that this requires a shuffle of both your RDDs. On Thu, Feb 19, 2015 at 3:36 PM, Himanish Kushary wrote: > Hi, > > I have two RDD's with csv data as b
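
A minimal sketch of what the join-style approach computes, written in plain Python rather than Spark so it runs anywhere. It assumes the filter key is the first CSV field (the original question is truncated, so that is a guess), and the helper names and sample rows are hypothetical. In Spark itself, the same shape would be built by keying both RDDs and calling join or subtractByKey, which shuffles both sides.

```python
# Plain-Python model of keyed filtering; this is NOT Spark code.

def key_by_first_field(lines):
    """Turn CSV lines into (key, line) pairs; the key is the first field (assumed)."""
    return [(line.split(",", 1)[0], line) for line in lines]

def keep_matching(left, right):
    """Join-like filter: keep left values whose key also appears in right."""
    right_keys = {key for key, _ in right}
    return [value for key, value in left if key in right_keys]

def subtract_by_key(left, right):
    """Keep left values whose key does NOT appear in right."""
    right_keys = {key for key, _ in right}
    return [value for key, value in left if key not in right_keys]

# Hypothetical sample rows standing in for the two CSV data sets
rdd1 = key_by_first_field(["a,x,1", "b,y,2", "b,z,3"])
rdd2 = key_by_first_field(["b,other,9"])

matched = keep_matching(rdd1, rdd2)      # rows of rdd1 whose key is in rdd2
unmatched = subtract_by_key(rdd1, rdd2)  # rows of rdd1 whose key is not in rdd2
```

The set of right-hand keys here plays the role that the shuffle plays in Spark: it brings records with the same key together so they can be compared.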

Filter data from one RDD based on data from another RDD

2015-02-19 Thread Himanish Kushary
Hi, I have two RDDs with CSV data as below:
RDD-1
101970_5854301840,fbcf5485-e696-4100-9468-a17ec7c5bb43,19229261643
101970_5854301839,fbaf5485-e696-4100-9468-a17ec7c5bb39,9229261645
101970_5854301839,fbbf5485-e696-4100-9468-a17ec7c5bb39,9229261647
101970_17038953,546853f9-cf07-4700-b202-00f21