Jiacheng, if you're OK with using the Shark layer above Spark (and I think for many use cases the answer would be "yes"), then you can take advantage of Shark's co-partitioning. Or do something like https://github.com/amplab/shark/pull/100/commits
Sent while mobile. Pls excuse typos etc.

On Nov 16, 2013 2:48 AM, "guojc" <[email protected]> wrote:

> Hi Meisam,
>
> What I want to achieve here is a bit tricky. Basically, I'm trying to
> implement per-split semi-join in an iterative algorithm with Spark. It's a
> very efficient join strategy for highly imbalanced data sets and provides
> a huge gain over a normal join in that situation.
>
> Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y, Z)], both
> loaded directly from HDFS, so both will have a partitioner of None. X is a
> large, complicated struct containing a set of join keys of type Y. First,
> for each partition of a, I extract the join keys Y from every instance of
> X in that partition and construct a hash set of join keys Y paired with
> the partition ID. Now I have a new RDD c: RDD[(Y, PartitionID)]. I join it
> with b on Y, then construct an RDD d: RDD[(PartitionID, Map[Y, Z])] by
> grouping on PartitionID and building the map from Y to Z. For each
> partition of a, I want to repartition it according to its partition ID,
> which gives an RDD e: RDD[(PartitionID, X)]. As d and e will have the same
> partitioner and the same keys, they will be joined very efficiently.
>
> The key ability I want here is to cache RDD c with the same partitioner as
> RDD b, and to cache e, so that the later joins with b and d will be
> efficient, because the values of b will be updated from time to time and
> d's content will change accordingly. It would also be nice to be able to
> repartition a by its original partition ID without actually shuffling
> across the network.
>
> You can refer to
> http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf
> for per-split semi-join's details.
>
> Best Regards,
> Jiacheng Guo
>
> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi <[email protected]> wrote:
>
>> Hi guojc,
>>
>> It is not clear to me what problem you are trying to solve. What do
>> you want to do with the result of your
>> groupByKey(myPartitioner).flatMapValues(x => x)?
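[For the archives: the pipeline described above could be sketched roughly as follows. The names a, b, c, d, e follow the email; the record type X and the key/value types are hypothetical stand-ins, and this targets the 0.8-era pair-RDD API (`mapPartitionsWithIndex`, `partitionBy`), so treat it as a sketch rather than a tested implementation.]

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed before Spark 1.3)
import org.apache.spark.rdd.RDD

// Hypothetical record: each X carries a set of join keys (here String).
case class X(keys: Set[String], payload: String)

def perSplitSemiJoin(a: RDD[X], b: RDD[(String, Int)]): RDD[(Int, (Map[String, Int], X))] = {
  val p = new HashPartitioner(a.partitions.length)

  // c: (join key, originating partition id) -- one entry per distinct key per split.
  val c: RDD[(String, Int)] = a.mapPartitionsWithIndex { (pid, it) =>
    it.flatMap(_.keys).toSet.iterator.map(k => (k, pid))
  }

  // d: for each partition id, the map of matching (key, value) pairs from b.
  val d: RDD[(Int, Map[String, Int])] =
    c.join(b)                                   // (key, (pid, value))
      .map { case (k, (pid, v)) => (pid, (k, v)) }
      .groupByKey(p)
      .mapValues(_.toMap)

  // e: a keyed by its own partition id, laid out with the same partitioner as d.
  val e: RDD[(Int, X)] =
    a.mapPartitionsWithIndex((pid, it) => it.map(x => (pid, x))).partitionBy(p)

  // d and e share p, so this join needs no further shuffle of either side.
  d.join(e)
}
```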
>> Do you want to use it in a join? Do you want to save it to your file
>> system? Or do you want to do something else with it?
>>
>> Thanks,
>> Meisam
>>
>> On Fri, Nov 15, 2013 at 12:56 PM, guojc <[email protected]> wrote:
>>
>>> Hi Meisam,
>>>
>>> Thank you for the response. I know each RDD has a partitioner. What I
>>> want to achieve here is to re-partition a piece of data according to my
>>> custom partitioner. Currently I do that with
>>> groupByKey(myPartitioner).flatMapValues(x => x), but I'm a bit worried
>>> whether this will create additional temporary object collections, as
>>> the result is first made into a Seq and then a collection of tuples.
>>> Any suggestion?
>>>
>>> Best Regards,
>>> Jiacheng Guo
>>>
>>> On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <[email protected]> wrote:
>>>
>>>> Hi Jiacheng,
>>>>
>>>> Each RDD has a partitioner. You can define your own partitioner if
>>>> the default partitioner does not suit your purpose. You can take a
>>>> look at
>>>> http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf.
>>>>
>>>> Thanks,
>>>> Meisam
>>>>
>>>> On Fri, Nov 15, 2013 at 6:54 AM, guojc <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm wondering whether a Spark RDD could have a partitionedByKey
>>>>> function? The use of this function would be to have an RDD
>>>>> distributed according to a certain partitioner and cache it. Further
>>>>> joins between RDDs with the same partitioner would then get a great
>>>>> speedup. Currently, we only have a groupByKey function, which
>>>>> generates a Seq of the desired type, and that is not very
>>>>> convenient.
>>>>>
>>>>> Btw, sorry for the last empty-body email. I mistakenly hit the send
>>>>> shortcut.
>>>>>
>>>>> Best Regards,
>>>>> Jiacheng Guo
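[For readers landing on this thread later: Spark's pair RDDs do expose `partitionBy`, which is essentially the requested `partitionedByKey` — it shuffles once into the desired layout without building a per-key Seq the way groupByKey(myPartitioner).flatMapValues(x => x) does. A minimal sketch against the 0.8-era API; the sample data and app name are made up:]

```scala
import org.apache.spark.{HashPartitioner, SparkContext}
import org.apache.spark.SparkContext._ // pair-RDD implicits (needed before Spark 1.3)

val sc = new SparkContext("local", "partitionBy-demo")

val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

// Shuffle once into the desired layout, then cache; no per-key Seq is materialized.
val partitioner = new HashPartitioner(4)
val byKey = pairs.partitionBy(partitioner).cache()

val other = sc.parallelize(Seq(("a", "x"), ("b", "y"))).partitionBy(partitioner)

// Both sides share the same partitioner, so this join avoids a further shuffle.
val joined = byKey.join(other)
```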
