After looking at the API more carefully, I just found that I overlooked the partitionBy function on PairRDDFunctions. It's the function I need. Sorry for the confusion.
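For anyone who finds this thread later, a minimal sketch of the partitionBy usage being discussed, assuming Spark is on the classpath and run in local mode (the object name, app name, and sample data are illustrative, not from the thread):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}

object PartitionByExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("partitionBy-demo").setMaster("local[2]"))

    val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))

    // partitionBy (from PairRDDFunctions) shuffles once and records the
    // partitioner on the resulting RDD; cache() keeps the partitioned data
    // in memory so later joins against it avoid re-shuffling this side.
    val partitioned = pairs.partitionBy(new HashPartitioner(4)).cache()

    val other = sc.parallelize(Seq(("a", "x"), ("b", "y")))
      .partitionBy(new HashPartitioner(4))

    // Both sides share the same partitioner, so the join needs no shuffle.
    partitioned.join(other).collect().foreach(println)
    sc.stop()
  }
}
```

This is exactly the "repartition by a certain partitioner and cache it" pattern asked about at the start of the thread, without the intermediate Seq that the groupByKey workaround creates.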
Best Regards,
Jiacheng Guo

On Sat, Nov 16, 2013 at 3:59 AM, Christopher Nguyen <[email protected]> wrote:

> Jiacheng, if you're OK with using the Shark layer above Spark (and I think
> for many use cases the answer would be "yes"), then you can take advantage
> of Shark's co-partitioning. Or do something like
> https://github.com/amplab/shark/pull/100/commits
>
> Sent while mobile. Pls excuse typos etc.
>
> On Nov 16, 2013 2:48 AM, "guojc" <[email protected]> wrote:
>
>> Hi Meisam,
>>     What I want to achieve here is a bit tricky. Basically, I'm trying to
>> implement per-split semi-join in an iterative algorithm with Spark. It's a
>> very efficient join strategy for highly imbalanced data sets and provides
>> a huge gain over a normal join in that situation.
>>
>>     Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y, Z)], both
>> loaded directly from HDFS, so neither of them has a partitioner. X is a
>> large, complicated struct that contains a set of join keys of type Y.
>> First, for each partition of a, I extract the join keys Y from every
>> instance of X in that partition and construct a hash set of (join key Y,
>> partition ID) pairs. Now I have a new RDD c: RDD[(Y, PartitionID)]. I join
>> it with b on Y and then construct an RDD d: RDD[(PartitionID, Map[Y, Z])]
>> by grouping on PartitionID and building the map from Y to Z. For each
>> partition of a, I also want to repartition it according to its partition
>> ID, which gives an RDD e: RDD[(PartitionID, X)]. Since d and e have the
>> same partitioner and the same keys, they will be joined very efficiently.
>>
>>     The key ability I want here is to cache RDD c with the same
>> partitioner as RDD b, and to cache e, so that the later joins with b and d
>> will be efficient, because the value of b will be updated from time to
>> time and d's content will change accordingly. It would also be nice to be
>> able to repartition a by its original partition ID without actually
>> shuffling across the network.
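The scheme described above can be sketched roughly as follows. This is not a definitive implementation; the element types are placeholders (here X carries a Set of String join keys and Z = Int), the partition counts are arbitrary, and an existing local SparkContext is assumed:

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object PerSplitSemiJoinSketch {
  // Placeholder for the "large complicated struct" X with its join keys.
  case class X(keys: Set[String])

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("per-split-semijoin").setMaster("local[2]"))

    val a: RDD[X] = sc.parallelize(
      Seq(X(Set("k1", "k2")), X(Set("k2", "k3"))), numSlices = 2)
    val b: RDD[(String, Int)] =
      sc.parallelize(Seq("k1" -> 1, "k2" -> 2, "k3" -> 3))

    val p = new HashPartitioner(b.partitions.length)

    // c: (join key, originating partition ID), co-partitioned with b and
    // cached so it can be reused each iteration as b's values change.
    val c: RDD[(String, Int)] = a
      .mapPartitionsWithIndex { (pid, it) =>
        it.flatMap(x => x.keys.map(k => (k, pid))).toSet.iterator
      }
      .partitionBy(p)
      .cache()

    // d: for each partition of a, the map of (key -> value) it needs from b.
    val d: RDD[(Int, Map[String, Int])] = c
      .join(b)                                   // (key, (pid, value))
      .map { case (k, (pid, v)) => (pid, (k, v)) }
      .groupByKey(new HashPartitioner(2))
      .mapValues(_.toMap)

    // e: a keyed by its own partition ID, with the same partitioner as d.
    val e: RDD[(Int, X)] = a
      .mapPartitionsWithIndex { (pid, it) => it.map(x => (pid, x)) }
      .partitionBy(new HashPartitioner(2))
      .cache()

    // d and e share a partitioner and keys, so this join needs no shuffle.
    e.join(d).collect().foreach(println)
    sc.stop()
  }
}
```

Note the last wish in the message (repartitioning a by its own partition ID without network traffic) is not achievable with a plain partitionBy, since HashPartitioner gives no guarantee that key pid lands on the node that held partition pid.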
>>
>> You can refer to
>> http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf
>> for per-split semi-join's details.
>>
>> Best Regards,
>> Jiacheng Guo
>>
>> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi <[email protected]> wrote:
>>
>>> Hi guojc,
>>>
>>> It is not clear to me what problem you are trying to solve. What do
>>> you want to do with the result of your
>>> groupByKey(myPartitioner).flatMapValues(x => x)? Do you want to use it
>>> in a join? Do you want to save it to your file system? Or do you want
>>> to do something else with it?
>>>
>>> Thanks,
>>> Meisam
>>>
>>> On Fri, Nov 15, 2013 at 12:56 PM, guojc <[email protected]> wrote:
>>> > Hi Meisam,
>>> >     Thank you for your response. I know each RDD has a partitioner.
>>> > What I want to achieve here is to re-partition a piece of data
>>> > according to my custom partitioner. Currently I do that with
>>> > groupByKey(myPartitioner).flatMapValues(x => x), but I'm a bit worried
>>> > that this creates an additional temporary object collection, as the
>>> > result is first made into a Seq and then a collection of tuples.
>>> > Any suggestion?
>>> >
>>> > Best Regards,
>>> > Jiacheng Guo
>>> >
>>> > On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <[email protected]>
>>> > wrote:
>>> >>
>>> >> Hi Jiacheng,
>>> >>
>>> >> Each RDD has a partitioner. You can define your own partitioner if the
>>> >> default partitioner does not suit your purpose.
>>> >> You can take a look at this
>>> >> http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
>>> >>
>>> >> Thanks,
>>> >> Meisam
>>> >>
>>> >> On Fri, Nov 15, 2013 at 6:54 AM, guojc <[email protected]> wrote:
>>> >> > Hi,
>>> >> >     I'm wondering whether a Spark RDD can have a partitionedByKey
>>> >> > function? The use of this function would be to have an RDD
>>> >> > distributed according to a certain partitioner and cache it.
>>> >> > Subsequent joins between RDDs with the same partitioner would then
>>> >> > get a great speedup. Currently, we only have the groupByKey
>>> >> > function, which generates a Seq of the desired type, which is not
>>> >> > very convenient.
>>> >> >
>>> >> > Btw, sorry for the last empty-body email. I mistakenly hit the send
>>> >> > shortcut.
>>> >> >
>>> >> > Best Regards,
>>> >> > Jiacheng Guo
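On the "define your own partitioner" suggestion above: a minimal custom Partitioner looks like the sketch below. The class name and the routing rule (isolating keys that start with "hot") are purely illustrative assumptions, not anything from the thread:

```scala
import org.apache.spark.Partitioner

// A minimal custom Partitioner; the skew-aware routing rule is illustrative.
class SkewAwarePartitioner(partitions: Int) extends Partitioner {
  require(partitions > 1, "need at least one hot and one regular partition")

  override def numPartitions: Int = partitions

  override def getPartition(key: Any): Int = key match {
    case s: String if s.startsWith("hot") => 0           // isolate hot keys
    case other => 1 + math.abs(other.hashCode % (partitions - 1))
  }

  // Spark compares partitioners to decide whether a join can skip the
  // shuffle, so equals/hashCode should reflect the routing behaviour.
  override def equals(that: Any): Boolean = that match {
    case p: SkewAwarePartitioner => p.numPartitions == numPartitions
    case _                       => false
  }
  override def hashCode: Int = numPartitions
}
```

With such a class, pairs.partitionBy(new SkewAwarePartitioner(4)).cache() achieves the repartition-and-cache effect asked about here, and avoids the intermediate Seq that groupByKey(myPartitioner).flatMapValues(x => x) materializes, since partitionBy only moves the tuples themselves.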
