Jiacheng, if you're OK with using the Shark layer above Spark (and I think
for many use cases the answer would be "yes"), then you can take advantage
of Shark's co-partitioning. Or do something like
https://github.com/amplab/shark/pull/100/commits

Sent while mobile. Pls excuse typos etc.
On Nov 16, 2013 2:48 AM, "guojc" <[email protected]> wrote:

> Hi Meisam,
>      What I want to achieve here is a bit tricky. Basically, I'm trying to
> implement a per-split semi-join in an iterative algorithm with Spark. It's a
> very efficient join strategy for highly imbalanced data sets and provides a
> huge gain over a normal join in that situation.
>
>      Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y, Z)], both
> loaded directly from HDFS, so both will have a partitioner of None. X is a
> large, complicated struct containing a set of join keys of type Y. First,
> for each partition of a, I extract the join key Y from every instance of X
> in that partition and construct a hash set of join keys paired with the
> partition ID. Now I have a new RDD c: RDD[(Y, PartitionID)]. I join it with
> b on Y and then construct an RDD d: RDD[(PartitionID, Map[Y, Z])] by
> grouping on PartitionID and building a map from Y to Z. Then, for each
> partition of a, I repartition it according to its partition ID, and it
> becomes an RDD e: RDD[(PartitionID, X)]. As d and e will have the same
> partitioner and the same keys, they will be joined very efficiently.
>
>     The key ability I want here is to cache RDD c with the same
> partitioner as RDD b, and to cache e, so that later joins with b and d will
> be efficient, because the values of b will be updated from time to time and
> d's contents will change accordingly. It would also be nice to be able to
> repartition a by its original partition ID without actually shuffling
> across the network.
>
> You can refer to
> http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf
> for the details of the per-split semi-join.
>
> Best Regards,
> Jiacheng Guo
>
>
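The per-split semi-join dataflow described above can be sketched with plain Scala collections standing in for the RDDs (all names and data here are illustrative; a real implementation would use Spark transformations such as mapPartitionsWithIndex, join, and groupByKey):

```scala
// a: records X keyed by their original partition ID; each X carries a
// set of join keys (the Y values).
case class X(id: Int, keys: Set[String])

val a: Map[Int, Seq[X]] = Map(               // PartitionID -> records
  0 -> Seq(X(1, Set("k1", "k2"))),
  1 -> Seq(X(2, Set("k2", "k3")))
)
// b: the (Y, Z) side of the join.
val b: Map[String, Int] = Map("k1" -> 10, "k2" -> 20, "k3" -> 30)

// c: (Y, PartitionID) pairs extracted from each partition of a.
val c: Seq[(String, Int)] = a.toSeq.flatMap {
  case (pid, xs) => xs.flatMap(_.keys).distinct.map(y => (y, pid))
}

// d: PartitionID -> Map[Y, Z], built by joining c with b on Y and
// grouping on PartitionID.
val d: Map[Int, Map[String, Int]] = c
  .collect { case (y, pid) if b.contains(y) => (pid, y -> b(y)) }
  .groupBy(_._1)
  .map { case (pid, pairs) => pid -> pairs.map(_._2).toMap }

// e is a re-keyed by PartitionID; here a is already in that shape.
val e: Map[Int, Seq[X]] = a

// Final join: each partition of e only consults the (Y, Z) entries
// that were shipped to it in d.
val joined: Map[Int, Seq[(Int, Map[String, Int])]] = e.map {
  case (pid, xs) =>
    val local = d.getOrElse(pid, Map.empty)
    pid -> xs.map(x => x.id -> x.keys.flatMap(y => local.get(y).map(y -> _)).toMap)
}
```

The point of the technique shows up in the last step: each partition of e sees only the small map of (Y, Z) entries that its own join keys pulled over, rather than all of b.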
> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi <[email protected]> wrote:
>
>> Hi guojc,
>>
>> It is not clear to me what problem you are trying to solve. What do
>> you want to do with the result of your
>> groupByKey(myPartitioner).flatMapValues(x => x)? Do you want to use it
>> in a join? Do you want to save it to your file system? Or do you want
>> to do something else with it?
>>
>> Thanks,
>> Meisam
>>
>> On Fri, Nov 15, 2013 at 12:56 PM, guojc <[email protected]> wrote:
>> > Hi Meisam,
>> >     Thank you for the response. I know each RDD has a partitioner. What
>> > I want to achieve here is to re-partition a piece of data according to
>> > my custom partitioner. Currently I do that by
>> > groupByKey(myPartitioner).flatMapValues(x => x). But I'm a bit worried
>> > whether this will create an additional temporary object collection, as
>> > the result is first made into a Seq and then a collection of tuples.
>> > Any suggestion?
>> >
>> > Best Regards,
>> > Jiacheng Guo
>> >
>> >
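For what it's worth, Spark's pair-RDD API does expose a partitionBy(partitioner) method that shuffles records by key without first collecting each key's values into a Seq. The difference between the two routes can be sketched with plain Scala collections (illustrative only; this is not the actual RDD machinery):

```scala
val data = Seq((1, "a"), (2, "b"), (1, "c"))
val numPartitions = 2
def part(k: Int): Int = math.abs(k) % numPartitions

// Route 1: groupByKey(p).flatMapValues(x => x) — materializes an
// intermediate Seq of values per key before flattening it back out.
val grouped: Map[Int, Seq[String]] =
  data.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
val route1: Seq[(Int, String)] =
  grouped.toSeq.flatMap { case (k, vs) => vs.map(k -> _) }

// Route 2: partitionBy-style — records are only bucketed by partition,
// with no per-key collection built along the way.
val route2: Map[Int, Seq[(Int, String)]] =
  data.groupBy { case (k, _) => part(k) }
```

Both routes end up with the same records; route 2 just never builds the per-key Seq that the question worries about.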
>> > On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <[email protected]>
>> > wrote:
>> >>
>> >> Hi Jiacheng,
>> >>
>> >> Each RDD has a partitioner. You can define your own partitioner if the
>> >> default partitioner does not suit your purpose.
>> >> You can take a look at this
>> >>
>> >>
>> http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
>> .
>> >>
>> >> Thanks,
>> >> Meisam
>> >>
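A custom partitioner along the lines Meisam suggests only needs two members. Below is a minimal plain-Scala sketch of that contract (org.apache.spark.Partitioner has the same shape; a real implementation would extend it), with a partitioner that routes each record to the partition ID carried in its key, matching the repartition-by-original-partition-ID idea from the thread. The class name is illustrative:

```scala
// Stand-in for org.apache.spark.Partitioner so the sketch runs without
// a Spark dependency; the real implementation would extend that class.
trait Partitioner extends Serializable {
  def numPartitions: Int
  def getPartition(key: Any): Int
}

// Routes each (PartitionID, X) record to the partition ID stored in its
// key, so records keyed by their original partition stay together.
class PartitionIdPartitioner(val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case pid: Int => math.abs(pid) % numPartitions
    case other    => math.abs(other.hashCode) % numPartitions
  }
}

val p = new PartitionIdPartitioner(4)
```

Passing such a partitioner to partitionBy (or groupByKey) on e: RDD[(PartitionID, X)] would pin each record to the partition named in its key.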
>> >> On Fri, Nov 15, 2013 at 6:54 AM, guojc <[email protected]> wrote:
>> >> > Hi,
>> >> >   I'm wondering whether a Spark RDD can have a partitionedByKey
>> >> > function? The use of this function would be to have an RDD
>> >> > distributed according to a certain partitioner and cache it. Then a
>> >> > further join with an RDD with the same partitioner would get a great
>> >> > speed-up. Currently, we only have a groupByKey function, which
>> >> > generates a Seq of the desired type, which is not very convenient.
>> >> >
>> >> > Btw, sorry for the last empty-body email. I mistakenly hit the send
>> >> > shortcut.
>> >> >
>> >> >
>> >> > Best Regards,
>> >> > Jiacheng Guo
>> >
>> >
>>
>
>
