Hi Meisam,
     What I want to achieve here is a bit tricky. Basically, I'm trying to
implement per-split semi-join in an iterative algorithm with Spark. It's a
very efficient join strategy for highly imbalanced data sets and provides a
huge gain over a normal join in that situation.

     Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y, Z)], both
loaded directly from HDFS, so both will have a partitioner of None. X is a
large, complicated struct containing a set of join keys of type Y. First,
for each partition of a, I extract the join key Y from every instance of X
in that partition and construct a hash set of join keys paired with the
partition ID. Now I have a new RDD c: RDD[(Y, PartitionID)]. I join it with
b on Y, and then construct an RDD d: RDD[(PartitionID, Map[Y, Z])] by
grouping on PartitionID and building a map from Y to Z. As for a itself, I
want to repartition it according to its own partition ID, so it becomes an
RDD e: RDD[(PartitionID, X)]. Since d and e will have the same partitioner
and the same keys, they will be joined very efficiently.
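
     In Spark terms, the steps above might be sketched roughly as follows.
This is only an illustration under stated assumptions: the types X, Y, Z
and the accessor x.joinKeys are hypothetical stand-ins, not part of any
existing API, and the partitioner choice is arbitrary.

```scala
import org.apache.spark.HashPartitioner

// Assumed inputs (hypothetical types): a: RDD[X], b: RDD[(Y, Z)],
// where each X exposes its join keys via a hypothetical x.joinKeys: Set[Y].

// Step 1: c = the per-partition join-key sets of a, tagged with the
// partition ID of the partition each key came from.
val c = a.mapPartitionsWithIndex { (pid, iter) =>
  val keys = scala.collection.mutable.HashSet.empty[Y]
  iter.foreach(x => keys ++= x.joinKeys)
  keys.iterator.map(y => (y, pid))
}                                              // RDD[(Y, PartitionID)]

// Step 2: d = join c with b on Y, then group by partition ID and collect
// a Map[Y, Z] for each partition of a.
val part = new HashPartitioner(a.partitions.size)
val d = c.join(b)                              // RDD[(Y, (PartitionID, Z))]
  .map { case (y, (pid, z)) => (pid, (y, z)) }
  .groupByKey(part)
  .mapValues(pairs => pairs.toMap)             // RDD[(PartitionID, Map[Y, Z])]

// Step 3: e = a keyed by its own partition ID and hashed the same way as d,
// so the final join is between co-partitioned RDDs.
val e = a.mapPartitionsWithIndex { (pid, iter) => iter.map(x => (pid, x)) }
  .partitionBy(part)

val joined = e.join(d)                         // RDD[(PartitionID, (X, Map[Y, Z]))]
```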

    The key ability I want here is to cache RDD c with the same partitioner
as RDD b, and to cache e, so that the later joins with b and d stay
efficient: the value of b will be updated from time to time, and d's
content will change accordingly. It would also be nice to be able to
repartition a by its original partition ID without actually shuffling
across the network.
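
    Assuming b is given an explicit partitioner, the caching intent could
be sketched like this, where c and e are the RDDs described above; loadB()
is purely a hypothetical stand-in for however b gets refreshed:

```scala
import org.apache.spark.HashPartitioner

val part = new HashPartitioner(64)             // partition count is an arbitrary choice

// Hash c once with the same partitioner b will use, and keep both c and e
// (assumed already keyed and partitioned by PartitionID) in memory.
val cCached = c.partitionBy(part).cache()
val eCached = e.cache()

// Each time b's content is refreshed, only the new b data needs to shuffle;
// cCached stays put because it already shares b's partitioner.
val bNew = loadB().partitionBy(part)           // loadB() is hypothetical
val dNew = cCached.join(bNew)
  .map { case (y, (pid, z)) => (pid, (y, z)) }
  .groupByKey(part)
  .mapValues(_.toMap)

val joined = eCached.join(dNew)
```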

You can refer to
http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf
for the details of per-split semi-join.

Best Regards,
Jiacheng Guo


On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi <[email protected]> wrote:

> Hi guojc,
>
> It is not clear to me what problem you are trying to solve. What do
> you want to do with the result of your
> groupByKey(myPartitioner).flatMapValues( x=>x)? Do you want to use it
> in a join? Do you want to save it to your file system? Or do you want
> to do something else with it?
>
> Thanks,
> Meisam
>
> On Fri, Nov 15, 2013 at 12:56 PM, guojc <[email protected]> wrote:
> > Hi Meisam,
> >     Thank you for your response. I know each RDD has a partitioner.
> > What I want to achieve here is to re-partition a piece of data
> > according to my custom partitioner. Currently I do that by
> > groupByKey(myPartitioner).flatMapValues(x => x). But I'm a bit worried
> > whether this will create an additional temporary object collection, as
> > the result is first made into a Seq and then a collection of tuples.
> > Any suggestions?
> >
> > Best Regards,
> > Jiacheng Guo
> >
> >
> > On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <[email protected]>
> > wrote:
> >>
> >> Hi Jiacheng,
> >>
> >> Each RDD has a partitioner. You can define your own partitioner if the
> >> default partitioner does not suit your purpose.
> >> You can take a look at this
> >>
> >>
> http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
> .
> >>
> >> Thanks,
> >> Meisam
> >>
> >> On Fri, Nov 15, 2013 at 6:54 AM, guojc <[email protected]> wrote:
> >> > Hi,
> >> >   I'm wondering whether a Spark RDD can have a partitionedByKey
> >> > function? The use of this function would be to have an RDD
> >> > distributed according to a certain partitioner and then cached.
> >> > Further joins between RDDs with the same partitioner would then get
> >> > a great speedup. Currently, we only have a groupByKey function,
> >> > which generates a Seq of the desired type, and that is not very
> >> > convenient.
> >> >
> >> > Btw, sorry for the last empty-body email. I mistakenly hit the send
> >> > shortcut.
> >> >
> >> >
> >> > Best Regards,
> >> > Jiacheng Guo
> >
> >
>
