After looking at the API more carefully, I found that I had overlooked the
partitionBy function on PairRDDFunctions. It's exactly the function I need.
Sorry for the confusion.
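For reference, a minimal sketch of the partitionBy approach, assuming a live SparkContext `sc` (Spark 0.8-era API; the sample data and partition count are illustrative only):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

val part = new HashPartitioner(16)

// partitionBy shuffles the raw (K, V) pairs once and remembers the partitioner.
val b = sc.parallelize(Seq(("y1", 1), ("y2", 2))).partitionBy(part).cache()
val c = sc.parallelize(Seq(("y1", 7), ("y3", 8))).partitionBy(part).cache()

// Both sides now carry the same partitioner, so this join needs no further shuffle.
val joined = b.join(c)
```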

Best Regards,
Jiacheng Guo


On Sat, Nov 16, 2013 at 3:59 AM, Christopher Nguyen <[email protected]> wrote:

> Jiacheng, if you're OK with using the Shark layer above Spark (and I think
> for many use cases the answer would be "yes"), then you can take advantage
> of Shark's co-partitioning. Or do something like
> https://github.com/amplab/shark/pull/100/commits
>
> Sent while mobile. Pls excuse typos etc.
> On Nov 16, 2013 2:48 AM, "guojc" <[email protected]> wrote:
>
>> Hi Meisam,
>>      What I want to achieve here is a bit tricky. Basically, I'm trying to
>> implement a per-split semi-join in an iterative algorithm with Spark. It's a
>> very efficient join strategy for highly imbalanced data sets and provides a
>> huge gain over a normal join in that situation.
>>
>>      Let's say we have two large RDDs, a: RDD[X] and b: RDD[(Y, Z)], both
>> loaded directly from HDFS, so both of them have a partitioner of None. X is
>> a large, complicated struct that contains a set of join keys of type Y.
>> First, for each partition of a, I extract the join keys Y from every
>> instance of X in that partition and build a hash set of (join key Y,
>> partition ID), giving a new RDD c: RDD[(Y, PartitionID)]. I join c with b
>> on Y, then construct d: RDD[(PartitionID, Map[Y, Z])] by grouping on
>> PartitionID and building a map from Y to Z. As for a, I want to repartition
>> it according to its own partition ID, so it becomes e: RDD[(PartitionID, X)].
>> Since d and e have the same partitioner and the same keys, they will be
>> joined very efficiently.
>>
>>     The key ability I want here is to cache RDD c with the same partitioner
>> as RDD b, and to cache e, so that later joins with b and d will be
>> efficient; the value of b will be updated from time to time, and d's
>> content will change accordingly. It would also be nice to be able to
>> repartition a by its original partition ID without actually shuffling
>> across the network.
>>
>> You can refer to
>> http://researcher.watson.ibm.com/researcher/files/us-ytian/hadoopjoin.pdf
>> for the details of the per-split semi-join.
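The flow described above could be sketched roughly as follows in Spark's Scala API (a sketch only, assuming a SparkContext `sc`, existing RDDs `a` and `b` as described, a Spark 0.8-era API, and a hypothetical `x.joinKeys: Set[Y]` accessor on X):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

val nParts = a.partitions.size
val part = new HashPartitioner(nParts)

// c: RDD[(Y, PartitionID)] — one (key, split) pair per distinct key per split.
val c = a.mapPartitionsWithIndex { (pid, it) =>
  it.flatMap(x => x.joinKeys).toSet.iterator.map((y: Y) => (y, pid))
}

// d: RDD[(PartitionID, Map[Y, Z])] — ship only the matching b-rows to each split.
val d = c.join(b)                                  // (Y, (PartitionID, Z))
         .map { case (y, (pid, z)) => (pid, (y, z)) }
         .groupByKey(part)
         .mapValues(_.toMap)                       // partitioner is preserved

// e: RDD[(PartitionID, X)] — a keyed by its own split id. With
// HashPartitioner(nParts) each pid maps back to its own partition, although
// Spark still plans this as a shuffle stage.
val e = a.mapPartitionsWithIndex { (pid, it) => it.map(x => (pid, x)) }
         .partitionBy(part)

// d and e share a partitioner and key space, so this join avoids
// re-shuffling a's large X records.
val result = e.join(d)
```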
>>
>> Best Regards,
>> Jiacheng Guo
>>
>>
>> On Sat, Nov 16, 2013 at 3:02 AM, Meisam Fathi <[email protected]> wrote:
>>
>>> Hi guojc,
>>>
>>> It is not clear to me what problem you are trying to solve. What do
>>> you want to do with the result of your
>>> groupByKey(myPartitioner).flatMapValues( x=>x)? Do you want to use it
>>> in a join? Do you want to save it to your file system? Or do you want
>>> to do something else with it?
>>>
>>> Thanks,
>>> Meisam
>>>
>>> On Fri, Nov 15, 2013 at 12:56 PM, guojc <[email protected]> wrote:
>>> > Hi Meisam,
>>> >     Thank you for the response. I know each RDD has a partitioner. What
>>> > I want to achieve here is to re-partition a piece of data according to
>>> > my custom partitioner. Currently I do that with
>>> > groupByKey(myPartitioner).flatMapValues(x => x), but I'm a bit worried
>>> > that this creates an additional temporary collection, since the result
>>> > is first materialized as a Seq and then as a collection of tuples.
>>> > Any suggestions?
>>> >
>>> > Best Regards,
>>> > Jiacheng Guo
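To make the concern above concrete, a minimal sketch contrasting the two routes (assuming a live SparkContext `sc`; data and partition count are illustrative):

```scala
import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._  // implicit conversion to PairRDDFunctions

val part = new HashPartitioner(16)
val pairs = sc.parallelize(Seq(("k1", 1), ("k2", 2), ("k1", 3)))

// Route A: buffers all values for each key into a Seq, then flattens it back out.
val viaGroup = pairs.groupByKey(part).flatMapValues(x => x)

// Route B: partitionBy just reshuffles the raw pairs, with no per-key buffering,
// and the result still carries `part` for later shuffle-free joins.
val viaPartitionBy = pairs.partitionBy(part)
```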
>>> >
>>> >
>>> > On Sat, Nov 16, 2013 at 12:24 AM, Meisam Fathi <[email protected]> wrote:
>>> >>
>>> >> Hi Jiacheng,
>>> >>
>>> >> Each RDD has a partitioner. You can define your own partitioner if the
>>> >> default partitioner does not suit your purpose.
>>> >> You can take a look at these slides:
>>> >> http://ampcamp.berkeley.edu/wp-content/uploads/2012/06/matei-zaharia-amp-camp-2012-advanced-spark.pdf
>>> >>
>>> >> Thanks,
>>> >> Meisam
>>> >>
>>> >> On Fri, Nov 15, 2013 at 6:54 AM, guojc <[email protected]> wrote:
>>> >> > Hi,
>>> >> >   I'm wondering whether Spark RDDs could have a partitionedByKey
>>> >> > function. The use of this function would be to distribute an RDD
>>> >> > according to a certain partitioner and cache it; further joins with
>>> >> > RDDs that share the same partitioner would then get a great speedup.
>>> >> > Currently we only have groupByKey, which generates a Seq of the
>>> >> > desired type, which is not very convenient.
>>> >> >
>>> >> > Btw, sorry for the last empty-body email. I mistakenly hit the send
>>> >> > shortcut.
>>> >> >
>>> >> >
>>> >> > Best Regards,
>>> >> > Jiacheng Guo
>>> >
>>> >
>>>
>>
>>
