What about .groupBy doesn't work for you?

2015-05-07 8:17 GMT-04:00 Night Wolf <nightwolf...@gmail.com>:
> MyClass is a basic Scala case class (using Spark 1.3.1):
>
> case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int, ipi: Double) {
>   override def hashCode(): Int = crn.hashCode()
> }
>
> On Wed, May 6, 2015 at 8:09 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> How does your MyClass look? I was experimenting with the Row class in
>> Python, and apparently partitionBy automatically takes the first column
>> as the key. However, I am not sure how you can access part of an object
>> without deserializing it (either explicitly, or Spark doing it for you).
>>
>> On Wed, May 6, 2015 at 7:14 PM, Night Wolf <nightwolf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> If I have an RDD[MyClass] and I want to partition it by the hash code
>>> of MyClass for performance reasons, is there any way to do this without
>>> converting it into a pair RDD (RDD[(K, V)]) and calling partitionBy?
>>>
>>> Mapping it to a Tuple2 seems like a waste of space/computation.
>>>
>>> It looks like PairRDDFunctions.partitionBy() uses a ShuffledRDD[K, V, C],
>>> which requires K, V, C. Could I create a new
>>> ShuffledRDD[MyClass, MyClass, MyClass](caseClassRdd, new HashPartitioner(...))?
>>>
>>> Cheers,
>>> N
>>
>> --
>> Best Regards,
>> Ayan Guha
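For anyone following along: Spark's HashPartitioner places a key by `key.hashCode` modulo the partition count (made non-negative), so with the `Result` class above, all records sharing a `crn` land in the same partition. A minimal sketch of that placement logic (plain Scala, no cluster needed; the partition count of 16 is just an illustrative value). The usual route on 1.3.x is to key the RDD by the object itself, e.g. `rdd.map(r => (r, null)).partitionBy(new HashPartitioner(16)).keys`, rather than constructing a ShuffledRDD directly:

```scala
// The case class from the thread: hashCode is delegated to crn,
// so two Results with the same crn hash identically.
case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int, ipi: Double) {
  override def hashCode(): Int = crn.hashCode()
}

// Sketch of HashPartitioner's placement rule: hashCode mod numPartitions,
// adjusted so negative hash codes still map to a valid partition index.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Two Results sharing crn = 42 but differing elsewhere:
val a = Result(42L, 1, 10, 100, 1.0)
val b = Result(42L, 2, 11, 101, 2.0)

// Both resolve to the same partition out of 16.
val pa = nonNegativeMod(a.hashCode, 16)
val pb = nonNegativeMod(b.hashCode, 16)
```

The tuple does cost a small per-record wrapper during the shuffle, but the dummy `null` value is cheap to serialize, and `.keys` drops it again afterwards.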