What about .groupBy doesn't work for you?

2015-05-07 8:17 GMT-04:00 Night Wolf <nightwolf...@gmail.com>:
> MyClass is a basic Scala case class (using Spark 1.3.1):
>
> case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int, ipi: Double) {
>   override def hashCode(): Int = crn.hashCode()
> }
>
> On Wed, May 6, 2015 at 8:09 PM, ayan guha <guha.a...@gmail.com> wrote:
>
>> How does your MyClass look? I was experimenting with the Row class in
>> Python, and apparently partitionBy automatically takes the first column
>> as the key. However, I am not sure how you can access part of an object
>> without deserializing it (either explicitly, or Spark doing it for you).
>>
>> On Wed, May 6, 2015 at 7:14 PM, Night Wolf <nightwolf...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> If I have an RDD[MyClass] and I want to partition it by the hash code
>>> of MyClass for performance reasons, is there any way to do this without
>>> converting it into a pair RDD (RDD[(K, V)]) and calling partitionBy?
>>>
>>> Mapping it to a Tuple2 seems like a waste of space/computation.
>>>
>>> It looks like PairRDDFunctions.partitionBy() uses a ShuffledRDD[K, V, C],
>>> which requires K, V, C. Could I create a new
>>> ShuffledRDD[MyClass, MyClass, MyClass](caseClassRdd, new HashPartitioner(...))?
>>>
>>> Cheers,
>>> N
>>
>> --
>> Best Regards,
>> Ayan Guha
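For anyone following along: Spark's HashPartitioner places a key by `key.hashCode` modulo the partition count (made non-negative), so with the `Result` class above, all records sharing a `crn` land in the same partition. A minimal sketch of that placement logic (plain Scala, no cluster needed; the partition count of 16 is just an illustrative value). The usual route on 1.3.x is to key the RDD by the object itself, e.g. `rdd.map(r => (r, null)).partitionBy(new HashPartitioner(16)).keys`, rather than constructing a ShuffledRDD directly:

```scala
// The case class from the thread: hashCode is delegated to crn,
// so two Results with the same crn hash identically.
case class Result(crn: Long, pid: Int, promoWk: Int, windowKey: Int, ipi: Double) {
  override def hashCode(): Int = crn.hashCode()
}

// Sketch of HashPartitioner's placement rule: hashCode mod numPartitions,
// adjusted so negative hash codes still map to a valid partition index.
def nonNegativeMod(x: Int, mod: Int): Int = {
  val raw = x % mod
  if (raw < 0) raw + mod else raw
}

// Two Results sharing crn = 42 but differing elsewhere:
val a = Result(42L, 1, 10, 100, 1.0)
val b = Result(42L, 2, 11, 101, 2.0)

// Both resolve to the same partition out of 16.
val pa = nonNegativeMod(a.hashCode, 16)
val pb = nonNegativeMod(b.hashCode, 16)
```

The tuple does cost a small per-record wrapper during the shuffle, but the dummy `null` value is cheap to serialize, and `.keys` drops it again afterwards.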