Turned out that it was sufficient to do repartitionAndSortWithinPartitions
... so far so good ;)
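For anyone reading this thread later, here is a minimal sketch of what "repartition and sort within partitions" amounts to. It is plain Java over in-memory lists, not Spark code, and it only models the semantics: each record goes to the partition its key maps to, then each partition is sorted by key. The names ShuffleSketch and repartitionAndSort and the example partitioner are hypothetical, made up for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.function.ToIntFunction;

// Plain-Java model of "repartition and sort within partitions"; no Spark involved.
class ShuffleSketch {

    // Route each record to the partition chosen by `partitioner`, then sort
    // every partition by key, so each partition is read in key order.
    static <K, V> List<List<Map.Entry<K, V>>> repartitionAndSort(
            List<Map.Entry<K, V>> records,
            int numPartitions,
            ToIntFunction<K> partitioner,
            Comparator<K> keyOrder) {
        List<List<Map.Entry<K, V>>> partitions = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            partitions.add(new ArrayList<>());
        }
        for (Map.Entry<K, V> record : records) {
            partitions.get(partitioner.applyAsInt(record.getKey())).add(record);
        }
        for (List<Map.Entry<K, V>> partition : partitions) {
            // List.sort is stable, so records with equal keys keep their input order.
            partition.sort(Map.Entry.comparingByKey(keyOrder));
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> records = List.of(
                Map.entry("b", 2), Map.entry("a", 1), Map.entry("c", 3), Map.entry("a", 4));
        // Hypothetical partitioner: keys starting with "a" go to partition 0, the rest to 1.
        List<List<Map.Entry<String, Integer>>> out = repartitionAndSort(
                records, 2, k -> k.startsWith("a") ? 0 : 1, Comparator.naturalOrder());
        System.out.println(out);
    }
}
```

With a partitioner that always returns 0 and numPartitions set to 1, every record lands in the single partition, which is the case Imran asks about further down.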
On Tue, May 5, 2015 at 9:45 AM Marius Danciu marius.dan...@gmail.com
wrote:
Hi Imran,
Yes that's what MyPartitioner does. I do see (using traces from
MyPartitioner) that the key is assigned to partition 0, but then I see
this record arriving in both YARN containers (I see it in the logs).
Basically I need to emulate a Hadoop map-reduce job in Spark and groupByKey
Hi Marius,
I am also a little confused -- are you saying that myPartitioner is
basically something like:
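import org.apache.spark.Partitioner
class MyPartitioner extends Partitioner {
  // a single partition; every key maps to it
  override def numPartitions: Int = 1
  override def getPartition(key: Any): Int = 0
}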
??
If so, I don't understand how you'd ever end up with data in two partitions.
Indeed, than
repartitionAndSortWithinPartitions to do it in one shot.
Thanks,
Silvio
From: Marius Danciu
Date: Tuesday, April 28, 2015 at 8:10 AM
To: user
Subject: Spark partitioning question
Hello all,
I have the following Spark (pseudo)code:
rdd = mapPartitionsWithIndex(...)
.mapPartitionsToPair(...)
.groupByKey()
.sortByKey(comparator)
.partitionBy(myPartitioner)
.mapPartitionsWithIndex(...)
.mapPartitionsToPair( *f* )
The input
From: Marius Danciu
Date: Tuesday, April 28, 2015 at 9:53 AM
To: Silvio Fiorito, user
Subject: Re: Spark partitioning question
Thank you Silvio,
I am aware of the groupBy limitations, and it is due to be replaced.
I did try repartitionAndSortWithinPartitions but then I end up with maybe too