Your data has only two keys, and essentially all values are assigned to
one of them. There is no better way to distribute those keys than the
one Spark already uses: the range partitioner can only split between
distinct keys, and you have only two.
What you have to do is sort and range-partition on different keys. Try
invoking sortBy(), which works on any RDD, not just pair RDDs; it will
take both parts of your tuple as the key to sort on. Alternatively, set
the whole tuple as the key manually and use a constant int or similar
as the value. A sketch of both options follows.
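Here is a minimal sketch of both options, reusing the same sc and n as
in your code quoted below (variable names are just illustrative):

val n = 20000
val pairs = sc.parallelize(2 to n).map(x => (x / n, x))

// Option 1: sortBy on the whole tuple. Range partitioning then splits
// on (key, value) boundaries, so the skewed first field no longer matters.
val sortedByTuple = pairs.sortBy(identity)

// Option 2: make the whole tuple the key with a dummy value, then
// sortByKey() range-partitions on the full tuple as well.
val sortedByKey = pairs.map(t => (t, 0)).sortByKey().keys

Either way the range partitioner samples the full tuples, whose
distribution is uniform here, so the part files should come out roughly
even and still globally sorted.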
On 06.09.16 at 10:13, Zhang, Liyun wrote:
Hi all:
I have a question about RDD.sortByKey
val n=20000
val sorted=sc.parallelize(2 to n).map(x=>(x/n,x)).sortByKey()
sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")
sc.parallelize(2 to n).map(x=>(x/n,x)) will generate pairs like
[(0,2),(0,3),…..,(0,19999),(1,20000)], so the key is heavily skewed.
The result of sortByKey is expected to be distributed evenly, but when
I view the result, I find that part-00000 is large and part-00001 is
small.
hadoop fs -ls /SkewedGroupByTest/
16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 root supergroup      0 2016-09-06 03:21 /SkewedGroupByTest/_SUCCESS
-rw-r--r--   1 root supergroup 188878 2016-09-06 03:21 /SkewedGroupByTest/part-00000
-rw-r--r--   1 root supergroup     10 2016-09-06 03:21 /SkewedGroupByTest/part-00001
How can I get the result distributed evenly? I don't need the keys
within each part-xxxxx file to be the same; I only need to guarantee
that the data is sorted across part-00000 ~ part-xxxxx.
Thanks for any help!
Kelly Zhang/Zhang,Liyun
Best Regards