Your data has only two keys, and essentially all values are assigned to one of them. sortByKey() range-partitions on the key, and all records with the same key must land in the same partition, so there is no better way to distribute these keys than the one Spark executes.

What you have to do is sort and range-partition on different keys. Try invoking sortBy() on a non-pair RDD: it will use the whole tuple as the key to sort on. Alternatively, set the tuple itself as the key manually and use a constant int (or similar) as the value. Both options are sketched below.
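A minimal sketch of both options, assuming the RDD from the snippet in your mail below; identity and the constant value 1 are illustrative choices, not requirements:

val n = 20000
val pairs = sc.parallelize(2 to n).map(x => (x / n, x))

// Option 1: sortBy on the whole tuple. Tuples compare element-wise, so the
// output is still ordered by the original key first, but the range
// partitioner now samples whole tuples and can split within the skewed key 0.
val sorted1 = pairs.sortBy(identity)

// Option 2: make the tuple itself the key and use a constant value, then
// drop the dummy value again after sorting.
val sorted2 = pairs.map(t => (t, 1)).sortByKey().map(_._1)

sorted1.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")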

On 06.09.16 at 10:13, Zhang, Liyun wrote:

Hi all:

  I have a question about RDD.sortByKey

val n = 20000
val sorted = sc.parallelize(2 to n).map(x => (x / n, x)).sortByKey()
sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")

sc.parallelize(2 to n).map(x => (x / n, x)) generates pairs like [(0,2),(0,3),…,(0,19999),(1,20000)]: integer division makes the key 0 for every x below 20000 and 1 only for x = 20000, so the keys are heavily skewed.

I expected the result of sortByKey to be distributed evenly across the output files, but when I view the result I find that part-00000 is large and part-00001 is small.

 hadoop fs -ls /SkewedGroupByTest/
16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 root supergroup        0 2016-09-06 03:21 /SkewedGroupByTest/_SUCCESS
-rw-r--r--   1 root supergroup   188878 2016-09-06 03:21 /SkewedGroupByTest/part-00000
-rw-r--r--   1 root supergroup       10 2016-09-06 03:21 /SkewedGroupByTest/part-00001

How can I get the result distributed evenly? I don't need all records with the same key to end up in the same part-xxxxx file; I only need the data to be globally sorted across part-00000 ~ part-0000N.

Thanks for any help!

Kelly Zhang/Zhang,Liyun

Best Regards

