Your data has only two keys, and essentially all values are assigned to
one of them. There is no better way to distribute those keys than the
one Spark already uses: the range partitioner can only split between
distinct keys, and you have only two.
What you have to do is sort and range-partition on different keys. Try
invoking sortBy(), which works on any RDD, not just pair RDDs; it will
take both parts of your tuple as the key to sort on. Alternatively, set
the whole tuple as the key manually and use a constant int or similar
as the value. A sketch of both options follows.
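Here is a minimal sketch of both options, reusing the same sc and n as
in your code quoted below (variable names are just illustrative):

val n = 20000
val pairs = sc.parallelize(2 to n).map(x => (x / n, x))

// Option 1: sortBy on the whole tuple. Range partitioning then splits
// on (key, value) boundaries, so the skewed first field no longer matters.
val sortedByTuple = pairs.sortBy(identity)

// Option 2: make the whole tuple the key with a dummy value, then
// sortByKey() range-partitions on the full tuple as well.
val sortedByKey = pairs.map(t => (t, 0)).sortByKey().keys

Either way the range partitioner samples the full tuples, whose
distribution is uniform here, so the part files should come out roughly
even and still globally sorted.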
On 06.09.16 at 10:13, Zhang, Liyun wrote:
Hi all:
I have a question about RDD.sortByKey
val n=20000
val sorted=sc.parallelize(2 to n).map(x=>(x/n,x)).sortByKey()
sorted.saveAsTextFile("hdfs://bdpe42:8020/SkewedGroupByTest")
sc.parallelize(2 to n).map(x=>(x/n,x)) will generate pairs like
[(0,2),(0,3),…..,(0,19999),(1,20000)], so the key is heavily skewed.
The result of sortByKey is expected to be distributed evenly, but when
I view the result, I find that part-00000 is large and part-00001 is
small.
hadoop fs -ls /SkewedGroupByTest/
16/09/06 03:24:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 3 items
-rw-r--r--   1 root supergroup      0 2016-09-06 03:21 /SkewedGroupByTest/_SUCCESS
-rw-r--r--   1 root supergroup 188878 2016-09-06 03:21 /SkewedGroupByTest/part-00000
-rw-r--r--   1 root supergroup     10 2016-09-06 03:21 /SkewedGroupByTest/part-00001
How can I get the result distributed evenly? I don't need the keys
within each part-xxxxx file to be the same; I only need to guarantee
that the data is sorted across part-00000 ~ part-xxxxx.
Thanks for any help!
Kelly Zhang/Zhang,Liyun
Best Regards