[ https://issues.apache.org/jira/browse/SPARK-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336128#comment-14336128 ]
Apache Spark commented on SPARK-5969: ------------------------------------- User 'foxik' has created a pull request for this issue: https://github.com/apache/spark/pull/4761 > The pyspark.rdd.sortByKey always fills only two partitions when > ascending=False. > -------------------------------------------------------------------------------- > > Key: SPARK-5969 > URL: https://issues.apache.org/jira/browse/SPARK-5969 > Project: Spark > Issue Type: Bug > Components: PySpark > Affects Versions: 1.2.1 > Environment: Linux, 64bit > Reporter: Milan Straka > > The pyspark.rdd.sortByKey always fills only two partitions when > ascending=False -- the first one and the last one. > Simple example sorting numbers 0..999 into 10 partitions in descending order: > {code} > sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=False, > numPartitions=10).glom().map(len).collect() > {code} > returns the following partition sizes: > {code} > [469, 0, 0, 0, 0, 0, 0, 0, 0, 531] > {code} > When ascending=True, all works as expected: > {code} > sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=True, > numPartitions=10).glom().map(len).collect() > Out: [116, 96, 100, 87, 132, 101, 101, 95, 87, 85] > {code} > The problem is caused by the following line 565 in rdd.py: > {code} > samples = sorted(samples, reverse=(not ascending), key=keyfunc) > {code} > That sorts the samples descending if ascending=False. Nevertheless samples > should always be in ascending order, because it is (after subsampling to > variable bounds) used in a bisect_left call: > {code} > def rangePartitioner(k): > p = bisect.bisect_left(bounds, keyfunc(k)) > if ascending: > return p > else: > return numPartitions - 1 - p > {code} > As you can see, rangePartitioner already handles the ascending=False by > itself, so the fix for the whole problem is really trivial: just change line > rdd.py:565 to > {{samples = sorted(samples, -reverse=(not ascending),- key=keyfunc)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org