[jira] [Commented] (SPARK-5969) The pyspark.rdd.sortByKey always fills only two partitions when ascending=False.

Apache Spark (JIRA) Tue, 24 Feb 2015 23:09:50 -0800

    [ 
https://issues.apache.org/jira/browse/SPARK-5969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14336128#comment-14336128
 ]


Apache Spark commented on SPARK-5969:
-------------------------------------

User 'foxik' has created a pull request for this issue:
https://github.com/apache/spark/pull/4761

> The pyspark.rdd.sortByKey always fills only two partitions when 
> ascending=False.
> --------------------------------------------------------------------------------
>
>                 Key: SPARK-5969
>                 URL: https://issues.apache.org/jira/browse/SPARK-5969
>             Project: Spark
>          Issue Type: Bug
>          Components: PySpark
>    Affects Versions: 1.2.1
>         Environment: Linux, 64bit
>            Reporter: Milan Straka
>
> The pyspark.rdd.sortByKey always fills only two partitions when 
> ascending=False -- the first one and the last one.
> Simple example sorting numbers 0..999 into 10 partitions in descending order:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=False, 
> numPartitions=10).glom().map(len).collect()
> {code}
> returns the following partition sizes:
> {code}
> [469, 0, 0, 0, 0, 0, 0, 0, 0, 531]
> {code}
> When ascending=True, all works as expected:
> {code}
> sc.parallelize(range(1000)).keyBy(lambda x:x).sortByKey(ascending=True, 
> numPartitions=10).glom().map(len).collect()
> Out: [116, 96, 100, 87, 132, 101, 101, 95, 87, 85]
> {code}
> The problem is caused by line 565 in rdd.py:
> {code}
>         samples = sorted(samples, reverse=(not ascending), key=keyfunc)
> {code}
> This sorts the samples in descending order when ascending=False. However, the 
> samples must always be in ascending order, because they are (after being 
> subsampled into the variable bounds) passed to a bisect_left call:
> {code}
>         def rangePartitioner(k):
>             p = bisect.bisect_left(bounds, keyfunc(k))
>             if ascending:
>                 return p
>             else:
>                 return numPartitions - 1 - p
> {code}
> As you can see, rangePartitioner already handles ascending=False by 
> itself, so the fix for the whole problem is really trivial: just drop the 
> reverse argument on line rdd.py:565, so that it reads
> {{samples = sorted(samples, key=keyfunc)}}
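
The effect described in the quoted report can be replayed without Spark. The sketch below is only an illustration under stated assumptions: it reuses the quoted rangePartitioner logic in plain Python, the helper name partition_sizes is hypothetical, and the evenly spaced boundaries stand in for the sampled bounds that rdd.py actually computes, so the exact counts differ from the ones in the report.

```python
import bisect

def partition_sizes(keys, bounds, num_partitions, ascending):
    """Count keys per partition using the partitioner logic quoted in the report.

    Hypothetical helper for illustration; not part of PySpark.
    """
    sizes = [0] * num_partitions
    for k in keys:
        # bisect_left assumes `bounds` is sorted in ascending order.
        p = bisect.bisect_left(bounds, k)
        if not ascending:
            p = num_partitions - 1 - p
        sizes[p] += 1
    return sizes

keys = list(range(1000))
num_partitions = 10

# Assumption: idealized, evenly spaced cut points instead of sampled ones.
ascending_bounds = [100 * i for i in range(1, num_partitions)]  # 100, 200, ..., 900
descending_bounds = sorted(ascending_bounds, reverse=True)      # what the buggy sort yields

# Descending bounds break bisect_left: only the first and last partitions fill.
buggy = partition_sizes(keys, descending_bounds, num_partitions, ascending=False)
# Ascending bounds plus the index flip give a balanced descending layout.
fixed = partition_sizes(keys, ascending_bounds, num_partitions, ascending=False)

print(buggy)  # [499, 0, 0, 0, 0, 0, 0, 0, 0, 501]
print(fixed)  # [99, 100, 100, 100, 100, 100, 100, 100, 100, 101]
```

With descending bounds the binary search inside bisect_left can only end at index 0 or at len(bounds), which is why exactly two partitions receive data.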



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
