[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1138 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1138#issuecomment-46650603 Merging this in master
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1138#issuecomment-46612985 Merged build finished. All automated tests passed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1138#issuecomment-46612986 All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15920/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1138#issuecomment-46608670 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1138#issuecomment-46608687 Merged build started.
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/1138#issuecomment-46608633 Good catch. Change looks good to me.
GitHub user aarondav opened a pull request: https://github.com/apache/spark/pull/1138

SPARK-2203: PySpark defaults to use same num reduce partitions as map side

For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), PySpark always assumes that the default reduce-side parallelism is ctx.defaultParallelism, a constant typically determined by the number of cores in the cluster. In contrast, Spark's Partitioner#defaultPartitioner uses the same number of reduce partitions as map partitions unless the defaultParallelism config is explicitly set. This tends to be a better default for avoiding OOMs, and should also be the behavior of PySpark.

JIRA: https://issues.apache.org/jira/browse/SPARK-2203

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/aarondav/spark pyfix

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/1138.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1138

commit 1bd5751fad08b0b2c69f9a0816b6b20fa06621fe
Author: Aaron Davidson
Date: 2014-06-19T19:43:50Z

    SPARK-2203: PySpark defaults to use same num reduce partitions as map partitions
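The defaulting rule described in the pull request can be sketched in plain Python, without a Spark dependency. This is only an illustration of the before/after behavior, not the patch itself: the function names, the `cluster_cores` parameter, and the use of a dict for the configuration are hypothetical, and the key `spark.default.parallelism` is assumed to be what the description means by "the defaultParallelism config".

```python
from typing import Mapping

# Assumed configuration key; the pull request text only says
# "the defaultParallelism config".
PARALLELISM_KEY = "spark.default.parallelism"


def old_default_reduce_partitions(conf: Mapping[str, str],
                                  num_map_partitions: int,
                                  cluster_cores: int) -> int:
    """Behavior before the change (hypothetical model).

    The reduce side always uses the context-wide default parallelism,
    which falls back to the number of cores when nothing is configured.
    The map-side partition count is ignored.
    """
    return int(conf.get(PARALLELISM_KEY, cluster_cores))


def new_default_reduce_partitions(conf: Mapping[str, str],
                                  num_map_partitions: int,
                                  cluster_cores: int) -> int:
    """Behavior after the change (hypothetical model).

    An explicitly configured parallelism still wins; otherwise the
    reduce side matches the number of map partitions.
    """
    if PARALLELISM_KEY in conf:
        return int(conf[PARALLELISM_KEY])
    return num_map_partitions


if __name__ == "__main__":
    # A large input split into 1000 map partitions on a 16-core cluster.
    unset = {}
    print(old_default_reduce_partitions(unset, 1000, 16))
    print(new_default_reduce_partitions(unset, 1000, 16))
```

With nothing configured, the old rule funnels 1000 map partitions into 16 reduce partitions, so each reducer holds far more data; this is the out-of-memory risk the description refers to. The new rule keeps 1000 reduce partitions unless the parallelism is set explicitly.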