[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...

2014-06-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1138


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...

2014-06-20 Thread rxin
Github user rxin commented on the pull request:

https://github.com/apache/spark/pull/1138#issuecomment-46650603
  
Merging this in master


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1138#issuecomment-46612985
  
Merged build finished. All automated tests passed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1138#issuecomment-46612986
  
All automated tests passed.
Refer to this link for build results: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/15920/


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1138#issuecomment-46608670
  
 Merged build triggered. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...

2014-06-19 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/1138#issuecomment-46608687
  
Merged build started. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...

2014-06-19 Thread mateiz
Github user mateiz commented on the pull request:

https://github.com/apache/spark/pull/1138#issuecomment-46608633
  
Good catch. Change looks good to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2203: PySpark defaults to use same num r...

2014-06-19 Thread aarondav
GitHub user aarondav opened a pull request:

https://github.com/apache/spark/pull/1138

SPARK-2203: PySpark defaults to use same num reduce partitions as map side

For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), 
PySpark will always assume that the default parallelism to use for the reduce 
side is ctx.defaultParallelism, which is a constant typically determined by the 
number of cores in cluster.

In contrast, Spark's Partitioner#defaultPartitioner will use the same 
number of reduce partitions as map partitions unless the defaultParallelism 
config is explicitly set. This tends to be a better default in order to avoid 
OOMs, and should also be the behavior of PySpark.

JIRA: https://issues.apache.org/jira/browse/SPARK-2203

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/aarondav/spark pyfix

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1138.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1138


commit 1bd5751fad08b0b2c69f9a0816b6b20fa06621fe
Author: Aaron Davidson 
Date:   2014-06-19T19:43:50Z

SPARK-2203: PySpark defaults to use same num reduce partitions as map 
partitions

For shuffle-based operators, such as rdd.groupBy() or rdd.sortByKey(), 
PySpark will always
assume that the default parallelism to use for the reduce side is 
ctx.defaultParallelism,
which is a constant typically determined by the number of cores in cluster.

In contrast, Spark's Partitioner#defaultPartitioner will use the same 
number of reduce
partitions as map partitions unless the defaultParallelism config is 
explicitly set. This
tends to be a better default in order to avoid OOMs, and should also be the 
behavior of PySpark.

JIRA: https://issues.apache.org/jira/browse/SPARK-2203




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---