[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-26 Thread scwf
Github user scwf closed the pull request at:

https://github.com/apache/spark/pull/3694


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-26 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-76318262
  
Sorry for the delay. My initial idea here was:

1. We can set spark.default.parallelism to control the number of shuffle 
partitions, but this option is not sensitive to the data size of the RDD: a 
job with 1T of input data gets x partitions, and the same job with 1K of 
input data also gets x partitions.

2. If we do not set spark.default.parallelism, an RDD uses its parent RDD's 
partition count as its own. In that case I found there can be many mini-tasks 
when the parent RDD has a large number of partitions, so I thought we could 
provide a ratio to control the number of shuffle partitions.

OK, I am closing this.
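
The two cases above, plus the proposed ratio, can be modelled in a few lines. This is only a sketch in plain Python, not Spark's code and not the code from this PR: `shuffle_partitions` and its parameters are hypothetical names, and the round-up rule and the floor of 1 are assumptions made for illustration.

```python
import math
from typing import Optional


def shuffle_partitions(parent_partitions: int,
                       default_parallelism: Optional[int] = None,
                       ratio: float = 1.0) -> int:
    """Hypothetical model of how a shuffle's partition count could be chosen.

    - If a fixed default parallelism is configured, it wins no matter how
      large or small the input is (case 1 above).
    - Otherwise the parent's partition count is inherited (case 2 above),
      scaled here by the proposed ratio. Rounding up and the minimum of 1
      are assumptions of this sketch.
    """
    if parent_partitions < 1:
        raise ValueError("parent_partitions must be >= 1")
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    if default_parallelism is not None:
        return default_parallelism
    return max(1, math.ceil(parent_partitions * ratio))


# Case 1: a fixed setting gives the same count for a large and a small input.
print(shuffle_partitions(8000, default_parallelism=200))  # 200
print(shuffle_partitions(2, default_parallelism=200))     # 200

# Case 2: inheriting 8000 parent partitions can mean many tiny tasks;
# a ratio below 1 shrinks the count in proportion to the parent.
print(shuffle_partitions(8000))              # 8000
print(shuffle_partitions(8000, ratio=0.25))  # 2000
```

With the ratio left at 1.0 the model behaves exactly like case 2, which matches the reviewers' observation that the default changes nothing.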






[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-25 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-76069557
  
I agree. @scwf would you mind closing this issue?





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-25 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75936767
  
I suggest we close this as I see arguments against, and no replies to those 
and/or the motivation for this change.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-24 Thread lianhuiwang
Github user lianhuiwang commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75903702
  
I do not think a global default ratio is right, because within a job the 
size of each stage is different, and the sizes do not steadily increase or 
decrease. If we instead define a partition ratio per shuffle operation, there 
is no difference between setting a ratio and setting a partition number for 
each shuffle operation.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75563929
  
I am also not clear this is a good thing. As a default, it doesn't change 
anything. There is probably not a globally correct ratio, even if it's not 1, 
but this implies there is. Is there evidence that a default besides 1.0 is 
better in most cases? The docs don't even suggest what the tradeoff is here.

Won't this potentially cause more shuffles when the ratio is not 1? I think 
this is something that must be set on a case-by-case basis, and that can 
already be done, even as a function of the parent RDD partitions, by the caller.

Can we elaborate on this or close it?





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-23 Thread srowen
Github user srowen commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75581853
  
You can implement this by expressing parallelism as a function of the 
parent RDD, right? Yes, you have to write the expression, but does an 
alternative multiplier arg do much better? Mostly I'm questioning a global 
setting.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-23 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75584312
  
@srowen good point.  I think a ratio argument is prettier than an 
expression, but arguably not enough to warrant clogging up the API.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-23 Thread sryza
Github user sryza commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75580971
  
In general, a fixed number of partitions is very difficult to work with 
when configuring a shuffle.  Suppose I have a job where I know a `flatMap` is 
going to blow up the size of my data by two.  If I want to minimize reduce-side 
spilling in a shuffle that comes after the `flatMap`, I want the parallelism of 
the shuffle to be double that of the input stage.  Because the size of my input 
data could change between different runs of my job, a ratio is a much more 
natural way to express my needs than a constant.

It's unclear to me whether a global default is useful at all, but a 
configurable parallelism ratio per shuffle operation definitely is.  (Systems 
like Crunch take this approach).
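
The arithmetic behind this argument can be worked through with a small sketch. It is plain Python rather than Spark code; `reduce_partition_mb`, the 128 MB per input partition, the even spread of data across partitions, and the fixed count of 200 are all assumptions chosen for illustration.

```python
def reduce_partition_mb(input_partitions: int,
                        mb_per_input_partition: float,
                        blowup: float,
                        shuffle_partitions: int) -> float:
    """Average data per reduce-side partition, assuming an even spread."""
    total_mb = input_partitions * mb_per_input_partition * blowup
    return total_mb / shuffle_partitions


BLOWUP = 2.0  # the flatMap doubles the data, as in the comment above
RATIO = 2.0   # shuffle parallelism = 2 x input-stage parallelism
FIXED = 200   # a constant partition count tuned for the smaller run

# Two runs of the same job with different input sizes.
for input_partitions in (100, 1000):
    by_ratio = reduce_partition_mb(input_partitions, 128.0, BLOWUP,
                                   int(input_partitions * RATIO))
    by_const = reduce_partition_mb(input_partitions, 128.0, BLOWUP, FIXED)
    print(input_partitions, by_ratio, by_const)
# prints:
# 100 128.0 128.0
# 1000 128.0 1280.0
```

Under these assumptions the ratio keeps each reduce partition at the same size across runs, while the constant lets it grow tenfold when the input grows tenfold. In PySpark a caller can already compute such a count at the call site, for example `rdd.reduceByKey(func, numPartitions=2 * rdd.getNumPartitions())`.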






[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2015-02-19 Thread andrewor14
Github user andrewor14 commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-75145453
  
Hi @scwf can you elaborate on the motivation for this?





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2014-12-15 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-67095261
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2014-12-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-67095615
  
[Test build #24474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24474/consoleFull) for PR 3694 at commit [`f21bfd4`](https://github.com/apache/spark/commit/f21bfd4904fa340099d190bd3963fefc79f0faa4).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2014-12-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-67102568
  
[Test build #24474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24474/consoleFull) for PR 3694 at commit [`f21bfd4`](https://github.com/apache/spark/commit/f21bfd4904fa340099d190bd3963fefc79f0faa4).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2014-12-15 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-67102573
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24474/
Test PASSed.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2014-12-14 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-66952763
  
Jenkins, retest this please.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2014-12-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-66952871
  
[Test build #24450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24450/consoleFull) for PR 3694 at commit [`f21bfd4`](https://github.com/apache/spark/commit/f21bfd4904fa340099d190bd3963fefc79f0faa4).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2014-12-14 Thread scwf
Github user scwf commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-66956061
  
Hmm, it seems there are some problems with 
`org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite`, and I 
noticed that other PRs also failed there.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2014-12-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-66956286
  
[Test build #24450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24450/consoleFull) for PR 3694 at commit [`f21bfd4`](https://github.com/apache/spark/commit/f21bfd4904fa340099d190bd3963fefc79f0faa4).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...

2014-12-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/3694#issuecomment-66956294
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24450/
Test FAILed.

