[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user scwf closed the pull request at: https://github.com/apache/spark/pull/3694 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-76318262 Sorry for the delay. My initial idea here was: 1. We can set spark.default.parallelism to control the number of shuffle partitions, but this config option is not sensitive to the data size of the RDD: a job with 1T of input data gets x partitions, and the same job with 1K of input data also gets x partitions. 2. If we do not set spark.default.parallelism, an RDD uses its parent RDD's partition count as its own, but I found that this can produce mini-tasks in some cases because the parent RDD has a large number of partitions. So I thought we could provide a ratio to control the shuffle partition count. OK, I am closing this.
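The two cases in this comment can be modelled roughly as follows. This is a minimal sketch of the behaviour as the comment describes it, not Spark's implementation: `shuffle_partition_count`, its parameters, and the way the ratio is applied are all hypothetical names and assumptions made for illustration.

```python
from typing import Optional


def shuffle_partition_count(parent_partitions: int,
                            default_parallelism: Optional[int] = None,
                            ratio: float = 1.0) -> int:
    """Hypothetical model of the partition choice described above (not Spark's API)."""
    if default_parallelism is not None:
        # Case 1: a fixed setting wins, regardless of how much data there is.
        return default_parallelism
    # Case 2: inherit the parent's partition count, scaled by the proposed ratio.
    # Applying the ratio this way is an assumption about the proposal.
    return max(1, int(parent_partitions * ratio))


# Case 1: the same count for a large and a small input.
print(shuffle_partition_count(8000, default_parallelism=200))
print(shuffle_partition_count(8, default_parallelism=200))
# Case 2 with ratio 1.0: a parent with many partitions yields as many (possibly tiny) tasks.
print(shuffle_partition_count(8000))
# Case 2 with a ratio below 1.0: fewer, larger shuffle partitions.
print(shuffle_partition_count(8000, ratio=0.25))
```

In this model the fixed setting ignores input size entirely, while the inherited count tracks the parent and can only be adjusted through the ratio.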
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-76069557 I agree. @scwf would you mind closing this issue?
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75936767 I suggest we close this as I see arguments against, and no replies to those and/or the motivation for this change.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user lianhuiwang commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75903702 I do not think a global default ratio is right, because within a job the size of each stage is different, and the sizes do not consistently increase or decrease from stage to stage. If we define a partition ratio per shuffle operation, there is no difference between setting a ratio and setting a partition number per shuffle operation.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75563929 I am also not clear this is a good thing. As a default, it doesn't change anything. There is probably not a globally correct ratio, even if it's not 1, but this implies there is. Is there evidence that a default besides 1.0 is better in most cases? The docs don't even suggest what the tradeoff is here. Won't this potentially cause more shuffles when the ratio is not 1? I think this is something that must be set on a case-by-case basis, and that can already be done, even as a function of the parent RDD partitions, by the caller. Can we elaborate on this or close it?
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75581853 You can implement this by expressing parallelism as a function of the parent RDD, right? Yeah, you have to write the expression, but does an alternative multiplier arg do much better? Yeah, mostly I'm questioning a global setting.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75584312 @srowen good point. I think a ratio argument is prettier than an expression, but arguably not enough to warrant clogging up the API.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75580971 In general, a fixed number of partitions is very difficult to work with when configuring a shuffle. Suppose I have a job where I know a `flatMap` is going to blow up the size of my data by two. If I want to minimize reduce-side spilling in a shuffle that comes after the `flatMap`, I want the parallelism of the shuffle to be double that of the input stage. Because the size of my input data could change between different runs of my job, a ratio is a much more natural way to express my needs than a constant. It's unclear to me whether a global default is useful at all, but a configurable parallelism ratio per shuffle operation definitely is. (Systems like Crunch take this approach).
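The `flatMap` example above comes down to simple arithmetic on the input stage's partition count. A minimal sketch, assuming the caller computes the count itself: `scaled_partitions` is a hypothetical helper, not part of Spark, and rounding up is a choice made here for illustration.

```python
import math


def scaled_partitions(parent_partitions: int, ratio: float) -> int:
    """Return a shuffle partition count proportional to the parent's (hypothetical helper)."""
    if parent_partitions < 1:
        raise ValueError("parent_partitions must be at least 1")
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    # Round up so a fractional result never drops below one partition.
    return max(1, math.ceil(parent_partitions * ratio))


# A flatMap that doubles the data: ask for twice the input stage's partitions.
small_run = scaled_partitions(100, 2.0)   # a run whose input has 100 partitions
large_run = scaled_partitions(4000, 2.0)  # the same job on a larger input
print(small_run, large_run)
```

With PySpark, the result could be passed by the caller as the existing `numPartitions` argument, for example `pairs.reduceByKey(add, numPartitions=scaled_partitions(pairs.getNumPartitions(), 2.0))`, where `pairs` and `add` stand in for the job's own RDD and reduce function. The count then follows the input size from run to run, whereas a constant does not.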
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user andrewor14 commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-75145453 Hi @scwf can you elaborate on the motivation for this?
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-67095261 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-67095615 [Test build #24474 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24474/consoleFull) for PR 3694 at commit [`f21bfd4`](https://github.com/apache/spark/commit/f21bfd4904fa340099d190bd3963fefc79f0faa4). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-67102568 [Test build #24474 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24474/consoleFull) for PR 3694 at commit [`f21bfd4`](https://github.com/apache/spark/commit/f21bfd4904fa340099d190bd3963fefc79f0faa4). * This patch **passes all tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-67102573 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24474/ Test PASSed.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-66952763 Jenkins, retest this please.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-66952871 [Test build #24450 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24450/consoleFull) for PR 3694 at commit [`f21bfd4`](https://github.com/apache/spark/commit/f21bfd4904fa340099d190bd3963fefc79f0faa4). * This patch merges cleanly.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user scwf commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-66956061 Hmm, it seems there are some problems with ```org.apache.spark.streaming.rdd.WriteAheadLogBackedBlockRDDSuite```, and I noticed that other PRs also failed there.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-66956286 [Test build #24450 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/24450/consoleFull) for PR 3694 at commit [`f21bfd4`](https://github.com/apache/spark/commit/f21bfd4904fa340099d190bd3963fefc79f0faa4). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-4845][Core] Adding a parallelismRatio t...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3694#issuecomment-66956294 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/24450/ Test FAILed.