[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-12-17 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-67382177 hi @erikerlandson, thanks for working on this. It would be great to have a solution to this long running problem. Since it looks like there is still some work to be

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-12-17 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-67382250 Do please reopen though once you have something that is passing tests :) --- If your project is set up for it, you can reply to this email and have your reply appear

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-12-17 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/3079

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-20 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-63881800 For reference, this other issue has some overlap: https://issues.apache.org/jira/browse/SPARK-4514

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-09 Thread squito
Github user squito commented on a diff in the pull request: https://github.com/apache/spark/pull/3079#discussion_r20062337 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -113,8 +117,12 @@ class RangePartitioner[K : Ordering : ClassTag, V]( private

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61675397 [Test build #22880 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22880/consoleFull) for PR 3079 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61675448 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61675446 [Test build #22880 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22880/consoleFull) for PR 3079 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61704937 [Test build #22892 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22892/consoleFull) for PR 3079 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61719975 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61719969 [Test build #22892 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22892/consoleFull) for PR 3079 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-61508261 @marmbrus, FWIW, the `correlationoptimizer14` test appears to be working for me. I ran it using: `env _RUN_SQL_TESTS=true _SQL_TESTS_ONLY=true ./dev/run-tests

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread erikerlandson
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/3079 [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation. You can merge this pull request into a Git repository by running: $ git pull

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61555496 Reboot of #1689

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61556278 [Test build #22828 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22828/consoleFull) for PR 3079 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61556289 @marmbrus, @scwf, FWIW, the `correlationoptimizer14` test appears to be working for me. I ran it using: `env _RUN_SQL_TESTS=true _SQL_TESTS_ONLY=true

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61556754 @erikerlandson I think you also need -Phive for the tests to run. It is possible some other things changed (or even that that test case changed with the upgrade to

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61565401 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-11-03 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3079#issuecomment-61565392 [Test build #22828 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22828/consoleFull) for PR 3079 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-28 Thread marmbrus
Github user marmbrus commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-57106886 Since this PR was merged the correlationoptimizer14 test has been hanging. We might want to consider rolling back. You can reproduce the problem as follows: `sbt

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-28 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-57108427 I reverted this commit. @erikerlandson mind taking a look at this problem?

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-28 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-57110142 @rxin @marmbrus I will check it out

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-27 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-57043705 Actually I looked at it again. I don't think it would block the scheduler because we compute partitions outside the scheduler thread. This approach looks good to me!

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-27 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1689#discussion_r18122197 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V]( }

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-27 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1689#discussion_r18122212 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V]( private var

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-27 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1689#discussion_r18122214 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V]( private var

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-27 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-57043822 @erikerlandson I'm going to merge this first. Maybe we can do the cleanup later.

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-27 Thread asfgit
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1689

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-27 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-57043862 BTW one thing that would be great to add is a test that makes sure we don't block the main dag scheduler thread. The reason I think we don't block is that we call

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-27 Thread markhamstra
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-57043930 Have either of you thought about how to coordinate this with Josh's work on SPARK-3626? https://github.com/apache/spark/pull/2482

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-16 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-55797086 Yea I don't think we need to fully solve 3 here. My main concern with this set of changes is 2, since a single badly behaved RDD can potentially block the

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-16 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-55805772 So far the best idea I have for (2) is to set some kind of time-out on the evaluation. The bound computation uses subsampling that will (when all goes well) cap

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-15 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-55627362 Hi @rxin, 1) SimpleFutureAction is still referred to in submitJob method, but that doesn't appear to be invoked anywhere. I was reluctant to get rid of

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-15 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-55628401 Or, maybe just look into playing the same game with the cogrouped RDDs that I did with sortByKey. Don't get into invoking `defaultPartitioner` until somebody

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-12 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-55456438 Jenkins, test this please.

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-55457077 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20236/consoleFull) for PR 1689 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-12 Thread rxin
Github user rxin commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-55458226 @erikerlandson thanks for looking at this. A few questions: 1. After this pull request, does anything still use SimpleFutureAction? 2. If I understand

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-12 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-55464403 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20236/consoleFull) for PR 1689 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-09-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-54694535 Can one of the admins verify this patch?

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-52397817 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18675/consoleFull) for PR 1689 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-16 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-52400243 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18675/consoleFull) for PR 1689 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-52336221 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18615/consoleFull) for PR 1689 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread erikerlandson
Github user erikerlandson commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-52336202 Latest push updates RangePartition sampling job to be async, and updates the async action functions so that they will properly enclose the sampling job induced by

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread markhamstra
Github user markhamstra commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-52339006 Excellent! I'll try to find some time to review this soon.

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-15 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-52342401 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18615/consoleFull) for PR 1689 at commit

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-07 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1689#discussion_r15919599 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V]( private var

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-07 Thread erikerlandson
Github user erikerlandson commented on a diff in the pull request: https://github.com/apache/spark/pull/1689#discussion_r15931609 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -113,8 +113,12 @@ class RangePartitioner[K : Ordering : ClassTag, V](

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1689#discussion_r15900503 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -113,8 +113,13 @@ class RangePartitioner[K : Ordering : ClassTag, V]( private var

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-51421389 QA tests have started for PR 1689. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/18089/consoleFull

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-51424177 QA results for PR 1689:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-08-06 Thread rxin
Github user rxin commented on a diff in the pull request: https://github.com/apache/spark/pull/1689#discussion_r15919352 --- Diff: core/src/main/scala/org/apache/spark/Partitioner.scala --- @@ -222,7 +228,8 @@ class RangePartitioner[K : Ordering : ClassTag, V]( }

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-50765803 Can one of the admins verify this patch?

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread erikerlandson
GitHub user erikerlandson opened a pull request: https://github.com/apache/spark/pull/1689 [SPARK-1021] Defer the data-driven computation of partition bounds in sortByKey() until evaluation. You can merge this pull request into a Git repository by running: $ git pull

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread JoshRosen
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-50824343 Jenkins, this is ok to test.

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-50824621 QA tests have started for PR 1689. This patch merges cleanly. View progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17611/consoleFull

[GitHub] spark pull request: [SPARK-1021] Defer the data-driven computation...

2014-07-31 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1689#issuecomment-50829158 QA results for PR 1689:
- This patch PASSES unit tests.
- This patch merges cleanly
- This patch adds no public classes
For more information see test