[GitHub] spark issue #13482: SPARK-15725: Ensure ApplicationMaster sleeps for the min...

2016-06-08 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/13482 > So why don't we just take out the notifyAll call when we get a GetExecutorLossReason? If that helps, it's OK too. It would probably slightly increase the time it takes for the driver to know

2016-06-06 Thread tgravescs
Github user tgravescs commented on the issue: https://github.com/apache/spark/pull/13482 So why don't we just take out the notifyAll call when we get a GetExecutorLossReason? We can add a parameter to resetAllocatorInterval() so that it resets the interval but doesn't call the
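
The suggestion above could be sketched roughly as follows. This is a hypothetical Java illustration, not Spark's actual code: the class `AllocatorIntervalSketch`, the `wakeReporter` flag, the doubling backoff, and the interval values are all assumptions made for the example. Only the names `resetAllocatorInterval` and `notifyAll` come from the comment itself.

```java
// Hypothetical sketch: names and the backoff policy are illustrative assumptions,
// not taken from Spark's ApplicationMaster.
final class AllocatorIntervalSketch {
    private final Object lock = new Object();
    private final long initialIntervalMs;
    private final long maxIntervalMs;
    private long nextIntervalMs;
    private int wakeups = 0; // counts notifyAll calls, for demonstration only

    AllocatorIntervalSketch(long initialIntervalMs, long maxIntervalMs) {
        this.initialIntervalMs = initialIntervalMs;
        this.maxIntervalMs = maxIntervalMs;
        this.nextIntervalMs = initialIntervalMs;
    }

    // Resets the interval; wakes any waiting thread only when wakeReporter is true.
    void resetAllocatorInterval(boolean wakeReporter) {
        synchronized (lock) {
            nextIntervalMs = initialIntervalMs;
            if (wakeReporter) {
                wakeups++;
                lock.notifyAll();
            }
        }
    }

    // Assumed backoff: doubles the interval up to the configured cap.
    void backOff() {
        synchronized (lock) {
            nextIntervalMs = Math.min(nextIntervalMs * 2, maxIntervalMs);
        }
    }

    long nextIntervalMs() {
        synchronized (lock) {
            return nextIntervalMs;
        }
    }

    int wakeups() {
        synchronized (lock) {
            return wakeups;
        }
    }

    public static void main(String[] args) {
        AllocatorIntervalSketch sketch = new AllocatorIntervalSketch(200, 3000);
        sketch.backOff();
        sketch.backOff();
        // A loss-reason request would reset the interval without waking the sleeper.
        sketch.resetAllocatorInterval(false);
        System.out.println(sketch.nextIntervalMs() + " ms, wakeups=" + sketch.wakeups());
    }
}
```

With the flag set to `false`, the interval is reset without interrupting the current sleep, so a burst of such requests would not by itself cut the sleep short. Whether that matches the eventual patch is not shown in this thread.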

2016-06-06 Thread andrewor14
Github user andrewor14 commented on the issue: https://github.com/apache/spark/pull/13482 I think this is important to fix for 2.0 but I personally found the changes in this patch rather confusing. If there's a simpler workaround we could do (such as the solution I suggested, if that

2016-06-06 Thread andrewor14
Github user andrewor14 commented on the issue: https://github.com/apache/spark/pull/13482 @rdblue the reason for the hang is the `GetExecutorLossReason` right? AFAIK we send one to the AM every time an executor dies. What if we just keep a set of executor IDs we're waiting to kill on

2016-06-06 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/13482 Seems like an ok workaround to me; we really should spend some time looking at removing some of those locks and avoiding `askWithRetry` (which shouldn't ever be needed with a reliable RPC library

2016-06-06 Thread vanzin
Github user vanzin commented on the issue: https://github.com/apache/spark/pull/13482 @rdblue could you follow the usual convention in the pr title (`[SPARK-15725][yarn] Blah`)? thx --- If your project is set up for it, you can reply to this email and have your reply appear on

2016-06-04 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/13482 cc @vanzin @tgravescs

2016-06-02 Thread rdblue
Github user rdblue commented on the issue: https://github.com/apache/spark/pull/13482 @yhuai, @rxin, we should consider this work-around for 2.0 if it isn't too late. We see a lot of apps fail because the driver and AM lock up.

2016-06-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13482 Merged build finished. Test PASSed.

2016-06-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue: https://github.com/apache/spark/pull/13482 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/59895/

2016-06-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13482 **[Test build #59895 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59895/consoleFull)** for PR 13482 at commit

2016-06-02 Thread SparkQA
Github user SparkQA commented on the issue: https://github.com/apache/spark/pull/13482 **[Test build #59895 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/59895/consoleFull)** for PR 13482 at commit