Repository: spark
Updated Branches:
  refs/heads/master 496d2a2b4 -> 9412547e7


[SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to fix 
failure on slow machines

## What changes were proposed in this pull request?

I'm seeing several PR builder builds fail after 
https://github.com/apache/spark/pull/11725/files. Example:

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console

```
testCommunication(org.apache.spark.launcher.LauncherServerSuite)  Time elapsed: 
0.023 sec  <<< FAILURE!
java.lang.AssertionError: expected:<app-id> but was:<null>
        at 
org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93)
```

However, other builds pass this same test, including the test when run locally 
and on the Jenkins PR builder. The failure itself concerns a change to how the 
test waits on a condition, and the wait can time out; therefore I think this is 
due to fast/slow machine differences.

This is an attempt at a hot fix; it's a little hard to verify since locally and 
on the PR builder, it passes anyway. The change itself should be harmless 
anyway.

Why didn't this happen before, if the new logic was supposed to be equivalent 
to the old? I think this is the sequence:

- First attempt to acquire semaphore for 10ms actually silently times out
- The changed being waited for happens just after that, a bit too late
- Assertion passes since condition became true just in time
- `release()` fires from the listener
- Next `tryAcquire` however immediately succeeds because the first `tryAcquire` 
didn't acquire anything, but its subsequent condition is not yet true; this 
would explain why the second one always fails

Versus the original using `notifyAll()`, there's a small difference: 
`wait()`-ing after `notifyAll()` just results in another wait; it doesn't make 
it return immediately. So this was a tiny latent issue that was masked by the 
semantics. Now the test asserts that the event actually happened (semaphore was 
acquired). (The timeout is still here to prevent the test from hanging forever, 
and to detect really slow response.) The timeout is increased to a second to 
allow plenty of time anyway.

## How was this patch tested?

Jenkins tests

Author: Sean Owen <so...@cloudera.com>

Closes #11763 from srowen/SPARK-13823.3.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/9412547e
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/9412547e
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/9412547e

Branch: refs/heads/master
Commit: 9412547e7a58c40e7be5acacaf15cc48057cb18a
Parents: 496d2a2
Author: Sean Owen <so...@cloudera.com>
Authored: Wed Mar 16 16:11:24 2016 +0000
Committer: Sean Owen <so...@cloudera.com>
Committed: Wed Mar 16 16:11:24 2016 +0000

----------------------------------------------------------------------
 .../java/org/apache/spark/launcher/LauncherServerSuite.java    | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/9412547e/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
----------------------------------------------------------------------
diff --git 
a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java 
b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
index 13f72b7..5bf2bab 100644
--- a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
+++ b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
@@ -83,17 +83,17 @@ public class LauncherServerSuite extends BaseSuite {
 
       client = new TestClient(s);
       client.send(new Hello(handle.getSecret(), "1.4.0"));
-      semaphore.tryAcquire(10, TimeUnit.MILLISECONDS);
+      assertTrue(semaphore.tryAcquire(1, TimeUnit.SECONDS));
 
       // Make sure the server matched the client to the handle.
       assertNotNull(handle.getConnection());
 
       client.send(new SetAppId("app-id"));
-      semaphore.tryAcquire(10, TimeUnit.MILLISECONDS);
+      assertTrue(semaphore.tryAcquire(1, TimeUnit.SECONDS));
       assertEquals("app-id", handle.getAppId());
 
       client.send(new SetState(SparkAppHandle.State.RUNNING));
-      semaphore.tryAcquire(10, TimeUnit.MILLISECONDS);
+      assertTrue(semaphore.tryAcquire(1, TimeUnit.SECONDS));
       assertEquals(SparkAppHandle.State.RUNNING, handle.getState());
 
       handle.stop();


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org
For additional commands, e-mail: commits-h...@spark.apache.org

Reply via email to