[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-22 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r163119635
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {
 
   @Override
   public synchronized void kill() {
-disconnect();
-if (childProc != null) {
-  if (childProc.isAlive()) {
-childProc.destroyForcibly();
+if (!isDisposed()) {
+  setState(State.KILLED);
--- End diff --

ok makes sense.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-22 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r163014278
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {
 
   @Override
   public synchronized void kill() {
-disconnect();
-if (childProc != null) {
-  if (childProc.isAlive()) {
-childProc.destroyForcibly();
+if (!isDisposed()) {
+  setState(State.KILLED);
--- End diff --

Then the comment I made in the previous PR applies. Closing the socket / 
killing the child can have other implications (like changing the state) and 
it's easier to reason about what happens if the desired state change happens 
first.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20297


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162851582
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {
 
   @Override
   public synchronized void kill() {
-disconnect();
-if (childProc != null) {
-  if (childProc.isAlive()) {
-childProc.destroyForcibly();
+if (!isDisposed()) {
+  setState(State.KILLED);
--- End diff --

Even the order doesn't matter, I think it's more conventional to set the 
state at the end.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-19 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162694343
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException 
{
   timeout.cancel();
 }
 close();
+if (handle != null) {
+  handle.dispose();
+}
   } finally {
 timeoutTimer.purge();
   }
 }
 
 @Override
 public void close() throws IOException {
+  if (!isOpen()) {
+return;
+  }
+
   synchronized (clients) {
 clients.remove(this);
   }
-  super.close();
-  if (handle != null) {
-if (!handle.getState().isFinal()) {
-  LOG.log(Level.WARNING, "Lost connection to spark application.");
-  handle.setState(SparkAppHandle.State.LOST);
-}
-handle.disconnect();
--- End diff --

See https://github.com/apache/spark/pull/20297#pullrequestreview-89568079


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-19 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162694174
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException 
{
   timeout.cancel();
 }
 close();
+if (handle != null) {
+  handle.dispose();
+}
   } finally {
 timeoutTimer.purge();
   }
 }
 
 @Override
 public void close() throws IOException {
+  if (!isOpen()) {
+return;
+  }
+
   synchronized (clients) {
 clients.remove(this);
   }
-  super.close();
-  if (handle != null) {
-if (!handle.getState().isFinal()) {
-  LOG.log(Level.WARNING, "Lost connection to spark application.");
-  handle.setState(SparkAppHandle.State.LOST);
-}
-handle.disconnect();
+
+  synchronized (this) {
+super.close();
+notifyAll();
--- End diff --

See L239.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-19 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162693890
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java ---
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws 
IOException {
   }
 
   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
--- End diff --

We never *needed* to change it, but the extra code wasn't doing anything 
useful, so I chose the simpler version.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-19 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162693731
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {
 
   @Override
   public synchronized void kill() {
-disconnect();
-if (childProc != null) {
-  if (childProc.isAlive()) {
-childProc.destroyForcibly();
+if (!isDisposed()) {
+  setState(State.KILLED);
--- End diff --

None of the calls below should raise exceptions. Even the socket close is 
wrapped in a try..catch.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-18 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162550217
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {
 
   @Override
   public synchronized void kill() {
-disconnect();
-if (childProc != null) {
-  if (childProc.isAlive()) {
-childProc.destroyForcibly();
+if (!isDisposed()) {
+  setState(State.KILLED);
--- End diff --

+1,I have the same question in last review. We should figure it out.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-18 Thread gengliangwang
Github user gengliangwang commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162548031
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java ---
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws 
IOException {
   }
 
   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
 if (!closed) {
--- End diff --

=> isOpen


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162525240
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException 
{
   timeout.cancel();
 }
 close();
+if (handle != null) {
+  handle.dispose();
+}
   } finally {
 timeoutTimer.purge();
   }
 }
 
 @Override
 public void close() throws IOException {
+  if (!isOpen()) {
+return;
+  }
+
   synchronized (clients) {
 clients.remove(this);
   }
-  super.close();
-  if (handle != null) {
-if (!handle.getState().isFinal()) {
-  LOG.log(Level.WARNING, "Lost connection to spark application.");
-  handle.setState(SparkAppHandle.State.LOST);
-}
-handle.disconnect();
+
+  synchronized (this) {
+super.close();
+notifyAll();
--- End diff --

why `notifyAll`? who might be waiting?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162525142
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException 
{
   timeout.cancel();
 }
 close();
+if (handle != null) {
+  handle.dispose();
+}
   } finally {
 timeoutTimer.purge();
   }
 }
 
 @Override
 public void close() throws IOException {
+  if (!isOpen()) {
+return;
+  }
+
   synchronized (clients) {
 clients.remove(this);
   }
-  super.close();
-  if (handle != null) {
-if (!handle.getState().isFinal()) {
-  LOG.log(Level.WARNING, "Lost connection to spark application.");
-  handle.setState(SparkAppHandle.State.LOST);
-}
-handle.disconnect();
--- End diff --

why we don't disconnect now?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162524725
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java ---
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws 
IOException {
   }
 
   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
--- End diff --

do we still need to change this method?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162524615
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {
 
   @Override
   public synchronized void kill() {
-disconnect();
-if (childProc != null) {
-  if (childProc.isAlive()) {
-childProc.destroyForcibly();
+if (!isDisposed()) {
+  setState(State.KILLED);
--- End diff --

so we should set the state to `KILLED` once the `kill` method is called? 
Even the code below fails(throw exception), the state should still be `KILLED`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-17 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162159354
  
--- Diff: 
launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException 
{
   timeout.cancel();
 }
 close();
+if (handle != null) {
--- End diff --

This is the fix for the new race described in the summary (the code is 
moved here from below).

This changes behavior slightly: the handle now waits for the child process 
/ thread to finish before disposing itself, whereas before that would happen as 
soon as the connection between the child process / thread was closed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-17 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20297#discussion_r162159436
  
--- Diff: 
resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala
 ---
@@ -381,7 +381,9 @@ private object YarnClusterDriver extends Logging with 
Matchers {
 
   // Verify that the config archive is correctly placed in the 
classpath of all containers.
   val confFile = "/" + Client.SPARK_CONF_FILE
-  assert(getClass().getResource(confFile) != null)
+  if (conf.getOption(SparkLauncher.DEPLOY_MODE) == Some("cluster")) {
--- End diff --

This is the bug in the YARN tests that the fix uncovered.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...

2018-01-17 Thread vanzin
GitHub user vanzin opened a pull request:

https://github.com/apache/spark/pull/20297

[SPARK-23020][CORE] Fix races in launcher code, test.

The race in the code is because the handle might update
its state to the wrong state if the connection handling
thread is still processing incoming data; so the handle
needs to wait for the connection to finish up before
checking the final state.

The race in the test is because when waiting for a handle
to reach a final state, the waitFor() method needs to wait
until all handle state is updated (which also includes
waiting for the connection thread above to finish).
Otherwise, waitFor() may return too early, which would cause
a bunch of different races (like the listener not being yet
notified of the state change, or being in the middle of
being notified, or the handle not being properly disposed
and causing postChecks() to assert).

On top of that I found, by code inspection, a couple of
potential races that could make a handle end up in the
wrong state when being killed.

The original version of this fix introduced the flipped
version of the first race described above; the connection
closing might override the handle state before the
handle might have a chance to do cleanup. The fix there
is to only dispose of the handle from the connection
when there is an error, and let the handle dispose
itself in the normal case.

The fix also cause a bug in YarnClusterSuite to be surfaced;
the code was checking for a file in the classpath that was
not expected to be there in client mode. Because of the above
issues, the error was not propagating correctly and the (buggy)
test was incorrectly passing.

Tested by running the existing unit tests a lot (and not
seeing the errors I was seeing before).

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/vanzin/spark SPARK-23020

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20297.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20297


commit 8bde21a1cbdab3c49a85c1da960f4d9c7bf70064
Author: Marcelo Vanzin 
Date:   2018-01-16T06:40:44Z

[SPARK-23020][CORE] Fix races in launcher code, test.

The race in the code is because the handle might update
its state to the wrong state if the connection handling
thread is still processing incoming data; so the handle
needs to wait for the connection to finish up before
checking the final state.

The race in the test is because when waiting for a handle
to reach a final state, the waitFor() method needs to wait
until all handle state is updated (which also includes
waiting for the connection thread above to finish).
Otherwise, waitFor() may return too early, which would cause
a bunch of different races (like the listener not being yet
notified of the state change, or being in the middle of
being notified, or the handle not being properly disposed
and causing postChecks() to assert).

On top of that I found, by code inspection, a couple of
potential races that could make a handle end up in the
wrong state when being killed.

The original version of this fix introduced the flipped
version of the first race described above; the connection
closing might override the handle state before the
handle might have a chance to do cleanup. The fix there
is to only dispose of the handle from the connection
when there is an error, and let the handle dispose
itself in the normal case.

The fix also cause a bug in YarnClusterSuite to be surfaced;
the code was checking for a file in the classpath that was
not expected to be there in client mode. Because of the above
issues, the error was not propagating correctly and the (buggy)
test was incorrectly passing.

Tested by running the existing unit tests a lot (and not
seeing the errors I was seeing before).




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org