[GitHub] spark pull request #20297: [SPARK-23020][CORE] Fix races in launcher code, t...
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r163119635

--- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {

   @Override
   public synchronized void kill() {
-    disconnect();
-    if (childProc != null) {
-      if (childProc.isAlive()) {
-        childProc.destroyForcibly();
+    if (!isDisposed()) {
+      setState(State.KILLED);
--- End diff --

OK, makes sense.

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r163014278

--- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {

   @Override
   public synchronized void kill() {
-    disconnect();
-    if (childProc != null) {
-      if (childProc.isAlive()) {
-        childProc.destroyForcibly();
+    if (!isDisposed()) {
+      setState(State.KILLED);
--- End diff --

Then the comment I made in the previous PR applies. Closing the socket / killing the child can have other implications (like changing the state), and it's easier to reason about what happens if the desired state change happens first.
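The ordering being argued for here — commit the terminal state first, then perform cleanup, all under one lock — can be sketched roughly as follows. This is a hypothetical class for illustration, not Spark's actual ChildProcAppHandle; the field and method names are made up.

```java
// Sketch of "record the desired state before cleanup": even if the
// teardown steps had side effects on state, they could not override
// the terminal state a reader observes, because everything happens
// under the same monitor and the state is written first.
public class HandleSketch {
  public enum State { RUNNING, KILLED }

  private State state = State.RUNNING;
  private boolean disposed = false;
  private Process childProc = null; // may be null if the child never started

  public synchronized void kill() {
    if (!disposed) {
      // State change happens first; nothing below can un-kill the handle.
      state = State.KILLED;
      disposed = true;
      if (childProc != null && childProc.isAlive()) {
        childProc.destroyForcibly();
      }
      childProc = null;
    }
  }

  public synchronized State getState() {
    return state;
  }

  public static void main(String[] args) {
    HandleSketch h = new HandleSketch();
    h.kill();
    h.kill(); // second call is a no-op thanks to the disposed guard
    System.out.println(h.getState());
  }
}
```

Because `kill()` is `synchronized`, a concurrent `getState()` either sees the old state with no cleanup started, or the final state, never an in-between.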
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/20297
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162851582

--- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {

   @Override
   public synchronized void kill() {
-    disconnect();
-    if (childProc != null) {
-      if (childProc.isAlive()) {
-        childProc.destroyForcibly();
+    if (!isDisposed()) {
+      setState(State.KILLED);
--- End diff --

Even if the order doesn't matter, I think it's more conventional to set the state at the end.
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162694343

--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException {
           timeout.cancel();
         }
         close();
+        if (handle != null) {
+          handle.dispose();
+        }
       } finally {
         timeoutTimer.purge();
       }
     }

     @Override
     public void close() throws IOException {
+      if (!isOpen()) {
+        return;
+      }
+
       synchronized (clients) {
         clients.remove(this);
       }
-      super.close();
-      if (handle != null) {
-        if (!handle.getState().isFinal()) {
-          LOG.log(Level.WARNING, "Lost connection to spark application.");
-          handle.setState(SparkAppHandle.State.LOST);
-        }
-        handle.disconnect();
--- End diff --

See https://github.com/apache/spark/pull/20297#pullrequestreview-89568079
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162694174

--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException {
           timeout.cancel();
         }
         close();
+        if (handle != null) {
+          handle.dispose();
+        }
       } finally {
         timeoutTimer.purge();
       }
     }

     @Override
     public void close() throws IOException {
+      if (!isOpen()) {
+        return;
+      }
+
       synchronized (clients) {
         clients.remove(this);
       }
-      super.close();
-      if (handle != null) {
-        if (!handle.getState().isFinal()) {
-          LOG.log(Level.WARNING, "Lost connection to spark application.");
-          handle.setState(SparkAppHandle.State.LOST);
-        }
-        handle.disconnect();
+
+      synchronized (this) {
+        super.close();
+        notifyAll();
--- End diff --

See L239.
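The wait/notifyAll handshake being discussed can be illustrated with a minimal sketch. This is not the actual LauncherServer code; the class and method names here are invented. The idea: `close()` flips a flag under the object's monitor and wakes every thread parked in a guarded wait loop on that same monitor.

```java
// Sketch of the guarded-block idiom: waiters loop on the condition
// (protecting against spurious wakeups), and close() performs the state
// change and the notifyAll() atomically with respect to that loop.
public class ConnectionSketch {
  private boolean open = true;

  public synchronized void close() {
    if (!open) {
      return;            // idempotent: second close is a no-op
    }
    open = false;
    notifyAll();         // wake threads parked in waitForClose()
  }

  public synchronized void waitForClose() throws InterruptedException {
    while (open) {       // re-check after every wakeup
      wait();
    }
  }

  public static void main(String[] args) throws Exception {
    ConnectionSketch conn = new ConnectionSketch();
    Thread waiter = new Thread(() -> {
      try {
        conn.waitForClose();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      System.out.println("connection closed");
    });
    waiter.start();
    Thread.sleep(100);   // give the waiter time to park (demo only)
    conn.close();
    waiter.join();
  }
}
```

This answers the "who might be waiting?" question in general terms: any thread blocked in a `wait()` loop on the connection's monitor, such as a `waitFor()`-style helper that must not return until the connection thread has fully shut down.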
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162693890

--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java ---
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws IOException {
   }

   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
--- End diff --

We never *needed* to change it, but the extra code wasn't doing anything useful, so I chose the simpler version.
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162693731

--- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {

   @Override
   public synchronized void kill() {
-    disconnect();
-    if (childProc != null) {
-      if (childProc.isAlive()) {
-        childProc.destroyForcibly();
+    if (!isDisposed()) {
+      setState(State.KILLED);
--- End diff --

None of the calls below should raise exceptions. Even the socket close is wrapped in a try..catch.
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162550217

--- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {

   @Override
   public synchronized void kill() {
-    disconnect();
-    if (childProc != null) {
-      if (childProc.isAlive()) {
-        childProc.destroyForcibly();
+    if (!isDisposed()) {
+      setState(State.KILLED);
--- End diff --

+1, I had the same question in the last review. We should figure it out.
Github user gengliangwang commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162548031

--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java ---
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws IOException {
   }

   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
     if (!closed) {
--- End diff --

=> isOpen
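The suggestion is to read the guard through an `isOpen()` helper rather than testing the raw `closed` flag. A minimal sketch of that idempotent-close pattern, using a hypothetical class rather than the real LauncherConnection:

```java
// Sketch: close() is synchronized and checks an isOpen() helper, so
// calling it twice (or from two threads) releases resources exactly once.
public class GuardedClose {
  private boolean open = true;

  public synchronized boolean isOpen() {
    return open;
  }

  public synchronized void close() {
    if (!isOpen()) {
      return;            // already closed; calling again is harmless
    }
    open = false;
    // ... release the socket / streams here, wrapped in try..catch ...
  }

  public static void main(String[] args) {
    GuardedClose g = new GuardedClose();
    g.close();
    g.close();           // second call is a no-op
    System.out.println(g.isOpen());
  }
}
```

Making `close()` `synchronized` (as the diff does) matters as much as the flag: without it, two threads could both observe `open == true` and run the teardown twice.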
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162525240

--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException {
           timeout.cancel();
         }
         close();
+        if (handle != null) {
+          handle.dispose();
+        }
       } finally {
         timeoutTimer.purge();
       }
     }

     @Override
     public void close() throws IOException {
+      if (!isOpen()) {
+        return;
+      }
+
       synchronized (clients) {
         clients.remove(this);
       }
-      super.close();
-      if (handle != null) {
-        if (!handle.getState().isFinal()) {
-          LOG.log(Level.WARNING, "Lost connection to spark application.");
-          handle.setState(SparkAppHandle.State.LOST);
-        }
-        handle.disconnect();
+
+      synchronized (this) {
+        super.close();
+        notifyAll();
--- End diff --

Why `notifyAll`? Who might be waiting?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162525142

--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException {
           timeout.cancel();
         }
         close();
+        if (handle != null) {
+          handle.dispose();
+        }
       } finally {
         timeoutTimer.purge();
       }
     }

     @Override
     public void close() throws IOException {
+      if (!isOpen()) {
+        return;
+      }
+
       synchronized (clients) {
         clients.remove(this);
       }
-      super.close();
-      if (handle != null) {
-        if (!handle.getState().isFinal()) {
-          LOG.log(Level.WARNING, "Lost connection to spark application.");
-          handle.setState(SparkAppHandle.State.LOST);
-        }
-        handle.disconnect();
--- End diff --

Why don't we disconnect now?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162524725

--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherConnection.java ---
@@ -95,15 +95,15 @@ protected synchronized void send(Message msg) throws IOException {
   }

   @Override
-  public void close() throws IOException {
+  public synchronized void close() throws IOException {
--- End diff --

Do we still need to change this method?
Github user cloud-fan commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162524615

--- Diff: launcher/src/main/java/org/apache/spark/launcher/ChildProcAppHandle.java ---
@@ -48,14 +48,16 @@ public synchronized void disconnect() {

   @Override
   public synchronized void kill() {
-    disconnect();
-    if (childProc != null) {
-      if (childProc.isAlive()) {
-        childProc.destroyForcibly();
+    if (!isDisposed()) {
+      setState(State.KILLED);
--- End diff --

So we should set the state to `KILLED` as soon as the `kill` method is called? Even if the code below fails (throws an exception), the state should still be `KILLED`?
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162159354

--- Diff: launcher/src/main/java/org/apache/spark/launcher/LauncherServer.java ---
@@ -331,23 +358,27 @@ protected void handle(Message msg) throws IOException {
           timeout.cancel();
         }
         close();
+        if (handle != null) {
--- End diff --

This is the fix for the new race described in the summary (the code is moved here from below). This changes behavior slightly: the handle now waits for the child process / thread to finish before disposing itself, whereas before that would happen as soon as the connection to the child process / thread was closed.
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/20297#discussion_r162159436

--- Diff: resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala ---
@@ -381,7 +381,9 @@ private object YarnClusterDriver extends Logging with Matchers {

     // Verify that the config archive is correctly placed in the classpath of all containers.
     val confFile = "/" + Client.SPARK_CONF_FILE
-    assert(getClass().getResource(confFile) != null)
+    if (conf.getOption(SparkLauncher.DEPLOY_MODE) == Some("cluster")) {
--- End diff --

This is the bug in the YARN tests that the fix uncovered.
GitHub user vanzin opened a pull request: https://github.com/apache/spark/pull/20297

[SPARK-23020][CORE] Fix races in launcher code, test.

The race in the code is because the handle might update its state to the wrong state if the connection handling thread is still processing incoming data; so the handle needs to wait for the connection to finish up before checking the final state.

The race in the test is because when waiting for a handle to reach a final state, the waitFor() method needs to wait until all handle state is updated (which also includes waiting for the connection thread above to finish). Otherwise, waitFor() may return too early, which would cause a bunch of different races (like the listener not yet being notified of the state change, or being in the middle of being notified, or the handle not being properly disposed and causing postChecks() to assert).

On top of that I found, by code inspection, a couple of potential races that could make a handle end up in the wrong state when being killed.

The original version of this fix introduced the flipped version of the first race described above: the connection closing might override the handle state before the handle has a chance to do cleanup. The fix there is to only dispose of the handle from the connection when there is an error, and let the handle dispose itself in the normal case.

The fix also caused a bug in YarnClusterSuite to surface; the code was checking for a file in the classpath that was not expected to be there in client mode. Because of the above issues, the error was not propagating correctly and the (buggy) test was incorrectly passing.

Tested by running the existing unit tests a lot (and not seeing the errors I was seeing before).
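The waitFor() behavior described above — not returning until the handle is both in a final state and fully cleaned up — could be sketched like this. This is a hypothetical helper for illustration, not the actual test code; the `Handle` interface and method names are invented.

```java
import java.util.concurrent.TimeUnit;

// Sketch: a test-style waitFor() that requires BOTH conditions before
// returning, so callers never observe a final state while the connection
// thread is still running listener callbacks or disposal.
public class WaitForSketch {
  public interface Handle {
    boolean isFinal();     // has the handle reached a terminal state?
    boolean isDisposed();  // has all bookkeeping (connection thread, cleanup) finished?
  }

  public static boolean waitFor(Handle h, long timeoutMs) throws InterruptedException {
    long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
    while (System.nanoTime() < deadline) {
      if (h.isFinal() && h.isDisposed()) {
        return true;       // safe to run post-checks now
      }
      Thread.sleep(10);    // simple polling; real code could wait/notify instead
    }
    return false;          // timed out before the handle settled
  }

  public static void main(String[] args) throws Exception {
    Handle done = new Handle() {
      public boolean isFinal() { return true; }
      public boolean isDisposed() { return true; }
    };
    System.out.println(waitFor(done, 1000));
  }
}
```

Checking only `isFinal()` would reproduce the race the PR describes: the state flips to final slightly before disposal completes, so a caller that returns on the state alone can race with the listener notifications still in flight.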
You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/vanzin/spark SPARK-23020

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20297.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20297

commit 8bde21a1cbdab3c49a85c1da960f4d9c7bf70064
Author: Marcelo Vanzin
Date: 2018-01-16T06:40:44Z

    [SPARK-23020][CORE] Fix races in launcher code, test.