[jira] [Updated] (SPARK-17696) Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status

2016-09-28 Thread Marcelo Vanzin (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin updated SPARK-17696:
---
Fix Version/s: (was: 2.0.0)

> Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status
> ---
>
> Key: SPARK-17696
> URL: https://issues.apache.org/jira/browse/SPARK-17696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Charles Allen
>Priority: Minor
> Fix For: 1.6.3
>
>
> There's a race in the shutdown path of CoarseGrainedExecutorBackend that may 
> lead to the process exiting with the wrong status. When the race triggers, 
> you can see things like this in the driver logs in yarn-cluster mode:
> {noformat}
> 14:38:20,114 [Driver] INFO  org.apache.spark.SparkContext - Successfully 
> stopped SparkContext
> {noformat}
> And later:
> {noformat}
> 14:38:22,455 [Reporter] WARN  org.apache.spark.deploy.yarn.YarnAllocator - 
> Container marked as failed: container_1470951093505_0001_01_02 on host: 
> xxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
> Container id: container_1470951093505_0001_01_02
> Exit code: 1
> {noformat}
> This happens because the user class is still running after the SparkContext 
> is shut down, so the YarnAllocator instance is alive for long enough to fetch 
> the exit status of the container. If the race is triggered, the container 
> exits with the wrong status. In this case, enough containers hit the race 
> that the application ended up failing due to too many container failures, 
> even though the app would probably have succeeded otherwise.
> The race is as follows:
> - CoarseGrainedExecutorBackend receives a StopExecutor
> - Before it can enqueue a "Shutdown" message, the socket is disconnected and 
> NettyRpcEnv enqueues a "RemoteProcessDisconnected" message
> - "RemoteProcessDisconnected" is processed first, and calls "System.exit" 
> with wrong exit code for this case.
> You can see that in the executor logs: both messages are being processed.
> {noformat}
> 14:38:20,093 [dispatcher-event-loop-9] INFO  
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver commanded a 
> shutdown
> 14:38:20,286 [dispatcher-event-loop-9] ERROR 
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver xxx:40988 
> disassociated! Shutting down.
> {noformat}
> The code needs to avoid this situation by ignoring the disconnect event if 
> it's already shutting down.
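
For illustration, here is a minimal, self-contained Scala sketch of that idea. It is not the actual Spark patch: the message names mirror the ones above, but the object, the exit-code plumbing, and the single-threaded replay in main are simplified assumptions. The point is simply that a flag set while handling the stop request lets the disconnect handler stay quiet instead of exiting with an error status.

{noformat}
object ShutdownRaceSketch {

  sealed trait Message
  case object StopExecutor extends Message              // driver asked the executor to stop
  case object Shutdown extends Message                  // self-message that performs the actual shutdown
  case object RemoteProcessDisconnected extends Message // RPC layer noticed the socket closed

  // Written by the StopExecutor handler, read when the disconnect is dispatched.
  @volatile private var stopping = false

  /** Returns the exit code the backend would use for this message, if any. */
  def handle(msg: Message): Option[Int] = msg match {
    case StopExecutor =>
      stopping = true     // flag the shutdown before the Shutdown message is even enqueued
      None
    case Shutdown =>
      Some(0)             // orderly shutdown
    case RemoteProcessDisconnected =>
      if (stopping) None  // expected while shutting down: ignore instead of exiting with 1
      else Some(1)        // genuine loss of the driver: keep the error status
  }

  def main(args: Array[String]): Unit = {
    // Replay the racy ordering from the logs: the disconnect is processed before Shutdown.
    val exitCode = Seq(StopExecutor, RemoteProcessDisconnected, Shutdown).flatMap(handle).head
    println(s"exit code: $exitCode")  // prints 0; without the stopping flag it would be 1
  }
}
{noformat}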






[jira] [Updated] (SPARK-17696) Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status

2016-09-28 Thread Shixiong Zhu (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-17696:
-
Assignee: Charles Allen

> Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status
> ---
>
> Key: SPARK-17696
> URL: https://issues.apache.org/jira/browse/SPARK-17696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Assignee: Charles Allen
>Priority: Minor
> Fix For: 1.6.3, 2.0.0
>
>
> There's a race in the shutdown path of CoarseGrainedExecutorBackend that may 
> lead to the process exiting with the wrong status. When the race triggers, 
> you can see things like this in the driver logs in yarn-cluster mode:
> {noformat}
> 14:38:20,114 [Driver] INFO  org.apache.spark.SparkContext - Successfully 
> stopped SparkContext
> {noformat}
> And later:
> {noformat}
> 14:38:22,455 [Reporter] WARN  org.apache.spark.deploy.yarn.YarnAllocator - 
> Container marked as failed: container_1470951093505_0001_01_02 on host: 
> xxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
> Container id: container_1470951093505_0001_01_02
> Exit code: 1
> {noformat}
> This happens because the user class is still running after the SparkContext 
> is shut down, so the YarnAllocator instance is alive for long enough to fetch 
> the exit status of the container. If the race is triggered, the container 
> exits with the wrong status. In this case, enough containers hit the race 
> that the application ended up failing due to too many container failures, 
> even though the app would probably have succeeded otherwise.
> The race is as follows:
> - CoarseGrainedExecutorBackend receives a StopExecutor
> - Before it can enqueue a "Shutdown" message, the socket is disconnected and 
> NettyRpcEnv enqueues a "RemoteProcessDisconnected" message
> - "RemoteProcessDisconnected" is processed first, and calls "System.exit" 
> with wrong exit code for this case.
> You can see that in the executor logs: both messages are being processed.
> {noformat}
> 14:38:20,093 [dispatcher-event-loop-9] INFO  
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver commanded a 
> shutdown
> 14:38:20,286 [dispatcher-event-loop-9] ERROR 
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver xxx:40988 
> disassociated! Shutting down.
> {noformat}
> The code needs to avoid this situation by ignoring the disconnect event if 
> it's already shutting down.






[jira] [Updated] (SPARK-17696) Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status

2016-09-27 Thread Marcelo Vanzin (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin updated SPARK-17696:
---
Affects Version/s: (was: 2.0.0)

> Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status
> ---
>
> Key: SPARK-17696
> URL: https://issues.apache.org/jira/browse/SPARK-17696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> There's a race in the shutdown path of CoarseGrainedExecutorBackend that may 
> lead to the process exiting with the wrong status. When the race triggers, 
> you can see things like this in the driver logs in yarn-cluster mode:
> {noformat}
> 14:38:20,114 [Driver] INFO  org.apache.spark.SparkContext - Successfully 
> stopped SparkContext
> {noformat}
> And later:
> {noformat}
> 14:38:22,455 [Reporter] WARN  org.apache.spark.deploy.yarn.YarnAllocator - 
> Container marked as failed: container_1470951093505_0001_01_02 on host: 
> xxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
> Container id: container_1470951093505_0001_01_02
> Exit code: 1
> {noformat}
> This happens because the user class is still running after the SparkContext 
> is shut down, so the YarnAllocator instance is alive for long enough to fetch 
> the exit status of the container. If the race is triggered, the container 
> exits with the wrong status. In this case, enough containers hit the race 
> that the application ended up failing due to too many container failures, 
> even though the app would probably have succeeded otherwise.
> The race is as follows:
> - CoarseGrainedExecutorBackend receives a StopExecutor
> - Before it can enqueue a "Shutdown" message, the socket is disconnected and 
> NettyRpcEnv enqueues a "RemoteProcessDisconnected" message
> - "RemoteProcessDisconnected" is processed first, and calls "System.exit" 
> with wrong exit code for this case.
> You can see that in the executor logs: both messages are being processed.
> {noformat}
> 14:38:20,093 [dispatcher-event-loop-9] INFO  
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver commanded a 
> shutdown
> 14:38:20,286 [dispatcher-event-loop-9] ERROR 
> org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver xxx:40988 
> disassociated! Shutting down.
> {noformat}
> The code needs to avoid this situation by ignoring the disconnect event if 
> it's already shutting down.






[jira] [Updated] (SPARK-17696) Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status

2016-09-27 Thread Marcelo Vanzin (JIRA)

 [ https://issues.apache.org/jira/browse/SPARK-17696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Marcelo Vanzin updated SPARK-17696:
---
Description: 
There's a race in the shutdown path of CoarseGrainedExecutorBackend that may 
lead to the process exiting with the wrong status. When the race triggers, you 
can see things like this in the driver logs in yarn-cluster mode:

{noformat}
14:38:20,114 [Driver] INFO  org.apache.spark.SparkContext - Successfully 
stopped SparkContext
{noformat}

And later:

{noformat}
14:38:22,455 [Reporter] WARN  org.apache.spark.deploy.yarn.YarnAllocator - 
Container marked as failed: container_1470951093505_0001_01_02 on host: 
xxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1470951093505_0001_01_02
Exit code: 1
{noformat}

This happens because the user class is still running after the SparkContext is 
shut down, so the YarnAllocator instance is alive for long enough to fetch the 
exit status of the container. If the race is triggered, the container exits 
with the wrong status. In this case, enough containers hit the race that the 
application ended up failing due to too many container failures, even though 
the app would probably have succeeded otherwise.

The race is as follows:

- CoarseGrainedExecutorBackend receives a StopExecutor
- Before it can enqueue a "Shutdown" message, the socket is disconnected and 
NettyRpcEnv enqueues a "RemoteProcessDisconnected" message
- "RemoteProcessDisconnected" is processed first, and calls "System.exit" with 
wrong exit code for this case.

You can see that in the executor logs: both messages are being processed.

{noformat}
14:38:20,093 [dispatcher-event-loop-9] INFO  
org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver commanded a 
shutdown
14:38:20,286 [dispatcher-event-loop-9] ERROR 
org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver xxx:40988 
disassociated! Shutting down.
{noformat}

The code needs to avoid this situation by ignoring the disconnect event if it's 
already shutting down.

  was:
There's a race in the shutdown path of CoarseGrainedExecutorBackend that may 
lead to the process existing with the wrong status. When the race triggers, you 
can see things like this in the driver logs in yarn-cluster mode:

{noformat}
14:38:20,114 [Driver] INFO  org.apache.spark.SparkContext - Successfully 
stopped SparkContext
{noformat}

And later:

{noformat}
14:38:22,455 [Reporter] WARN  org.apache.spark.deploy.yarn.YarnAllocator - 
Container marked as failed: container_1470951093505_0001_01_02 on host: 
xxx.com. Exit status: 1. Diagnostics: Exception from container-launch.
Container id: container_1470951093505_0001_01_02
Exit code: 1
{noformat}

This happens because the user class is still running after the SparkContext is 
shut down, so the YarnAllocator instance is alive for long enough to fetch the 
exit status of the container. If the race is triggered, the container exits 
with the wrong status. In this case, enough containers hit the race that the 
application ended up failing due to too many container failures, even though 
the app would probably succeed otherwise.

The race is as follows:

- CoarseGrainedExecutorBackend receives a StopExecutor
- Before it can enqueue a "Shutdown" message, the socket is disconnected and 
NettyRpcEnv enqueues a "RemoteProcessDisconnected" message
- "RemoteProcessDisconnected" is processed first, and calls "System.exit" with 
wrong exit code for this case.

You can see that in the executor logs: both messages are being processed.

{noformat}
14:38:20,093 [dispatcher-event-loop-9] INFO  
org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver commanded a 
shutdown
14:38:20,286 [dispatcher-event-loop-9] ERROR 
org.apache.spark.executor.CoarseGrainedExecutorBackend - Driver xxx:40988 
disassociated! Shutting down.
{noformat}

The code needs to avoid this situation by ignoring the disconnect event if it's 
already shutting down.


> Race in CoarseGrainedExecutorBackend shutdown can lead to wrong exit status
> ---
>
> Key: SPARK-17696
> URL: https://issues.apache.org/jira/browse/SPARK-17696
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core, YARN
>Affects Versions: 1.6.0, 2.0.0
>Reporter: Marcelo Vanzin
>Priority: Minor
>
> There's a race in the shutdown path of CoarseGrainedExecutorBackend that may 
> lead to the process exiting with the wrong status. When the race triggers, 
> you can see things like this in the driver logs in yarn-cluster mode:
> {noformat}
> 14:38:20,114 [Driver] INFO  org.apache.spark.SparkContext - Successfully 
> stopped SparkContext
> {noformat}
> And later:
> {noformat}
> 14:38:22,455 [Reporter] WARN  org.apache.spark.deploy.yarn.YarnAllocator - 
>