[jira] [Commented] (SPARK-17022) Potential deadlock in driver handling message

2016-08-12 Thread Tao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418488#comment-15418488
 ] 

Tao Wang commented on SPARK-17022:
--

Looks like they are related, especially SPARK-16702.

> Potential deadlock in driver handling message
> -
>
> Key: SPARK-17022
> URL: https://issues.apache.org/jira/browse/SPARK-17022
> Project: Spark
>  Issue Type: Bug
>  Components: YARN
>Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
>Reporter: Tao Wang
>Assignee: Tao Wang
>Priority: Critical
> Fix For: 2.0.1, 2.1.0
>
>
> Suppose t1 < t2 < t3.
> At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one of 
> three functions: CoarseGrainedSchedulerBackend.killExecutors, 
> CoarseGrainedSchedulerBackend.requestTotalExecutors or 
> CoarseGrainedSchedulerBackend.requestExecutors, all of which hold the lock on 
> `CoarseGrainedSchedulerBackend`.
> YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors 
> message to `yarnSchedulerEndpoint` and waits for the reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint`, and it 
> is received by the endpoint.
> At t3, the RequestExecutors message sent at t1 is received by the endpoint.
> The endpoint therefore handles RemoveExecutor first and the RequestExecutors 
> message second.
> While handling RemoveExecutor, it sends the same message to `driverEndpoint` and 
> waits for the reply.
> `driverEndpoint` needs the `CoarseGrainedSchedulerBackend` lock to handle that 
> message, but that lock has been held since t1.
> The result is a deadlock.
> We have hit this issue in our deployment: it blocks the driver from handling any 
> messages until both messages time out.
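
The cycle described above can be reproduced outside Spark with a minimal sketch. Everything below is illustrative, not Spark's actual code: two single-threaded executors stand in for the serial inboxes of `yarnSchedulerEndpoint` and `driverEndpoint`, and a plain object monitor stands in for the `CoarseGrainedSchedulerBackend` lock.

```scala
import java.util.concurrent.{CountDownLatch, Executors}

object DeadlockSketch {
  // Stand-in for the CoarseGrainedSchedulerBackend lock.
  private val backendLock = new Object
  // Single-threaded executors model the serial inboxes of the two endpoints.
  private val yarnEndpoint   = Executors.newSingleThreadExecutor()
  private val driverEndpoint = Executors.newSingleThreadExecutor()

  // RemoveExecutor handler on yarnEndpoint: forward the message to driverEndpoint
  // and wait for its reply. The driver-side handler needs backendLock.
  private def handleRemoveExecutor(): Unit = {
    val replied = new CountDownLatch(1)
    driverEndpoint.submit(new Runnable {
      def run(): Unit = backendLock.synchronized { replied.countDown() } // blocks on the lock
    })
    replied.await() // yarnEndpoint's only thread parks here, stalling its inbox
  }

  def main(args: Array[String]): Unit = {
    val lockHeld = new CountDownLatch(1)

    // t1: acquire the backend lock, then send RequestExecutors and block on the
    // reply while still holding the lock.
    val requester = new Thread(new Runnable {
      def run(): Unit = backendLock.synchronized {
        lockHeld.countDown()
        Thread.sleep(200) // delivery delay: the message only reaches the inbox at t3
        val replied = new CountDownLatch(1)
        yarnEndpoint.submit(new Runnable { def run(): Unit = replied.countDown() })
        replied.await()   // never satisfied: the inbox is stuck behind RemoveExecutor
      }
    })
    requester.start()
    lockHeld.await()

    // t2: RemoveExecutor reaches the inbox first, so it is handled first.
    yarnEndpoint.submit(new Runnable { def run(): Unit = handleRemoveExecutor() })

    requester.join(2000)
    println(if (requester.isAlive) "deadlocked, as described above" else "completed")
    sys.exit(0) // the blocked threads would otherwise keep the JVM alive
  }
}
```

The requester thread parks on its reply latch while holding the lock, the yarnEndpoint thread parks inside the RemoveExecutor handler, and the driverEndpoint thread parks on the lock, which is exactly the cycle the issue describes.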



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17022) Potential deadlock in driver handling message

2016-08-12 Thread Jason Moore (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15418470#comment-15418470
 ] 

Jason Moore commented on SPARK-17022:
-

This one is maybe related to SPARK-16533 and/or SPARK-16702, right?  My team 
works in an environment where preemption (and the killing of executors) is a 
common occurrence, so we have been burnt a bit by this one.  We had been putting 
together a patch, but I'll see how this one holds up.




[jira] [Commented] (SPARK-17022) Potential deadlock in driver handling message

2016-08-11 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15417582#comment-15417582
 ] 

Apache Spark commented on SPARK-17022:
--

User 'WangTaoTheTonic' has created a pull request for this issue:
https://github.com/apache/spark/pull/14605
