[jira] [Assigned] (SPARK-17022) Potential deadlock in driver handling message
[ https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17022:
------------------------------------

    Assignee: Apache Spark

> Potential deadlock in driver handling message
> ---------------------------------------------
>
>                 Key: SPARK-17022
>                 URL: https://issues.apache.org/jira/browse/SPARK-17022
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.5.0, 1.5.1, 1.5.2, 1.6.0, 1.6.1, 2.0.0
>            Reporter: Tao Wang
>            Assignee: Apache Spark
>            Priority: Critical
>
> Suppose t1 < t2 < t3.
> At t1, someone calls YarnSchedulerBackend.doRequestTotalExecutors from one of three methods (CoarseGrainedSchedulerBackend.killExecutors, CoarseGrainedSchedulerBackend.requestTotalExecutors, or CoarseGrainedSchedulerBackend.requestExecutors), all of which hold the `CoarseGrainedSchedulerBackend` lock.
> YarnSchedulerBackend.doRequestTotalExecutors then sends a RequestExecutors message to `yarnSchedulerEndpoint` and waits for the reply.
> At t2, someone sends a RemoveExecutor message to `yarnSchedulerEndpoint`, and the endpoint receives it.
> At t3, the endpoint receives the RequestExecutors message sent at t1.
> The endpoint therefore handles RemoveExecutor before RequestExecutors.
> While handling RemoveExecutor, it forwards the same message to `driverEndpoint` and waits for the reply.
> To handle that message, `driverEndpoint` must acquire the `CoarseGrainedSchedulerBackend` lock, which has been held since t1.
> The result is a deadlock.
> We have hit this issue in our deployment: it blocks the driver from handling any messages until both messages time out.
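The lock cycle can be reproduced outside Spark. Below is a minimal, self-contained Scala sketch of the pattern described above, not Spark's actual RPC code: single-threaded executors stand in for `yarnSchedulerEndpoint` and `driverEndpoint`, a synchronized `Backend` object stands in for the `CoarseGrainedSchedulerBackend` lock, and a short sleep simulates the t1 < t2 < t3 message ordering. All names are illustrative.

import java.util.concurrent.{Executors, TimeUnit}

// Sketch of the deadlock, NOT Spark's RPC code. Each "endpoint" processes
// its inbox on a single thread, like an RPC endpoint's message loop.
object DeadlockSketch {
  private val yarnEndpoint   = Executors.newSingleThreadExecutor()
  private val driverEndpoint = Executors.newSingleThreadExecutor()

  object Backend {
    // Mirrors doRequestTotalExecutors: runs with the backend lock held,
    // then blocks waiting for yarnEndpoint's reply.
    def requestTotalExecutors(): Unit = synchronized {
      val reply = yarnEndpoint.submit(new Runnable {
        def run(): Unit = println("yarnEndpoint: handled RequestExecutors")
      })
      reply.get(5, TimeUnit.SECONDS) // blocking "ask"; times out under deadlock
    }

    // Invoked by driverEndpoint while handling RemoveExecutor; needs the same lock.
    def removeExecutor(id: String): Unit = synchronized {
      println(s"driverEndpoint: removed executor $id")
    }
  }

  def main(args: Array[String]): Unit = {
    // t2 (handled first): RemoveExecutor occupies yarnEndpoint's only thread,
    // which forwards the message to driverEndpoint and blocks on the reply.
    yarnEndpoint.submit(new Runnable {
      def run(): Unit = {
        Thread.sleep(200) // let the backend take its lock first (t1 < t2 < t3)
        driverEndpoint.submit(new Runnable {
          def run(): Unit = Backend.removeExecutor("1")
        }).get()
      }
    })
    // t1: the caller holds the backend lock and waits on yarnEndpoint, whose
    // thread waits on driverEndpoint, which waits for the backend lock.
    try Backend.requestTotalExecutors()
    catch {
      case _: java.util.concurrent.TimeoutException =>
        println("deadlock: RequestExecutors reply timed out")
    }
    yarnEndpoint.shutdown()
    driverEndpoint.shutdown()
  }
}

Under these assumptions, running the sketch prints the timeout message after about five seconds and only then lets RemoveExecutor and RequestExecutors complete, matching the reported behavior of the driver handling no messages until both blocking asks time out.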
[jira] [Assigned] (SPARK-17022) Potential deadlock in driver handling message
[ https://issues.apache.org/jira/browse/SPARK-17022?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-17022:
------------------------------------

    Assignee: (was: Apache Spark)