[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178653#comment-15178653 ]

Vinod Kumar Vavilapalli commented on YARN-1489:
-----------------------------------------------

bq. That and the "Old running containers don't know where the new AM is running." issue is big enough that we shouldn't close this umbrella as done.

Just filed YARN-4758.

> [Umbrella] Work-preserving ApplicationMaster restart
> ----------------------------------------------------
>
>                 Key: YARN-1489
>                 URL: https://issues.apache.org/jira/browse/YARN-1489
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>         Attachments: Work preserving AM restart.pdf
>
>
> Today if AMs go down,
>  - RM kills all the containers of that ApplicationAttempt
>  - New ApplicationAttempt doesn't know where the previous containers are running
>  - Old running containers don't know where the new AM is running.
> We need to fix this to enable work-preserving AM restart. The latter two potentially can be done at the app level, but it is good to have a common solution for all apps wherever possible.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15167132#comment-15167132 ]

Junping Du commented on YARN-1489:
----------------------------------

Another big problem is that we don't actually notify a restarted AM of the containers that finished in the interim. In YARN-1041, the RM only reports the running containers to the new AM attempt, so if a container finishes during this time, the new AM attempt has no way to know about it. This leads to different behaviors across AMs: MR relies on the job history log to recover tasks, while Distributed Shell launches these (finished) containers again. I think we should enhance this. I will file a separate JIRA if we don't have one yet.
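As a rough illustration of the gap described in this comment, here is a small Python sketch of how a restarted AM might reconcile its own record of launched containers against what the RM reports as still running. The function name `reconcile`, its parameters, and the idea of an application-side log are assumptions made for illustration (loosely modeled on the remark that MR relies on its job history log); nothing here is a YARN API.

```python
def reconcile(launched, completed_in_log, running_from_rm):
    """Classify containers launched by a previous attempt (hypothetical helper).

    launched          -- container ids the application recorded as launched
    completed_in_log  -- container ids the application's own log marks finished
    running_from_rm   -- container ids the RM reports as still running
    """
    launched = set(launched)
    completed = set(completed_in_log) & launched
    running = set(running_from_rm) & launched
    # Neither running nor recorded as finished: the new attempt cannot tell
    # whether these succeeded, failed, or were lost during the restart.
    unknown = launched - running - completed
    return running, completed, unknown


running, completed, unknown = reconcile(
    launched=["c1", "c2", "c3", "c4"],
    completed_in_log=["c1"],
    running_from_rm=["c2", "c3"],
)
```

An application with no log of its own would pass an empty `completed_in_log`, so every container that finished in the interim lands in `unknown`, which is the case where work may be launched again.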
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15162794#comment-15162794 ]

Junping Du commented on YARN-1489:
----------------------------------

bq. That and the "Old running containers don't know where the new AM is running." issue is big enough that we shouldn't close this umbrella as done.

I don't think we have an open JIRA under this umbrella to track this issue. Is this a specific issue for MR (like we discussed on MAPREDUCE-6608) or a generic issue for other frameworks (Spark, etc.) too? YARN-4602 was filed to track this as a generic problem for message passing between containers.
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15160061#comment-15160061 ]

Vinod Kumar Vavilapalli commented on YARN-1489:
-----------------------------------------------

That and the "Old running containers don't know where the new AM is running." issue is big enough that we shouldn't close this umbrella as done.
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096879#comment-15096879 ]

Karthik Kambatla commented on YARN-1489:
----------------------------------------

The two unassigned open JIRAs still seem very valid. One of them might be related to YARN-1815. I am comfortable with converting them to issues and closing the umbrella JIRA.
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096860#comment-15096860 ]

Junping Du commented on YARN-1489:
----------------------------------

Hi [~vinodkv], [~jianhe] and [~ka...@cloudera.com], is this feature already completed? If so, maybe we can close it as fixed?
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13993962#comment-13993962 ]

Karthik Kambatla commented on YARN-1489:
----------------------------------------

Created a couple of sub-tasks based on an offline discussion with Anubhav, Bikas, Jian and Vinod.
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13868891#comment-13868891 ]

Bikas Saha commented on YARN-1489:
----------------------------------

We need to come to a conclusion on how to allow the containers to also find out about the new AMs. Something we have discussed in the past:
1) The new AM, upon registering, provides a payload to the RM.
2) The RM syncs the payload with the NMs on heartbeat. RM and NM already sync on running application state, so this payload could piggyback on that.
3) A container on an NM could query the NM about its own AM's payload. This local API could be secured by a local token and made available only to containers running on the local node.
4) The containers would use this payload to reconnect with the AM (in case systems don't use external solutions like ZooKeeper for such tracking).
This sounds reasonably lightweight, scalable and self-contained. All the interested parties would be informed within a 2*(NmHeartbeat) time interval.
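The four steps in this proposal can be sketched as a toy in-memory model. This is only an illustration of the data flow, written in Python for brevity; the class and method names (`ResourceManager.register_am`, `NodeManager.query_am_payload`, and so on) and the example addresses are hypothetical and do not correspond to actual YARN classes or RPCs.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    # app_id -> opaque payload supplied by the newest AM attempt at registration
    payloads: dict = field(default_factory=dict)

    def register_am(self, app_id: str, payload: bytes) -> None:
        """Step 1: a (re)started AM registers and hands over its payload."""
        self.payloads[app_id] = payload

    def heartbeat_response(self, running_apps: set) -> dict:
        """Step 2: piggyback payloads for the apps the NM reports as running."""
        return {a: p for a, p in self.payloads.items() if a in running_apps}


@dataclass
class NodeManager:
    running_apps: set = field(default_factory=set)
    payloads: dict = field(default_factory=dict)

    def heartbeat(self, rm: ResourceManager) -> None:
        # Merge whatever the RM sent back into the NM's per-app state.
        self.payloads.update(rm.heartbeat_response(self.running_apps))

    def query_am_payload(self, app_id: str):
        """Step 3: local query from a container; stale until the next heartbeat."""
        return self.payloads.get(app_id)


rm = ResourceManager()
nm = NodeManager(running_apps={"app_1"})

rm.register_am("app_1", b"am-host-a:4000")  # first attempt registers
nm.heartbeat(rm)
rm.register_am("app_1", b"am-host-b:4001")  # restarted attempt on another host
stale = nm.query_am_payload("app_1")        # NM has not heartbeated yet
nm.heartbeat(rm)
fresh = nm.query_am_payload("app_1")        # step 4: container reconnects with this
```

In this model a container sees the old payload until its NM completes one more heartbeat after the new AM registers, which is the propagation delay the comment bounds in terms of the NM heartbeat interval.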
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13864212#comment-13864212 ]

Steve Loughran commented on YARN-1489:
--------------------------------------

Regarding the rebinding problem, YARN-913 proposes a registry where we restrict the names of services and apps, and require uniqueness. This lets us register something like (hoya, stevel, accumulo5) and then let a client app look it up. Today we have the list of running apps, and you can find and bind to one, but
# there's nothing to stop a single user having more than one instance of the same name
# there's no way for an AM to enumerate this, as the list operation isn't in the AMRM protocol
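To make the uniqueness requirement concrete, here is a minimal Python sketch of a registry keyed by a (service type, user, instance name) triple, as in the (hoya, stevel, accumulo5) example. The `ServiceRegistry` class, its methods, and the addresses are invented for illustration and are not the YARN-913 design.

```python
class ServiceRegistry:
    """Toy registry that enforces one entry per (service_type, user, name)."""

    def __init__(self):
        self._entries = {}  # (service_type, user, name) -> address

    def register(self, service_type, user, name, address):
        key = (service_type, user, name)
        if key in self._entries:
            # Uniqueness: a second instance with the same triple is rejected.
            raise ValueError(f"{key} is already registered")
        self._entries[key] = address

    def rebind(self, service_type, user, name, address):
        """Update the address of an existing entry, e.g. after an AM restart."""
        key = (service_type, user, name)
        if key not in self._entries:
            raise KeyError(key)
        self._entries[key] = address

    def lookup(self, service_type, user, name):
        return self._entries[(service_type, user, name)]


registry = ServiceRegistry()
registry.register("hoya", "stevel", "accumulo5", "host-a:4000")
registry.rebind("hoya", "stevel", "accumulo5", "host-b:4001")
address = registry.lookup("hoya", "stevel", "accumulo5")
```

A client that only knows the triple can call `lookup` again after a restart and obtain the new address without tracking application attempts itself.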
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13864365#comment-13864365 ]

Steve Loughran commented on YARN-1489:
--------------------------------------

Actually, the simplest way for an AM to work with a restarted cluster would be if there were a blocking operation to list active containers. At startup it could get that list and use it to init its data structures; on a first start the list would be empty.

Alternatively, the restart information could be passed down in {{RegisterApplicationMasterResponse}}, which would avoid adding any new RPC calls.
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13864382#comment-13864382 ]

Bikas Saha commented on YARN-1489:
----------------------------------

The POR is for the attempt's AMRM register RPC to return the currently running containers for that app. So when the attempt makes its initial sync with the RM, it will get all that info.
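The idea of returning running containers in the register response can be sketched with a toy model. This Python sketch is illustrative only: `ToyResourceManager`, `ToyApplicationMaster`, `RegisterResponse` and its `previous_containers` field are hypothetical stand-ins, not the real AMRM protocol types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ContainerReport:
    container_id: str
    node: str


@dataclass
class RegisterResponse:
    # Containers of this app that were still running when the attempt registered.
    previous_containers: list = field(default_factory=list)


class ToyResourceManager:
    def __init__(self):
        self._running = {}  # app_id -> list of ContainerReport

    def container_started(self, app_id, report):
        self._running.setdefault(app_id, []).append(report)

    def register_application_master(self, app_id):
        # No extra RPC: the running containers ride along on registration.
        return RegisterResponse(list(self._running.get(app_id, [])))


class ToyApplicationMaster:
    def __init__(self, app_id):
        self.app_id = app_id
        self.known = {}  # container_id -> node

    def start(self, rm):
        response = rm.register_application_master(self.app_id)
        for report in response.previous_containers:
            self.known[report.container_id] = report.node


rm = ToyResourceManager()
first = ToyApplicationMaster("app_1")
first.start(rm)                      # first start: nothing to recover
rm.container_started("app_1", ContainerReport("c1", "node-3"))
rm.container_started("app_1", ContainerReport("c2", "node-7"))

second = ToyApplicationMaster("app_1")
second.start(rm)                     # restarted attempt learns about c1 and c2
```

On a first start the list is empty, so the same initialization path serves both a fresh AM and a restarted one.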
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13862359#comment-13862359 ]

Bikas Saha commented on YARN-1489:
----------------------------------

Here is an idea: the RM allows the app to send it some data during registration. This data could include the AM port information, etc. The RM could then sync this data with the NM during the NM heartbeat. The NMs already maintain per-app-attempt info, and this data would be added to that. The containers running on an NM could query for this attempt data and get the information about the new app attempt.

This would be a scalable and efficient solution. The data per NM will be small, since the data would be size-checked and proportional to the number of app attempts. The NM could give access to an attempt's data only to the containers that belong to that attempt. Only local containers should be able to communicate with their NM for such information. This could be done via a local access token that is supplied by the NM whenever it launches a container.
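The access-control part of this idea (a per-container local token plus a size check on the synced data) might look roughly like the following Python sketch. Everything here is an assumption for illustration: the `NodeManagerAttemptStore` class, the 1024-byte limit, and keying the data by application id rather than attempt id so that containers from an older attempt can still read the newest attempt's data.

```python
import hmac
import secrets


class NodeManagerAttemptStore:
    """Toy NM-side store: only a container holding its launch token may query."""

    def __init__(self, max_bytes=1024):
        self._max_bytes = max_bytes  # assumed limit; keeps per-NM state small
        self._data = {}              # app_id -> bytes synced from the RM
        self._tokens = {}            # container_id -> (token, app_id)

    def launch_container(self, container_id, app_id):
        # The token is generated at launch and handed only to that container.
        token = secrets.token_hex(16)
        self._tokens[container_id] = (token, app_id)
        return token

    def sync_from_rm(self, app_id, data):
        if len(data) > self._max_bytes:
            raise ValueError("attempt data exceeds size limit")
        self._data[app_id] = data

    def query(self, container_id, token, app_id):
        entry = self._tokens.get(container_id)
        if entry is None:
            raise PermissionError("unknown container")
        expected, owner_app = entry
        # Constant-time comparison avoids leaking the token through timing.
        if not hmac.compare_digest(expected, token):
            raise PermissionError("invalid token")
        if owner_app != app_id:
            raise PermissionError("container does not belong to this app")
        return self._data.get(app_id)


store = NodeManagerAttemptStore()
token = store.launch_container("c1", "app_1")
store.sync_from_rm("app_1", b"am-host-b:4001")
result = store.query("c1", token, "app_1")
```

A container that presents the wrong token, or asks about an application it does not belong to, gets a `PermissionError` instead of the data.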
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13859156#comment-13859156 ]

Zhijie Shen commented on YARN-1489:
-----------------------------------

Thanks Vinod for the proposal. One thought occurred to me when I read the following point.

bq. In case of apps like MapReduce where containers need to communicate directly with AMs, the old running-containers don’t know where the new ApplicationMaster is running and how to reach it (service addresses).

During an AM restart, a container may try to send messages to the AM in some applications, and these messages may get lost. Would it be good to buffer the outstanding messages and send them to the AM upon rebinding?
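The buffering suggestion can be sketched on the container side as follows. This is a minimal Python illustration under assumed names (`BufferingAmClient`, `unbind`, `rebind`) and an assumed bounded buffer that drops the oldest message on overflow; it is not part of any YARN client library.

```python
from collections import deque


class BufferingAmClient:
    """Container-side sender that queues messages while no AM is bound."""

    def __init__(self, max_buffered=1000):
        self._send_fn = None  # callable delivering to the current AM, or None
        # Bounded buffer: when full, the oldest message is silently dropped.
        self._pending = deque(maxlen=max_buffered)

    def unbind(self):
        """Call when the AM connection is lost."""
        self._send_fn = None

    def rebind(self, send_fn):
        """Call once the new AM's address is known; flushes in FIFO order."""
        self._send_fn = send_fn
        while self._pending:
            send_fn(self._pending.popleft())

    def send(self, message):
        if self._send_fn is None:
            self._pending.append(message)
        else:
            self._send_fn(message)


old_am, new_am = [], []
client = BufferingAmClient()
client.rebind(old_am.append)
client.send("status-1")   # delivered to the old AM
client.unbind()           # AM goes down
client.send("status-2")   # buffered
client.send("status-3")   # buffered
client.rebind(new_am.append)
client.send("status-4")   # delivered directly to the new AM
```

Messages sent during the outage reach the new AM in their original order. This sketch gives no delivery guarantee if the flush itself fails, so an application needing stronger semantics would have to add acknowledgements.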
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13846072#comment-13846072 ]

Vinod Kumar Vavilapalli commented on YARN-1489:
-----------------------------------------------

bq. Would be good to see an overall design document..

Yup, writing something up..
[jira] [Commented] (YARN-1489) [Umbrella] Work-preserving ApplicationMaster restart
[ https://issues.apache.org/jira/browse/YARN-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13845124#comment-13845124 ]

Bikas Saha commented on YARN-1489:
----------------------------------

Would be good to see an overall design document, especially for the tricky pieces like reconnecting existing running containers to new app attempts.