[ https://issues.apache.org/jira/browse/YARN-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880745#comment-13880745 ]
Bikas Saha commented on YARN-1618:
----------------------------------

App goes from NEW->NEW_SAVING upon receiving START. It goes from NEW_SAVING->SUBMITTED after the app is saved. It goes from NEW_SAVING->FINAL_SAVING if killed while saving. All of these work fine.

Given the above transitions, the app should go from NEW->KILLED if it is killed before receiving the START event. The START event should be ignored in the KILLED state (currently it is not ignored). So if START comes after KILL, it is a no-op. If START comes before KILL, the state store is fine, since the app will first be saved and then updated.

It's interesting that we caught the race such that KILL came before START. During regular app submission, the START should come almost immediately after the RMAppImpl object is created in the NEW state. Karthik, are we sure that this happened?

This should not happen during recovery, since the RMAppImpl moves from NEW->NEXT_STATE after receiving the RECOVER event. RPC servers should not be running during recovery. Vinod, is it still the case that RPC servers are started after recovery is complete?

> Applications transition from NEW to FINAL_SAVING, and try to update
> non-existing entries in the state-store
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1618
>                 URL: https://issues.apache.org/jira/browse/YARN-1618
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>   Affects Versions: 2.2.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: yarn-1618-1.patch
>
>
> YARN-891 augments the RMStateStore to store information on completed
> applications. In the process, it adds transitions from NEW to FINAL_SAVING.
> This leads to the RM trying to update entries in the state-store that do not
> exist. On ZKRMStateStore, this leads to the RM crashing.
> Previous description:
> ZKRMStateStore fails to handle updates to znodes that don't exist. For
> instance, this can happen when an app transitions from NEW to FINAL_SAVING.
> In these cases, the store should create the missing znode and handle the
> update.

--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
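
For reference, the transitions proposed in the comment above can be sketched as a small transition table. This is only an illustration, not the actual RMAppImpl code: the class AppStateSketch, the event name APP_SAVED, and the next() helper are hypothetical names chosen for this sketch, and the table models the proposed behaviour (START ignored in KILLED), not what the code currently does. There is deliberately no NEW->FINAL_SAVING edge, so a kill before START never asks the state store to update an entry that was never saved.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of the proposed transitions; not the real RMAppImpl.
public class AppStateSketch {
    enum State { NEW, NEW_SAVING, SUBMITTED, FINAL_SAVING, KILLED }
    enum Event { START, APP_SAVED, KILL } // APP_SAVED is an assumed name

    private static final Map<State, Map<Event, State>> TRANSITIONS =
            new EnumMap<State, Map<Event, State>>(State.class);

    static {
        add(State.NEW, Event.START, State.NEW_SAVING);
        // Killed before START: nothing was saved, so skip FINAL_SAVING.
        add(State.NEW, Event.KILL, State.KILLED);
        add(State.NEW_SAVING, Event.APP_SAVED, State.SUBMITTED);
        // Killed while saving: the entry will exist, so an update is safe.
        add(State.NEW_SAVING, Event.KILL, State.FINAL_SAVING);
        // A late START after KILL is a no-op.
        add(State.KILLED, Event.START, State.KILLED);
    }

    private static void add(State from, Event on, State to) {
        TRANSITIONS.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class))
                   .put(on, to);
    }

    // Returns the next state, or throws if the event is not valid in this state.
    static State next(State current, Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        if (row == null || !row.containsKey(event)) {
            throw new IllegalStateException(
                    "Invalid event " + event + " in state " + current);
        }
        return row.get(event);
    }

    public static void main(String[] args) {
        // Race case: KILL arrives before START.
        State s = next(State.NEW, Event.KILL);
        s = next(s, Event.START);
        System.out.println(s); // prints KILLED

        // Normal ordering: START first, then KILL while saving.
        s = next(State.NEW, Event.START);
        s = next(s, Event.KILL);
        System.out.println(s); // prints FINAL_SAVING
    }
}
```

In this sketch, an event with no entry in the table raises an exception. That makes the "START in KILLED" row necessary: without it, the late START in the race case would be reported as an invalid transition rather than ignored.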