[
https://issues.apache.org/jira/browse/YARN-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880745#comment-13880745
]
Bikas Saha commented on YARN-1618:
----------------------------------
App goes from NEW->NEW_SAVING upon receiving START. It goes from
NEW_SAVING->SUBMITTED after app is saved. It goes from NEW_SAVING->FINAL_SAVING
if killed while saving. All of these work fine.
Given the above transitions, app should go from NEW->KILLED if it's killed
before receiving the START event. START event should be ignored in KILLED state
(currently it is not ignored). So if START comes after KILL then it's a no-op.
If START comes before KILL then state store is fine since the app will first be
saved and then updated.
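A minimal sketch of the transitions described above, to make the two orderings
of START and KILL concrete. This is a toy transition table, not the actual
RMAppImpl code: the class name AppStateSketch, the event name APP_SAVED and the
next() helper are assumptions made for illustration. The last two entries are
the proposed behaviour, not what the code currently does.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical, simplified model of the app state machine discussed here.
class AppStateSketch {
    enum State { NEW, NEW_SAVING, SUBMITTED, FINAL_SAVING, KILLED }
    enum Event { START, APP_SAVED, KILL }  // APP_SAVED is an assumed name

    private static final Map<State, Map<Event, State>> TABLE =
        new EnumMap<State, Map<Event, State>>(State.class);

    private static void add(State from, Event on, State to) {
        TABLE.computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class))
             .put(on, to);
    }

    static {
        // Transitions that work fine today, per the comment.
        add(State.NEW, Event.START, State.NEW_SAVING);
        add(State.NEW_SAVING, Event.APP_SAVED, State.SUBMITTED);
        add(State.NEW_SAVING, Event.KILL, State.FINAL_SAVING);
        // Proposed: a kill before START skips the state store entirely.
        add(State.NEW, Event.KILL, State.KILLED);
        // Proposed: a late START in KILLED is ignored (no-op).
        add(State.KILLED, Event.START, State.KILLED);
    }

    // Returns the next state, or throws if the transition is not modelled.
    static State next(State current, Event event) {
        Map<Event, State> row = TABLE.get(current);
        if (row == null || !row.containsKey(event)) {
            throw new IllegalStateException(
                "Invalid event " + event + " in state " + current);
        }
        return row.get(event);
    }

    public static void main(String[] args) {
        // KILL first, then START: the app ends in KILLED and nothing is saved.
        State s = next(next(State.NEW, Event.KILL), Event.START);
        System.out.println("KILL then START: " + s);
        // START first, then KILL: the app is saved first, then updated.
        s = next(next(State.NEW, Event.START), Event.KILL);
        System.out.println("START then KILL: " + s);
    }
}
```

With this table, NEW never goes directly to FINAL_SAVING, so an update is only
ever issued for an app that has already passed through NEW_SAVING.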
It's interesting that we caught the race such that KILL came before START. The
START should come almost immediately after the RMAppImpl object is created in a
NEW state during regular app submission. Karthik, are we sure that this
happened? This should not happen during recovery time since the RMAppImpl moves
from NEW->NEXT_STATE after receiving the RECOVER event. RPC servers should not
be running during recovery. Vinod, is it still the case that RPC servers are
started after recovery is complete?
> Applications transition from NEW to FINAL_SAVING, and try to update
> non-existing entries in the state-store
> -----------------------------------------------------------------------------------------------------------
>
> Key: YARN-1618
> URL: https://issues.apache.org/jira/browse/YARN-1618
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: resourcemanager
> Affects Versions: 2.2.0
> Reporter: Karthik Kambatla
> Assignee: Karthik Kambatla
> Priority: Blocker
> Attachments: yarn-1618-1.patch
>
>
> YARN-891 augments the RMStateStore to store information on completed
> applications. In the process, it adds transitions from NEW to FINAL_SAVING.
> This leads to the RM trying to update entries in the state-store that do not
> exist. On ZKRMStateStore, this leads to the RM crashing.
> Previous description:
> ZKRMStateStore fails to handle updates to znodes that don't exist. For
> instance, this can happen when an app transitions from NEW to FINAL_SAVING.
> In these cases, the store should create the missing znode and handle the
> update.