[ 
https://issues.apache.org/jira/browse/YARN-1618?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13880745#comment-13880745
 ] 

Bikas Saha commented on YARN-1618:
----------------------------------

The app goes from NEW->NEW_SAVING upon receiving START. It goes from 
NEW_SAVING->SUBMITTED after the app is saved. It goes from 
NEW_SAVING->FINAL_SAVING if killed while saving. All of these work fine. 
Given the above transitions, the app should go from NEW->KILLED if it is 
killed before receiving the START event. The START event should be ignored in 
the KILLED state (currently it is not ignored). So if START comes after KILL 
then it's a no-op. If START comes before KILL then the state store is fine, 
since the app will first be saved and then updated.
It's interesting that we caught the race such that KILL came before START. The 
START should come almost immediately after the RMAppImpl object is created in 
the NEW state during regular app submission. Karthik, are we sure that this 
happened? This should not happen during recovery, since the RMAppImpl moves 
from NEW->NEXT_STATE after receiving the RECOVER event. RPC servers should not 
be running during recovery. Vinod, is it still the case that RPC servers are 
started after recovery is complete?
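
To make the proposed ordering concrete, here is a minimal sketch of the 
transitions discussed above. It is not the RMAppImpl code: the class 
AppStateSketch, its enums, and the next/run helpers are hypothetical names 
used only for illustration, and only the transitions named in this comment 
are modeled.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch of the transitions discussed in this comment.
// This is NOT the actual RMAppImpl state machine.
final class AppStateSketch {
    enum State { NEW, NEW_SAVING, SUBMITTED, FINAL_SAVING, KILLED }
    enum Event { START, APP_SAVED, KILL }

    private static final Map<State, Map<Event, State>> TRANSITIONS =
        new EnumMap<State, Map<Event, State>>(State.class);

    static {
        // Transitions described as working today.
        add(State.NEW, Event.START, State.NEW_SAVING);
        add(State.NEW_SAVING, Event.APP_SAVED, State.SUBMITTED);
        add(State.NEW_SAVING, Event.KILL, State.FINAL_SAVING);
        // Proposed: nothing has been saved yet, so skip FINAL_SAVING.
        add(State.NEW, Event.KILL, State.KILLED);
        // Proposed: a START that arrives after KILL is a no-op.
        add(State.KILLED, Event.START, State.KILLED);
    }

    private static void add(State from, Event on, State to) {
        TRANSITIONS
            .computeIfAbsent(from, k -> new EnumMap<Event, State>(Event.class))
            .put(on, to);
    }

    // Returns the next state, or throws if the pair is not modeled here.
    static State next(State current, Event event) {
        Map<Event, State> row = TRANSITIONS.get(current);
        State target = (row == null) ? null : row.get(event);
        if (target == null) {
            throw new IllegalStateException(
                "No transition modeled for " + event + " in state " + current);
        }
        return target;
    }

    // Applies the events in order, starting from NEW.
    static State run(Event... events) {
        State state = State.NEW;
        for (Event event : events) {
            state = next(state, event);
        }
        return state;
    }
}
```

With this table, run(Event.KILL, Event.START) ends in KILLED without ever 
touching the store, while run(Event.START, Event.KILL) passes through 
NEW_SAVING first, so the entry exists before any update is attempted.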

> Applications transition from NEW to FINAL_SAVING, and try to update 
> non-existing entries in the state-store
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1618
>                 URL: https://issues.apache.org/jira/browse/YARN-1618
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>    Affects Versions: 2.2.0
>            Reporter: Karthik Kambatla
>            Assignee: Karthik Kambatla
>            Priority: Blocker
>         Attachments: yarn-1618-1.patch
>
>
> YARN-891 augments the RMStateStore to store information on completed 
> applications. In the process, it adds transitions from NEW to FINAL_SAVING. 
> This leads to the RM trying to update entries in the state-store that do not 
> exist. On ZKRMStateStore, this leads to the RM crashing. 
> Previous description:
> ZKRMStateStore fails to handle updates to znodes that don't exist. For 
> instance, this can happen when an app transitions from NEW to FINAL_SAVING. 
> In these cases, the store should create the missing znode and handle the 
> update.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)
