[ 
https://issues.apache.org/jira/browse/YARN-1365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14031248#comment-14031248
 ] 

Jian He commented on YARN-1365:
-------------------------------

Thanks for updating the patch. 

The debug logging can be wrapped with isDebugEnabled condition
{code}
LOG.debug("Skipping notifying ATTEMPT_ADDED");
{code}

The following code is removed, but schedulers#addApplication are not handling 
the case to not send app_accepted events as we do for addApplicationAttempt.
My point was we can do the same for both addApplication and 
addApplicationAttempt to not send dup events. Given this is not relevant to 
this patch itself, we can fix this separately if needed.
{code}
    // ACCECPTED state can once again receive APP_ACCEPTED event, because on
    // recovery the app returns ACCEPTED state and the app once again go
    // through the scheduler and triggers one more APP_ACCEPTED event at
    // ACCEPTED state.
    .addTransition(RMAppState.ACCEPTE
{code}

This transition can never happen ? given that unregister also has to do resync.
{code}
.addTransition(RMAppAttemptState.LAUNCHED,
          EnumSet.of(RMAppAttemptState.FINAL_SAVING, 
RMAppAttemptState.FINISHED),
          RMAppAttemptEventType.UNREGISTERED, new AMUnregisteredTransition())
{code}

This piece of code is not needed, the previous launchAM internally checks the 
app state already. We can use MockRM.launchAndRegisterAM alternatively. The 
test case can be moved to TestWorkPreservingRMRestart
{code}
    nm1.nodeHeartbeat(am0.getApplicationAttemptId(), 1, ContainerState.RUNNING);
    am0.waitForState(RMAppAttemptState.RUNNING);
    rm1.waitForState(app0.getApplicationId(), RMAppState.RUNNING);
{code}


*Just thinking*:
Does it make sense to map AMCommand(shutdown, resync) to corresponding 
exceptions? The benefit is that we don’t need to add extra fields in AMS 
protocol response and user not using AMRMClient will be forced to handle such 
condition to work with RM restart. thoughts? 

> ApplicationMasterService to allow Register and Unregister of an app that was 
> running before restart
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-1365
>                 URL: https://issues.apache.org/jira/browse/YARN-1365
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Anubhav Dhoot
>         Attachments: YARN-1365.001.patch, YARN-1365.002.patch, 
> YARN-1365.003.patch, YARN-1365.004.patch, YARN-1365.005.patch, 
> YARN-1365.005.patch, YARN-1365.initial.patch
>
>
> For an application that was running before restart, the 
> ApplicationMasterService currently throws an exception when the app tries to 
> make the initial register or final unregister call. These should succeed and 
> the RMApp state machine should transition to completed like normal. 
> Unregistration should succeed for an app that the RM considers complete since 
> the RM may have died after saving completion in the store but before 
> notifying the AM that the AM is free to exit.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

Reply via email to