Bikas Saha commented on YARN-1366:

bq. It seems we have a race in which an allocate call gets the resync and 
re-registers even after finishApplicationMaster has been called. I checked the 
MR code and this cannot happen there, because the allocate thread is 
interrupted and joined before calling unregister. Should we document in the API 
that allocate must not be called after finishApplicationMaster, or handle it 
explicitly in the RM?
If the AMRMClientAsync is not doing this then we should fix it.

bq. There's a response map in AMS to differentiate the attempts; I think this 
should work already.
That is for the running RM, right? How does the restarted RM do it? 
Currently, the absence of an entry for that AM in the responseMap is what 
causes the RM to ask the AM to resync.
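
To make the concern concrete, here is a minimal sketch of that check. It is a simplified model, not the actual ApplicationMasterService code: the class name, the String attempt ids, and the Outcome enum are all assumptions made for illustration. It shows that a restarted RM starts with an empty map, so every previously registered AM looks the same to it and gets a resync.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified model only; names and types are hypothetical, not Hadoop's.
class ResponseMapSketch {
    enum Outcome { OK, RESYNC }

    // Stand-in for the AMS response map: attempt id -> last response id.
    private final Map<String, Integer> responseMap = new ConcurrentHashMap<>();

    // Registration creates the entry for this attempt.
    void register(String attemptId) {
        responseMap.put(attemptId, 0);
    }

    // An allocate call from an attempt with no entry is answered with RESYNC.
    Outcome allocate(String attemptId) {
        Integer last = responseMap.get(attemptId);
        if (last == null) {
            return Outcome.RESYNC;
        }
        responseMap.put(attemptId, last + 1);
        return Outcome.OK;
    }
}
```

In this model a fresh ResponseMapSketch plays the role of the restarted RM: calling allocate on it for an attempt that registered with the previous instance returns RESYNC, because nothing in the empty map distinguishes one old attempt from another.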

> ApplicationMasterService should Resync with the AM upon allocate call after 
> restart
> -----------------------------------------------------------------------------------
>                 Key: YARN-1366
>                 URL: https://issues.apache.org/jira/browse/YARN-1366
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Bikas Saha
>            Assignee: Rohith
>         Attachments: YARN-1366.1.patch, YARN-1366.2.patch, YARN-1366.patch, 
> YARN-1366.prototype.patch, YARN-1366.prototype.patch
> The ApplicationMasterService currently sends a resync response, to which the 
> AM responds by shutting down. The AM behavior is expected to change to 
> resyncing with the RM. Resync means resetting the allocate RPC sequence 
> number to 0, and the AM should send its entire set of outstanding requests to 
> the RM. Note that if the AM is making its first allocate call to the RM, then 
> things should proceed as normal without needing a resync. The RM will 
> return all containers that have completed since the RM last synced with the 
> AM. Some container completions may be reported more than once.
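
The AM-side behavior described above could be sketched roughly as follows. This is a simplified model under stated assumptions, not the AMRMClient implementation: the class, its methods, and the use of plain strings for requests and container ids are hypothetical. It assumes a normal heartbeat sends only requests added since the previous call, while a resync resets the sequence number to 0 and resends everything outstanding. It also ignores repeated completion reports, since the description allows them.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified model of the AM side; all names here are hypothetical.
class AmResyncSketch {
    private int responseId = 0;                                  // allocate RPC sequence number
    private final List<String> outstanding = new ArrayList<>();  // every request not yet satisfied
    private final List<String> unsent = new ArrayList<>();       // requests added since the last call
    private final Set<String> seenCompleted = new HashSet<>();   // completed container ids already handled

    void addRequest(String request) {
        outstanding.add(request);
        unsent.add(request);
    }

    // Normal heartbeat (assumed): send only the new requests and advance the sequence number.
    List<String> heartbeat() {
        List<String> ask = new ArrayList<>(unsent);
        unsent.clear();
        responseId++;
        return ask;
    }

    // Resync: reset the sequence number to 0 and resend the full outstanding set.
    // A real AM would also drop requests from the set once they are satisfied.
    List<String> resync() {
        responseId = 0;
        unsent.clear();
        return new ArrayList<>(outstanding);
    }

    // Returns true only the first time a completion is seen, so duplicates can be skipped.
    boolean recordCompletion(String containerId) {
        return seenCompleted.add(containerId);
    }

    int responseId() {
        return responseId;
    }
}
```

For example, after two heartbeats that each carried one new request, resync() returns both requests and responseId() is back at 0; a second recordCompletion call with the same container id returns false.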

This message was sent by Atlassian JIRA
