[ https://issues.apache.org/jira/browse/YARN-1879?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972318#comment-13972318 ]
Tsuyoshi OZAWA commented on YARN-1879:
--------------------------------------

[~xgong], as you pointed out, RetryCache doesn't fully ensure AtMostOnce semantics. If the active RM fails, we cannot reconstruct the RetryCache because ZK doesn't hold enough information to rebuild it. However, RetryCache does ensure AtMostOnce semantics in limited situations: a temporary network failure or unstable network communication between the client (e.g. AM) and the server (e.g. RM), or a client-side timer failure. Those are the failures I targeted. HDFS-4942 describes the details:
{quote}
In current HA mechanism with FailoverProxyProvider and non HA setups with RetryProxy retry a request from the RPC layer. If the retried request has already been processed at the namenode, the subsequent attempts fail for non-idempotent operations such as create, append, delete, rename etc. This will cause application failures during HA failover, network issues etc.
{quote}
ApplicationMasterProtocol also uses RMProxy, so the same problem can occur.

> Mark Idempotent/AtMostOnce annotations to ApplicationMasterProtocol
> -------------------------------------------------------------------
>
>                 Key: YARN-1879
>                 URL: https://issues.apache.org/jira/browse/YARN-1879
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Jian He
>            Assignee: Tsuyoshi OZAWA
>            Priority: Critical
>         Attachments: YARN-1879.1.patch, YARN-1879.1.patch, YARN-1879.2-wip.patch, YARN-1879.2.patch, YARN-1879.3.patch, YARN-1879.4.patch, YARN-1879.5.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.2#6252)