[ https://issues.apache.org/jira/browse/YARN-3389?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jun Gong updated YARN-3389:
---------------------------
    Description:
In AttemptFailedTransition, the new attempt gets references to the states 'justFinishedContainers' and 'finishedContainersSentToAM' from the failed attempt, so these attempts share the two states (previous attempts share them as well). Suppose two or more CONTAINER_FINISHED events for different attempts are handled at the same time, and the containers ran on the same node. The attempts will then concurrently update the value stored under the same key of 'justFinishedContainers'. Although 'justFinishedContainers' is a ConcurrentHashMap, operations on its value, a 'List<ContainerStatus>', are not atomic. Namely,
{code}appAttempt.justFinishedContainers.get(containerFinishedEvent.getNodeId()).add(containerFinishedEvent.getContainerStatus()){code}
is not atomic.

  (was: In AttemptFailedTransition, the new attempt will get state ('justFinishedContainers' and 'finishedContainersSentToAM') reference from the failed attempt. Then the two attempts might operate on these two variables concurrently, e.g. they might update 'justFinishedContainers' concurrently when they are both handling CONTAINER_FINISHED event.)

> Avoid race conditions when attempts operate on shared states concurrently
> -------------------------------------------------------------------------
>
>                 Key: YARN-3389
>                 URL: https://issues.apache.org/jira/browse/YARN-3389
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-3389.01.patch
>
>
> In AttemptFailedTransition, the new attempt gets references to the states
> 'justFinishedContainers' and 'finishedContainersSentToAM' from the failed
> attempt, so these attempts share the two states (previous attempts share
> them as well). Suppose two or more CONTAINER_FINISHED events for different
> attempts are handled at the same time, and the containers ran on the same
> node. The attempts will then concurrently update the value stored under the
> same key of 'justFinishedContainers'.
> Although 'justFinishedContainers' is a ConcurrentHashMap, operations on its
> value, a 'List<ContainerStatus>', are not atomic. Namely,
> {code}appAttempt.justFinishedContainers.get(containerFinishedEvent.getNodeId()).add(containerFinishedEvent.getContainerStatus()){code}
> is not atomic.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
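The non-atomic update described in this issue can be illustrated with a minimal, self-contained Java sketch. It is not the attached patch and does not use the ResourceManager classes. The class SharedListRaceSketch, the methods addUnsafe, addAtomically and runConcurrentAdds, and the use of String in place of NodeId and ContainerStatus are all hypothetical names chosen for illustration. It also assumes Java 8 or later for ConcurrentHashMap.compute(), which may not match the Java version targeted by the affected release.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

final class SharedListRaceSketch {
    // Hypothetical stand-in for justFinishedContainers: node id -> container statuses.
    static final ConcurrentHashMap<String, List<String>> justFinishedContainers =
            new ConcurrentHashMap<>();

    // Mirrors the pattern from the issue: the map lookup is thread-safe, but
    // add() mutates a plain ArrayList with no synchronization, so concurrent
    // callers may lose updates or corrupt the list. Shown for contrast only.
    static void addUnsafe(String nodeId, String status) {
        justFinishedContainers.putIfAbsent(nodeId, new ArrayList<>());
        justFinishedContainers.get(nodeId).add(status);
    }

    // One possible fix (an assumption, not necessarily what the patch does):
    // ConcurrentHashMap.compute() runs the remapping function atomically per
    // key, so the list is only ever mutated by one thread at a time.
    static void addAtomically(String nodeId, String status) {
        justFinishedContainers.compute(nodeId, (key, statuses) -> {
            List<String> updated = (statuses == null) ? new ArrayList<>() : statuses;
            updated.add(status);
            return updated;
        });
    }

    // Simulates several attempts handling CONTAINER_FINISHED events for the
    // same node at once; returns how many statuses were recorded for it.
    static int runConcurrentAdds(int threads, int addsPerThread) {
        justFinishedContainers.clear();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int t = 0; t < threads; t++) {
            final int attempt = t;
            pool.execute(() -> {
                for (int i = 0; i < addsPerThread; i++) {
                    addAtomically("node-1", "attempt" + attempt + "-container" + i);
                }
            });
        }
        pool.shutdown();
        try {
            if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        List<String> statuses = justFinishedContainers.get("node-1");
        return statuses == null ? 0 : statuses.size();
    }

    public static void main(String[] args) {
        // With the atomic variant, every add is recorded.
        System.out.println(runConcurrentAdds(4, 1000));
    }
}
```

Swapping addAtomically for addUnsafe in runConcurrentAdds may produce a smaller count or an exception, though a race like this does not fail on every run. Other ways to remove the race would be to wrap each list in a synchronized collection or to give each attempt its own copy of the state instead of a shared reference. This notification does not say which approach the attached patch takes.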