[ https://issues.apache.org/jira/browse/YARN-3464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vinod Kumar Vavilapalli updated YARN-3464:
------------------------------------------
    Fix Version/s: 2.6.1

Pulled this into 2.6.1. The patch had 3 merge conflicts, which I fixed.

Ran compilation and TestResourceLocalizationService before the push.

> Race condition in LocalizerRunner kills localizer before localizing all resources
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-3464
>                 URL: https://issues.apache.org/jira/browse/YARN-3464
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>              Labels: 2.6.1-candidate
>             Fix For: 2.6.1, 2.7.1
>
>         Attachments: YARN-3464.000.patch, YARN-3464.001.patch
>
>
> A race condition in LocalizerRunner causes container localization to time out.
> Currently, LocalizerRunner kills the ContainerLocalizer when the pending list
> of LocalizerResourceRequestEvents is empty.
> {code}
>   } else if (pending.isEmpty()) {
>     action = LocalizerAction.DIE;
>   }
> {code}
> If a LocalizerResourceRequestEvent is added after LocalizerRunner kills the
> ContainerLocalizer because of the empty pending list, that
> LocalizerResourceRequestEvent will never be handled.
> Without a ContainerLocalizer, LocalizerRunner#update will never be called.
> The container will stay in the LOCALIZING state until it is killed by the
> AM due to TASK_TIMEOUT.
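
For readers who want to see the ordering problem in isolation, here is a minimal sketch of the sequence described in the issue. It is not the NodeManager code: RunnerSketch, addResource, strandedAfterLateRequest and the resource names are hypothetical stand-ins. Only the "pending is empty, so DIE" branch mirrors the quoted snippet. The sketch replays the interleaving in a fixed order rather than using real threads, so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

public class LocalizerRaceSketch {
    enum LocalizerAction { LIVE, DIE }

    /** Hypothetical stand-in for LocalizerRunner; not the real NodeManager class. */
    static class RunnerSketch {
        private final List<String> pending = new ArrayList<>();
        private boolean localizerAlive = true;

        /** Models a LocalizerResourceRequestEvent arriving for this runner. */
        synchronized void addResource(String resource) {
            pending.add(resource);
        }

        /** Models one localizer heartbeat; the empty-list branch mirrors the quoted code. */
        synchronized LocalizerAction update() {
            if (!localizerAlive) {
                throw new IllegalStateException("no localizer left to send a heartbeat");
            }
            if (pending.isEmpty()) {
                localizerAlive = false;   // the localizer is told to exit
                return LocalizerAction.DIE;
            }
            pending.remove(0);            // hand one resource to the localizer
            return LocalizerAction.LIVE;
        }

        synchronized boolean isLocalizerAlive() {
            return localizerAlive;
        }

        synchronized int pendingCount() {
            return pending.size();
        }
    }

    /** Replays the problematic ordering and returns how many requests are stranded. */
    static int strandedAfterLateRequest() {
        RunnerSketch runner = new RunnerSketch();
        runner.addResource("resource-1");
        runner.update();                  // LIVE: resource-1 is handed out
        runner.update();                  // DIE: pending list is empty at this instant
        runner.addResource("resource-2"); // arrives after DIE was issued
        // With no live localizer, update() is never called again,
        // so whatever is still pending can never be handed out.
        return runner.isLocalizerAlive() ? 0 : runner.pendingCount();
    }

    public static void main(String[] args) {
        System.out.println("stranded requests: " + strandedAfterLateRequest());
    }
}
```

In this model the late request stays in the pending list with nothing left to drain it, which corresponds to the container remaining in LOCALIZING until the timeout.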