[
https://issues.apache.org/jira/browse/YARN-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659755#comment-16659755
]
Chandni Singh commented on YARN-8672:
-------------------------------------
The test {{testLocalizingResourceWhileContainerRunning}} tries to localize two
files in sequence for the same container:
# file
# file2
Localization of file2 is requested before localization of the first file
completes.
{{LocalizerRunner}}, after writing credentials and localizing, tries to
delete the tokens file.
Below is the sequence of events that causes the error:
* Localization starts for the first file
{code:java}
2018-10-22 13:27:24,249 DEBUG [NM ContainerManager dispatcher]
localizer.ResourceLocalizationService
(ResourceLocalizationService.java:handleInitContainerResources(518)) -
Localizing
file:/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-tmpDir/dir/file
for container container_0_0000_01_000000
{code}
* A {{LocalizerRunner}} is created for container_0_0000_01_000000. Let's call
this LR1.
* LR1 writes credentials for the first file:
{code:java}
2018-10-22 13:27:24,316 INFO [LocalizerRunner for container_0_0000_01_000000]
localizer.ResourceLocalizationService
(ResourceLocalizationService.java:writeCredentials(1328)) - Writing credentials
to the nmPrivate file
/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
{code}
* LR1 finishes localizing {{file}}, but the container tokens file is not yet deleted.
* Meanwhile, localization of file2 is requested.
{code:java}
2018-10-22 13:27:25,273 DEBUG [NM ContainerManager dispatcher]
localizer.ResourceLocalizationService
(ResourceLocalizationService.java:handleInitContainerResources(518)) -
Localizing
file:/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-tmpDir/dir/file2
for container container_0_0000_01_000000
{code}
* Before a second localizer runner (let's call it LR2) is created for
{{container_0_0000_01_000000}}, LR1 is removed:
{code:java}
2018-10-22 13:27:25,273 INFO [NM ContainerManager dispatcher]
localizer.ResourceLocalizationService
(ResourceLocalizationService.java:handle(792)) - New
REQUEST_RESOURCE_LOCALIZATION localize request for container_0_0000_01_000000,
remove old private localizer.
{code}
* LR2 starts writing credentials before localizing {{file2}}.
* While LR2 is writing credentials for file2, LR1 deletes
{{container_0_0000_01_000000.tokens}}:
{code:java}
[LocalizerRunner for container_0_0000_01_000000] nodemanager.DeletionService
(DeletionService.java:delete(91)) - Scheduling DeletionTask (delay 0) :
FileDeletionTask : id : -1 user : null subDir :
/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
baseDir : null
2018-10-22 13:27:25,274 DEBUG [DeletionService #2] task.DeletionTask
(FileDeletionTask.java:run(100)) - Running DeletionTask : FileDeletionTask :
id : -1 user : null subDir :
/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
baseDir : null
2018-10-22 13:27:25,275 DEBUG [DeletionService #2] task.DeletionTask
(FileDeletionTask.java:run(106)) - NM deleting absolute path :
/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
{code}
* This causes localization of file2 to fail.
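The interleaving above can be replayed deterministically with plain file operations. This is only a simplified sketch, not NodeManager code: the class {{TokenFileRaceReplay}} and its {{replay}} method are hypothetical names, a temp directory stands in for nmPrivate, and the steps run sequentially in one thread instead of in two localizer threads. It relies on the one fact visible in the logs, that both runners use the same tokens path for the container.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class TokenFileRaceReplay {

  /**
   * Replays the problematic ordering in a single thread.
   * Returns true if the tokens file still exists when LR2 needs it.
   */
  public static boolean replay() {
    try {
      // Temp directory standing in for the nmPrivate area (assumption).
      Path nmPrivate = Files.createTempDirectory("nmPrivate");
      Path tokens = nmPrivate.resolve("container_0_0000_01_000000.tokens");

      // LR1 writes credentials and localizes the first file.
      Files.write(tokens, "LR1 credentials".getBytes(StandardCharsets.UTF_8));

      // LR1 is removed and interrupted; LR2 writes credentials to the same path.
      Files.write(tokens, "LR2 credentials".getBytes(StandardCharsets.UTF_8));

      // LR1's pending cleanup now deletes the shared tokens file.
      Files.deleteIfExists(tokens);

      // LR2 goes on to use the tokens file, which is no longer there.
      boolean tokensAvailable = Files.exists(tokens);

      Files.deleteIfExists(nmPrivate); // tidy up the temp directory
      return tokensAvailable;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("tokens available for LR2: " + replay());
  }
}
```

In the real test the outcome depends on thread timing, which would explain why the failure is only occasional.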
I think the bug is in this method:
{code:java}
public void cleanupPrivLocalizers(String locId) {
  synchronized (privLocalizers) {
    LocalizerRunner localizer = privLocalizers.get(locId);
    if (null == localizer) {
      return; // ignore; already gone
    }
    privLocalizers.remove(locId);
    localizer.interrupt();
  }
}
{code}
The code assumes that {{localizer.interrupt}} stops the localizer immediately.
Instead, I think we need to wait here for the previous localizer to stop
before continuing.
cc. [~jlowe]
> TestContainerManager#testLocalingResourceWhileContainerRunning occasionally
> times out
> -------------------------------------------------------------------------------------
>
> Key: YARN-8672
> URL: https://issues.apache.org/jira/browse/YARN-8672
> Project: Hadoop YARN
> Issue Type: Bug
> Components: nodemanager
> Affects Versions: 3.2.0
> Reporter: Jason Lowe
> Assignee: Chandni Singh
> Priority: Major
>
> Precommit builds have been failing in
> TestContainerManager#testLocalingResourceWhileContainerRunning. I have been
> able to reproduce the problem without any patch applied if I run the test
> enough times. It looks like something is removing container tokens from the
> nmPrivate area just as a new localizer starts.