[ 
https://issues.apache.org/jira/browse/YARN-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16659755#comment-16659755
 ] 

Chandni Singh commented on YARN-8672:
-------------------------------------

The test {{testLocalizingResourceWhileContainerRunning}} tries to localize 2 
files in sequence for the same container:
 # file
 # file2
 Localization of file2 is requested before localization of the first file 
completes.

After writing credentials and localizing the resource, {{LocalizerRunner}} 
tries to delete the tokens file.

Below is the sequence of events that causes the error:
 * Localization starts for the first file
{code:java}
2018-10-22 13:27:24,249 DEBUG [NM ContainerManager dispatcher] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:handleInitContainerResources(518)) - 
Localizing 
file:/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-tmpDir/dir/file
 for container container_0_0000_01_000000
{code}

 * A {{LocalizerRunner}} is created for container_0_0000_01_000000. Let's call 
this LR1.
 * The LR1 writes credentials for the first file
{code:java}
 2018-10-22 13:27:24,316 INFO  [LocalizerRunner for container_0_0000_01_000000] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:writeCredentials(1328)) - Writing credentials 
to the nmPrivate file 
/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
{code}

 * LR1 finishes localizing {{file}}, but the container tokens file is not yet 
deleted.

 * Meanwhile, localization of file2 is requested.
{code:java}
2018-10-22 13:27:25,273 DEBUG [NM ContainerManager dispatcher] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:handleInitContainerResources(518)) - 
Localizing 
file:/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-tmpDir/dir/file2
 for container container_0_0000_01_000000
{code}

 * Before a second localizer runner is created for 
{{container_0_0000_01_000000}}, LR1 is removed
{code:java}
2018-10-22 13:27:25,273 INFO  [NM ContainerManager dispatcher] 
localizer.ResourceLocalizationService 
(ResourceLocalizationService.java:handle(792)) - New 
REQUEST_RESOURCE_LOCALIZATION localize request for container_0_0000_01_000000, 
remove old private localizer.
{code}

 * A second {{LocalizerRunner}}, LR2, is created and starts writing credentials 
before localizing {{file2}}.

 * While LR2 is writing credentials for file2, LR1 deletes 
container_0_0000_01_000000.tokens
{code:java}
 [LocalizerRunner for container_0_0000_01_000000] nodemanager.DeletionService 
(DeletionService.java:delete(91)) - Scheduling DeletionTask (delay 0) : 
FileDeletionTask :  id : -1  user : null  subDir : 
/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
  baseDir : null

2018-10-22 13:27:25,274 DEBUG [DeletionService #2] task.DeletionTask 
(FileDeletionTask.java:run(100)) - Running DeletionTask : FileDeletionTask :  
id : -1  user : null  subDir : 
/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
  baseDir : null

2018-10-22 13:27:25,275 DEBUG [DeletionService #2] task.DeletionTask 
(FileDeletionTask.java:run(106)) - NM deleting absolute path : 
/Users/cnisingh/workspace/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/target/TestContainerManager-localDir/nmPrivate/container_0_0000_01_000000.tokens
{code}

 * This causes the localization of file2 to fail.
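
The ordering above can be replayed in isolation. The following is only a sketch, not NodeManager code: the class and method names are hypothetical, a temp directory stands in for nmPrivate, and the steps run on a single thread so that the interleaving is fixed. It assumes, as the paths in the logs suggest, that both runners derive the same tokens path from the container id.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Illustrative only: replays the interleaving from the logs on one thread.
final class TokenFileRaceSketch {

  /** Returns true if the tokens file still exists when "LR2" needs it. */
  static boolean replay() {
    try {
      // Hypothetical stand-in for the nmPrivate directory.
      Path dir = Files.createTempDirectory("nmPrivate-sketch");
      // Both runners use the same path, keyed by container id.
      Path tokens = dir.resolve("container_0_0000_01_000000.tokens");

      // LR1 writes credentials, then localizes the first file.
      Files.write(tokens, "credentials for file".getBytes(StandardCharsets.UTF_8));

      // LR2 is created for file2 and writes credentials to the same path.
      Files.write(tokens, "credentials for file2".getBytes(StandardCharsets.UTF_8));

      // LR1's pending cleanup now runs and deletes the file LR2 just wrote.
      Files.deleteIfExists(tokens);

      boolean stillThere = Files.exists(tokens);
      Files.deleteIfExists(dir); // tidy up the (now empty) temp directory
      return stillThere;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("tokens file present for LR2: " + replay());
  }
}
```

In the real test the delete is scheduled on a separate thread, so whether LR2 loses its tokens file depends on timing, which would explain why the failure is intermittent.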

I think the bug is in this method:
{code:java}
    public void cleanupPrivLocalizers(String locId) {
      synchronized (privLocalizers) {
        LocalizerRunner localizer = privLocalizers.get(locId);
        if (null == localizer) {
          return; // ignore; already gone
        }
        privLocalizers.remove(locId);
        localizer.interrupt();
      }
    }
{code}
It is assumed that {{localizer.interrupt}} will stop the localizer 
immediately, but the interrupted localizer can still go on to delete the 
tokens file. Instead, I think we need to wait here for the previous localizer 
to stop before continuing.
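
A minimal sketch of the "wait for the previous localizer" idea, under some assumptions: the runner is a plain {{Thread}} (the call to {{interrupt}} suggests this), the class name, {{register}} and {{demo}} are hypothetical helpers for illustration, and the simulated runner does its cleanup in a {{finally}} block. This is not a proposed patch. Whether blocking the caller is acceptable, or whether a bounded {{join(timeout)}} is needed, would have to be decided against the real dispatcher.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative only: interrupt the old runner, then wait for it to exit.
final class CleanupAndWaitSketch {
  private final Map<String, Thread> privLocalizers = new HashMap<>();

  void register(String locId, Thread runner) {
    synchronized (privLocalizers) {
      privLocalizers.put(locId, runner);
    }
  }

  void cleanupPrivLocalizers(String locId) throws InterruptedException {
    Thread localizer;
    synchronized (privLocalizers) {
      localizer = privLocalizers.remove(locId);
    }
    if (localizer == null) {
      return; // ignore; already gone
    }
    localizer.interrupt();
    // Wait outside the lock so other callers are not blocked on the map.
    localizer.join();
  }

  /** Returns true if the old runner's cleanup finished before cleanup returned. */
  static boolean demo() {
    AtomicBoolean cleanupDone = new AtomicBoolean(false);
    Thread lr1 = new Thread(() -> {
      try {
        Thread.sleep(60_000); // stands in for localization work
      } catch (InterruptedException e) {
        // interrupted by cleanupPrivLocalizers; fall through to cleanup
      } finally {
        try {
          Thread.sleep(50); // stands in for deleting the tokens file
        } catch (InterruptedException ignored) {
          // not expected in this sketch
        }
        cleanupDone.set(true);
      }
    }, "LR1");

    CleanupAndWaitSketch registry = new CleanupAndWaitSketch();
    registry.register("container_0_0000_01_000000", lr1);
    lr1.start();
    try {
      registry.cleanupPrivLocalizers("container_0_0000_01_000000");
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    // join() has returned, so LR1's cleanup cannot race with a new runner.
    return cleanupDone.get();
  }

  public static void main(String[] args) {
    System.out.println("old runner finished cleanup first: " + demo());
  }
}
```

With only {{interrupt}} and no {{join}}, {{demo}} could return while LR1 is still in its cleanup, which is the window in which a new runner's tokens file gets deleted.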

cc. [~jlowe]

> TestContainerManager#testLocalingResourceWhileContainerRunning occasionally 
> times out
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-8672
>                 URL: https://issues.apache.org/jira/browse/YARN-8672
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.2.0
>            Reporter: Jason Lowe
>            Assignee: Chandni Singh
>            Priority: Major
>
> Precommit builds have been failing in 
> TestContainerManager#testLocalingResourceWhileContainerRunning.  I have been 
> able to reproduce the problem without any patch applied if I run the test 
> enough times.  It looks like something is removing container tokens from the 
> nmPrivate area just as a new localizer starts.



