[ 
https://issues.apache.org/jira/browse/YARN-8672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16664229#comment-16664229
 ] 

Jason Lowe commented on YARN-8672:
----------------------------------

Thanks for updating the patch!  I'm still skeptical this is going to work well
in practice for some corner cases.  For example, what if FileDeletionService
has been configured with a delay?  Deletes would be significantly delayed, and
the delete of the old tokens file could then fire just as a new localizer is
creating its tokens file.
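
To make the ordering concrete, here is a minimal, deterministic sketch of
that race.  It is an illustration only, not NodeManager code: the class name,
the method name, and the in-memory queue standing in for a delayed deletion
service are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical illustration only; not NodeManager code. */
final class DelayedDeleteRace {
    private DelayedDeleteRace() {}

    /**
     * Replays the problematic ordering on one shared path and reports whether
     * the second localizer's tokens file is still present at the end.
     */
    static boolean secondLocalizerStillHasTokens(Path sharedTokens) throws IOException {
        // Stand-in for a deletion service configured with a delay (assumption).
        Queue<Path> pendingDeletes = new ArrayDeque<>();

        // First localizer writes its tokens and finishes; cleanup is queued, not run.
        Files.write(sharedTokens, new byte[] {1});
        pendingDeletes.add(sharedTokens);

        // A new localizer starts and writes fresh tokens to the same path.
        Files.write(sharedTokens, new byte[] {2});

        // The delay expires and the queued delete finally runs.
        for (Path p : pendingDeletes) {
            Files.deleteIfExists(p);
        }
        return Files.exists(sharedTokens);
    }
}
```

With a shared path the method returns false: the late delete removes the
tokens the second localizer just wrote.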

A couple of potential fixes:
# Have localizers use token files private to that localizer instance.  Then
each localizer is responsible for reaping its own tokens file without any
concern about deleting it just as a new localizer spins up to use it.  It
seems we will always have a race if localizers share the tokens file, given
we cannot control the time between when we request a delete and when it
actually happens.
# Never delete the tokens file until the container completes.  This could be
a problem if the tokens file needs to differ between localizers (e.g., the
container's credentials were updated since the first localizer started).

My preference would be to separate the token files.
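
A rough sketch of what per-localizer token files could look like, under
stated assumptions: the class, the method names, and the
"<localizerId>.tokens" naming scheme below are hypothetical and only
illustrate the idea that each localizer instance owns and reaps its own file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch; not NodeManager code. */
final class PrivateTokenFiles {
    private PrivateTokenFiles() {}

    /** Assumed naming scheme: one tokens file per localizer instance. */
    static Path tokenPathFor(Path nmPrivateDir, String localizerId) {
        return nmPrivateDir.resolve(localizerId + ".tokens");
    }

    /** Writes the credentials to the file owned by this localizer only. */
    static Path writeTokens(Path nmPrivateDir, String localizerId, byte[] credentials)
            throws IOException {
        Path tokens = tokenPathFor(nmPrivateDir, localizerId);
        Files.write(tokens, credentials);
        return tokens;
    }

    /**
     * Removes only this localizer's file.  Returns true if a file was deleted.
     * Running this late cannot affect another localizer's tokens.
     */
    static boolean reap(Path nmPrivateDir, String localizerId) throws IOException {
        return Files.deleteIfExists(tokenPathFor(nmPrivateDir, localizerId));
    }
}
```

Because every localizer instance gets a distinct path, it no longer matters
how long the delete for an earlier localizer is delayed; it can only ever
remove that earlier localizer's file.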


> TestContainerManager#testLocalingResourceWhileContainerRunning occasionally 
> times out
> -------------------------------------------------------------------------------------
>
>                 Key: YARN-8672
>                 URL: https://issues.apache.org/jira/browse/YARN-8672
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 3.2.0
>            Reporter: Jason Lowe
>            Assignee: Chandni Singh
>            Priority: Major
>         Attachments: YARN-8672.001.patch, YARN-8672.002.patch, 
> YARN-8672.003.patch, YARN-8672.004.patch
>
>
> Precommit builds have been failing in 
> TestContainerManager#testLocalingResourceWhileContainerRunning.  I have been 
> able to reproduce the problem without any patch applied if I run the test 
> enough times.  It looks like something is removing container tokens from the 
> nmPrivate area just as a new localizer starts.



