Jason Lowe updated YARN-1338:

    Attachment: YARN-1338v6.patch

Thanks for the additional comments, Junping.

bq. Do we have any code to destroy DB items for NMState when NM is 
decommissioned (not expecting short-term restart)?

Good point.  I added shutdown code that removes the recovery directory if the 
shutdown is due to a decommission.  I also added a unit test for this scenario.
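
A minimal sketch of the decommission cleanup described above. It is not the code in the patch: the class name, the method names, and the boolean decommission flag are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical illustration only; not taken from YARN-1338v6.patch.
public class RecoveryDirCleanupSketch {

    // Removes the recovery directory only when the shutdown is a decommission.
    // Returns true if the directory was removed.
    static boolean cleanupOnShutdown(Path recoveryDir, boolean decommissioned) {
        if (!decommissioned || !Files.exists(recoveryDir)) {
            return false; // keep the state so a restarted NM can recover it
        }
        try (Stream<Path> paths = Files.walk(recoveryDir)) {
            // Reverse order visits children before their parent directory.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.delete(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return true;
    }

    // Creates a throwaway recovery directory holding one placeholder state file.
    static Path createSampleRecoveryDir() {
        try {
            Path dir = Files.createTempDirectory("nm-recovery-sketch");
            Files.write(dir.resolve("state.db"), new byte[] {1, 2, 3});
            return dir;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path dir = createSampleRecoveryDir();
        // An ordinary shutdown leaves the directory in place.
        System.out.println("removed on restart-style shutdown: " + cleanupOnShutdown(dir, false));
        // A decommission removes it.
        System.out.println("removed on decommission: " + cleanupOnShutdown(dir, true));
    }
}
```

The point of the flag is that an ordinary shutdown must leave the recovery state on disk, while a decommission has no restart to recover for.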

In LocalResourcesTrackerImpl#recoverResource()

+    incrementFileCountForLocalCacheDirectory(localDir.getParent());

Given localDir is already the parent of localPath, maybe we should just 
increment localDir rather than its parent? I didn't see a unit test that 
checks the file count for the resource directory after recovery. Maybe we should add 

The last component of localDir is the unique resource ID, not a directory 
managed by the local cache directory manager.  The localization process adds 
one more directory, named after the local resource's unique ID, beneath the 
directory allocated by the local cache directory manager.  For example, 
localPath might be /local/root/0/1/52/resource.jar, in which case localDir is 
/local/root/0/1/52.  The '52' is the unique resource ID (always >= 10, so it 
can't conflict with the single-character subdirectories the manager creates), 
and /local/root/0/1 is the directory managed by the local cache directory 
manager.  If we passed localDir to the manager, it would get confused: it 
would try to parse the last component as a subdirectory it created, which it 
is not.
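
The path relationship in that example can be sketched with plain java.nio paths. This is an illustration only, assuming POSIX-style paths. The class and helper names are hypothetical and are not methods of LocalResourcesTrackerImpl or the cache directory manager.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of the directory layout; not Hadoop code.
public class LocalCachePathSketch {

    // Directory owned by the cache directory manager for a localized file:
    // two levels up from the file, skipping the unique-resource-ID directory.
    static Path cacheManagedDir(Path localPath) {
        Path localDir = localPath.getParent(); // e.g. /local/root/0/1/52
        return localDir.getParent();           // e.g. /local/root/0/1
    }

    // True if the last component looks like a unique resource ID: all digits
    // and at least two characters (IDs are >= 10), so it cannot be mistaken
    // for a single-character subdirectory created by the manager.
    static boolean lastComponentIsResourceId(Path localDir) {
        String name = localDir.getFileName().toString();
        return name.length() >= 2 && name.chars().allMatch(Character::isDigit);
    }

    public static void main(String[] args) {
        Path localPath = Paths.get("/local/root/0/1/52/resource.jar");
        Path localDir = localPath.getParent();
        System.out.println("localDir         = " + localDir);
        System.out.println("resource ID dir? = " + lastComponentIsResourceId(localDir));
        System.out.println("cache-managed dir = " + cacheManagedDir(localPath));
    }
}
```

This mirrors why the count is incremented for localDir.getParent(): that is the directory the manager allocated, while localDir itself ends in the resource ID.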

I did add a unit test to verify local cache directory counts are incremented 
properly when resources are recovered.  This required exposing a couple of 
methods as package-private to get the necessary information for the test.

> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>                 Key: YARN-1338
>                 URL: https://issues.apache.org/jira/browse/YARN-1338
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: YARN-1338.patch, YARN-1338v2.patch, 
> YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch, 
> YARN-1338v6.patch
> Today when node manager restarts we clean up all the distributed cache files 
> from disk. This is definitely not ideal from 2 aspects.
> * For work preserving restart we definitely want them as running containers 
> are using them
> * For even non work preserving restart this will be useful in the sense that 
> we don't have to download them again if needed by future tasks.

This message was sent by Atlassian JIRA
