[
https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jason Lowe updated YARN-1338:
-----------------------------
Attachment: YARN-1338v6.patch
Thanks for the additional comments, Junping.
bq. Do we have any code to destroy DB items for NMState when NM is
decommissioned (not expecting short-term restart)?
Good point. I added shutdown code that removes the recovery directory if the
shutdown is due to a decommission. I also added a unit test for this scenario.
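To illustrate the idea only, here is a minimal sketch of removing a recovery directory on a decommission shutdown. The class name, method name, and the boolean flag are hypothetical and are not taken from the patch; the actual NodeManager code may be structured quite differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not the patch's code: deletes the recovery
// directory only when the shutdown is due to a decommission.
public class RecoveryDirCleanup {

    static void onShutdown(Path recoveryDir, boolean decommissioned)
            throws IOException {
        // A plain restart must keep the state so it can be recovered.
        if (!decommissioned || !Files.exists(recoveryDir)) {
            return;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(recoveryDir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }
}
```

Calling onShutdown(dir, false) leaves the directory untouched, which is the behavior a short-term restart would need.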
{quote}
In LocalResourcesTrackerImpl#recoverResource()
+ incrementFileCountForLocalCacheDirectory(localDir.getParent());
Given localDir is already the parent of localPath, maybe we should just
increment localDir rather than its parent? I didn't see a unit test that
checks the file count for the resource directory after recovery. Maybe we
should add one?
{quote}
The last component of localDir is the unique resource ID and not a directory
managed by the local cache directory manager. The localization process adds
one more directory, named after the unique ID of the local resource, beneath
the directory allocated by the local cache directory manager.
For example, the localPath might be something like
/local/root/0/1/52/resource.jar and localDir is /local/root/0/1/52. The '52'
is the unique resource ID (always >= 10 so it can't conflict with
single-character cache manager subdirs) and /local/root/0/1 is the directory
managed by the local cache directory manager. If we passed localDir to the
local cache directory manager, it would get confused: it would try to parse
the last component as a subdirectory it created, which it is not.
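The layout described above can be sketched with plain path operations. This is only an illustration using the example path from this comment; the class and method names are hypothetical and do not come from the patch.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical illustration of the directory layout, not NodeManager code.
public class LocalPathLayout {

    // Directory holding the localized file, e.g. /local/root/0/1/52.
    // Its last component is the unique resource ID.
    static Path localDir(Path localPath) {
        return localPath.getParent();
    }

    // Directory managed by the local cache directory manager,
    // e.g. /local/root/0/1. This is what the file count is tracked for.
    static Path cacheManagedDir(Path localPath) {
        return localDir(localPath).getParent();
    }

    // Resource IDs are always >= 10, so they have at least two characters
    // and cannot collide with single-character cache manager subdirs.
    static boolean looksLikeResourceId(String component) {
        try {
            return Long.parseLong(component) >= 10;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        Path localPath = Paths.get("/local/root/0/1/52/resource.jar");
        System.out.println("localDir        = " + localDir(localPath));
        System.out.println("cacheManagedDir = " + cacheManagedDir(localPath));
    }
}
```

In this sketch, cacheManagedDir corresponds to localDir.getParent() in the recoverResource() line quoted above.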
I did add a unit test to verify local cache directory counts are incremented
properly when resources are recovered. This required exposing a couple of
methods as package-private to get the necessary information for the test.
> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>
> Key: YARN-1338
> URL: https://issues.apache.org/jira/browse/YARN-1338
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
> Attachments: YARN-1338.patch, YARN-1338v2.patch,
> YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch, YARN-1338v5.patch,
> YARN-1338v6.patch
>
>
> Today, when the node manager restarts, we clean up all the distributed cache
> files from disk. This is not ideal in two respects:
> * For work-preserving restart, we definitely want them, as running containers
> are using them.
> * Even for non-work-preserving restart, keeping them is useful because
> we don't have to download them again if future tasks need them.