[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14003489#comment-14003489 ]
Junping Du commented on YARN-1338: ---------------------------------- [~jlowe], thanks again for your patch here! A few comments so far: One question in general: beside null store and a leveled store, I saw a memory store implemented there but no usage so far. Does it helps in some scenario or only for test purpose? In NodeManager#serviceInit() {code} if (recoveryEnabled) { ... + nmStore = new NMLeveldbStateStoreService(); + } else { + nmStore = new NMNullStateStoreService(); } + nmStore.init(conf); + nmStore.start(); {code} Can we abstract code since if block into a method, something like: initializeNMStore(conf)? which can make NodeManager#serviceInit() simpler. In yarn_server_nodemanager_recovery.proto, {code} +message LocalizedResourceProto { + optional LocalResourceProto resource = 1; + optional string localPath = 2; + optional int64 size = 3; +} {code} Does size here represent for size of local resource? If so, may be duplicated with the size within LocalResourceProto? In ResourceLocalizationService.java {code} + //Recover localized resources after an NM restart + public void recoverLocalizedResources(RecoveredLocalizationState state) + throws URISyntaxException { + ... + for (Map.Entry<ApplicationId, LocalResourceTrackerState> appEntry : + userResources.getAppTrackerStates().entrySet()) { + ApplicationId appId = appEntry.getKey(); + ... + recoverTrackerResources(tracker, appEntry.getValue()); + } + } + } {code} May be we should check appResourceState(appEntry.getValue)’s localizedResources and inProgressResources is not empty before recover it as we check for userResourceState? In NMMemoryStateStoreService#loadLocalizationState() {code} ... + if (tk.appId == null) { + rur.privateTrackerState = loadTrackerState(ts); + } else { + rur.appTrackerStates.put(tk.appId, loadTrackerState(ts)); + } ... {code} May be even in case tk.appId !=null, we should load private resource state as well? Given the patch is big enough, I haven’t finished my review although walk though it a few times. More comments may come later. > Recover localized resource cache state upon nodemanager restart > --------------------------------------------------------------- > > Key: YARN-1338 > URL: https://issues.apache.org/jira/browse/YARN-1338 > Project: Hadoop YARN > Issue Type: Sub-task > Components: nodemanager > Affects Versions: 2.3.0 > Reporter: Jason Lowe > Assignee: Jason Lowe > Attachments: YARN-1338.patch, YARN-1338v2.patch, > YARN-1338v3-and-YARN-1987.patch, YARN-1338v4.patch > > > Today when node manager restarts we clean up all the distributed cache files > from disk. This is definitely not ideal from 2 aspects. > * For work preserving restart we definitely want them as running containers > are using them > * For even non work preserving restart this will be useful in the sense that > we don't have to download them again if needed by future tasks. -- This message was sent by Atlassian JIRA (v6.2#6252)