[ 
https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820378#comment-13820378
 ] 

Omkar Vinit Joshi commented on YARN-1338:
-----------------------------------------

Thanks [~jlowe] 
bq. I would rather not tie a checksum to this. Corruption of the file isn't 
related to whether the NM is restarting, and it seems odd to only check for 
corruption on restart rather than every time the resource is requested. IMHO we 
should treat checksums for localized resources as an orthogonal feature request 
to this. (It would also significantly slow down the recovery time if the NM had 
to checksum-compare everything in the distcache on startup.)
Yes, I completely agree. Checksumming should be a separate feature rather than
something done as part of this.

bq. So if we persist the LocalResourceRequest to LocalizedResource map then we 
can tell after a recovery whether we already have the requested resource or not 
when a new request arrives.
Agreed. This way we will have all the information we need to reconstruct the 
cache. 
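To make the idea concrete, here is a minimal sketch of persisting a request-to-local-path map and reloading it after a restart. Everything in it is an assumption for illustration: the class `LocalCacheStateSketch`, the string key standing in for a LocalResourceRequest, and the use of a `java.util.Properties` file as the store are not taken from the YARN code base.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

// Hypothetical sketch: none of these names come from the YARN code base.
class LocalCacheStateSketch {
    private final Path storeFile;
    private final Properties entries = new Properties();

    LocalCacheStateSketch(Path storeFile) throws IOException {
        this.storeFile = storeFile;
        // On (re)start, reload whatever was persisted before the restart.
        if (Files.exists(storeFile)) {
            try (InputStream in = Files.newInputStream(storeFile)) {
                entries.load(in);
            }
        }
    }

    // Assumed key format standing in for a LocalResourceRequest.
    static String requestKey(String remoteUrl, long timestamp) {
        return remoteUrl + "@" + timestamp;
    }

    // Record a resource once it is LOCALIZED and flush the map to disk.
    void recordLocalized(String requestKey, String localPath) throws IOException {
        entries.setProperty(requestKey, localPath);
        try (OutputStream out = Files.newOutputStream(storeFile)) {
            entries.store(out, null);
        }
    }

    // Returns the local path if the resource is known, else null.
    String lookup(String requestKey) {
        return entries.getProperty(requestKey);
    }
}
```

After a restart, a new instance built on the same file answers `lookup` for previously localized resources, so a new request can be satisfied without downloading again.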

bq. We have a very rough start on persisting the local cache state, and I plan 
on working on this in earnest in the next few weeks.
Good.

Any thoughts on how and when we plan to write the container's resource
requests and the newly downloaded resources to the persistent store?
* For an individual resource it is straightforward: when the download finishes
and the resource is marked as LOCALIZED, we should save the info (the way RM
restart does today for RMAppImpl: NEW to NEW_SAVING to SUBMITTED).
* For the container's request it is a bit trickier. We could save it:
** when we initially get the request for all the required resources during
container start,
** when each individual resource request is satisfied (as they are added to the
ref of LocalizedResource), or
** when all of the container's resources have been downloaded / localized.
The 3rd scenario looks good to me because:
* By then we will have information about all the localized resources. If
downloading failed for any of them, we don't care about storing the partial
success, so we can avoid this write.
* When the container finishes / fails, we can simply remove the entry.
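A rough sketch of the 3rd scenario, under stated assumptions: the class `ContainerResourceTrackerSketch`, its method names, and the in-memory `store` map standing in for the persistent store are all hypothetical. It only illustrates the bookkeeping (one write once everything is localized, no write on failure, removal on finish).

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of the "3rd scenario"; names are illustrative, not YARN APIs.
class ContainerResourceTrackerSketch {
    // Everything each container asked for (in memory only).
    private final Map<String, Set<String>> requested = new HashMap<>();
    // Resources each container is still waiting on (in memory only).
    private final Map<String, Set<String>> pending = new HashMap<>();
    // Stand-in for the persistent store: containerId -> localized resources.
    final Map<String, Set<String>> store = new HashMap<>();

    void containerRequested(String containerId, Set<String> resources) {
        requested.put(containerId, new HashSet<>(resources));
        pending.put(containerId, new HashSet<>(resources));
    }

    // Called when one resource reaches LOCALIZED; writes only after the last one.
    void resourceLocalized(String containerId, String resource) {
        Set<String> waiting = pending.get(containerId);
        if (waiting == null) {
            return; // unknown container, or already failed / fully localized
        }
        waiting.remove(resource);
        if (waiting.isEmpty()) {
            store.put(containerId, requested.get(containerId)); // the single write
            pending.remove(containerId);
        }
    }

    // A failed download means the partial success is never written.
    void resourceFailed(String containerId) {
        pending.remove(containerId);
        requested.remove(containerId);
    }

    // On container finish / failure, drop the stored entry.
    void containerFinished(String containerId) {
        pending.remove(containerId);
        requested.remove(containerId);
        store.remove(containerId);
    }
}
```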
Any thoughts on whether we should hold off container start until all the
writes to the store have been processed, or can we start in parallel? Parallel
writes don't look good to me: if any of the write events are in flight when the
NM restarts, then after the restart we won't know about those changes. At the
same time, if we wait for all the writes to go through, we delay container
start by that duration.
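The two orderings can be sketched as follows. This is only an illustration of the trade-off: `StartOrderingSketch`, `persist` and `launch` are made-up names, and the "store" is just an event list.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical sketch of the two orderings; not based on actual NM code.
class StartOrderingSketch {
    // Records the order in which things happened.
    final List<String> events = new CopyOnWriteArrayList<>();

    // Stand-in for an asynchronous write to the state store.
    CompletableFuture<Void> persist(String containerId) {
        return CompletableFuture.runAsync(() -> events.add("stored " + containerId));
    }

    // Stand-in for launching the container.
    void launch(String containerId) {
        events.add("started " + containerId);
    }

    // Option A: block until the write is done, then start.
    // Safe across a restart, but start is delayed by the write latency.
    void startAfterWrite(String containerId) {
        persist(containerId).join();
        launch(containerId);
    }

    // Option B: start without waiting for the write.
    // No added latency, but a restart while the write is in flight loses it.
    CompletableFuture<Void> startInParallel(String containerId) {
        CompletableFuture<Void> write = persist(containerId);
        launch(containerId);
        return write;
    }
}
```

With option A the store always knows about a container before it runs; with option B the relative order of the write and the start is not guaranteed.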

> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>
>                 Key: YARN-1338
>                 URL: https://issues.apache.org/jira/browse/YARN-1338
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>
> Today when node manager restarts we clean up all the distributed cache files 
> from disk. This is definitely not ideal from 2 aspects.
> * For work preserving restart we definitely want them as running containers 
> are using them
> * For even non work preserving restart this will be useful in the sense that 
> we don't have to download them again if needed by future tasks.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
