[
https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820378#comment-13820378
]
Omkar Vinit Joshi commented on YARN-1338:
-----------------------------------------
Thanks [~jlowe]
bq. I would rather not tie a checksum to this. Corruption of the file isn't
related to whether the NM is restarting, and it seems odd to only check for
corruption on restart rather than every time the resource is requested. IMHO we
should treat checksums for localized resources as an orthogonal feature request
to this. (It would also significantly slow down the recovery time if the NM had
to checksum-compare everything in the distcache on startup.)
Yes, I completely agree: checksums should be an orthogonal feature rather than
part of this work.
bq. So if we persist the LocalResourceRequest to LocalizedResource map then we
can tell after a recovery whether we already have the requested resource or not
when a new request arrives.
Agreed. This way we will have all the information we need to reconstruct the
cache.
bq. We have a very rough start on persisting the local cache state, and I plan
on working on this in earnest in the next few weeks.
Good.
Any thoughts on how and when we plan to store the container's resource
requests and newly downloaded resources to the persistent store?
* For a resource, the answer seems clear: when the download finishes and the
resource is marked LOCALIZED, we should save the info (the way RM restart does
today for RMAppImpl: NEW...to...NEW_SAVING...to...SUBMITTED).
* But for a container's request it becomes a little trickier. Should we persist:
** when we initially receive the resource requests for all required resources
at container start?
** when each individual resource request is satisfied (i.e. as it is added to
the ref count of the LocalizedResource)?
** or when all of the container's resources have been downloaded / localized?
The third option looks best to me because:
* by then we have information about all the localized resources. If downloading
failed for any of them, we don't care about storing a partial success, so we
avoid that write.
* also, when the container finishes or fails, we can simply remove the entry.
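The persist-only-at-terminal-events idea above can be sketched as follows. This is a hypothetical illustration, not the actual NM recovery API: the class name, the string-keyed store, and all method names are assumptions, and a real NodeManager would back this with a durable store rather than an in-memory map.

```java
import java.util.HashMap;
import java.util.Map;

// Hedged sketch: persist localized-resource cache state only at terminal
// events (resource fully LOCALIZED; all of a container's resources done;
// container finished). All names here are illustrative assumptions.
public class LocalizerStateSketch {
    // Stand-in for a durable state store; a real NM would use an
    // on-disk store so entries survive a restart.
    private final Map<String, String> store = new HashMap<>();

    // Called once a resource transitions to LOCALIZED: record the
    // request-to-local-path mapping so the cache can be rebuilt.
    public void resourceLocalized(String resourceKey, String localPath) {
        store.put("resource/" + resourceKey, localPath);
    }

    // Called only after ALL of a container's resources are localized
    // (the third option above): one write, no partial state to clean up.
    public void containerResourcesComplete(String containerId, String resourceKeys) {
        store.put("container/" + containerId, resourceKeys);
    }

    // Called when the container finishes or fails: simply remove the entry.
    public void containerFinished(String containerId) {
        store.remove("container/" + containerId);
    }

    public String lookup(String key) {
        return store.get(key);
    }
}
```

On recovery, iterating the `resource/` entries would rebuild the LocalResourceRequest-to-LocalizedResource map, which is exactly what lets the NM tell whether a newly requested resource is already cached.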
Any thoughts on whether we should hold container start until all writes to the
store have been processed, or start in parallel? Parallel writes don't look
good to me: if any write is still in flight when the NM restarts, we won't
know about those changes after the restart. On the other hand, if we wait for
all the writes to go through, we delay container start by that duration.
> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>
> Key: YARN-1338
> URL: https://issues.apache.org/jira/browse/YARN-1338
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Jason Lowe
>
> Today when the node manager restarts, we clean up all the distributed cache
> files on disk. This is not ideal for two reasons:
> * For work-preserving restart we definitely want to keep them, since running
> containers are using them.
> * Even for non-work-preserving restart, keeping them is useful because we
> don't have to download them again if future tasks need them.
--
This message was sent by Atlassian JIRA
(v6.1#6144)