[ https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820217#comment-13820217 ]
Jason Lowe commented on YARN-1338:
----------------------------------
bq. RemoteUrl (Here do we need to trust that the old and new url are
identical..not changed)?
To resolve that we also need to persist the {{localrsrc}} map in
LocalResourcesTrackerImpl, since it lets us tell whether a requested resource
is one we already have. The request contains the timestamp of the remote
resource which is combined with the remote path, resource type, and any pattern
specified to identify the resource being requested. So if we persist the
LocalResourceRequest to LocalizedResource map then we can tell after a recovery
whether we already have the requested resource or not when a new request
arrives.
As for whether the remote resource has changed, arguably the point is moot for
recovery. The containers already running don't want us to change the resource
they're currently using even if it has changed remotely. Any new resources
requested for the same path will compare against the recovered resource
requests and realize that the remote resource has changed (e.g.: new timestamp)
and therefore it's a different resource request than the one already persisted
and needs to be localized separately.
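As a rough illustration of the lookup described above (a sketch only, not the actual NM code): {{ResourceKey}} and {{RecoveredCache}} are hypothetical stand-ins for LocalResourceRequest and the persisted {{localrsrc}} map, and the resource type is modeled as a plain string for brevity. The key combines remote path, timestamp, type, and pattern, so a request for the same path with a newer timestamp misses the recovered entry and is localized separately.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-in for LocalResourceRequest. Identity is the remote
// path, remote timestamp, resource type, and optional pattern.
final class ResourceKey {
    final String remotePath;
    final long timestamp;
    final String type;     // assumption: type modeled as a string, e.g. "FILE"
    final String pattern;  // may be null when no pattern is specified

    ResourceKey(String remotePath, long timestamp, String type, String pattern) {
        this.remotePath = remotePath;
        this.timestamp = timestamp;
        this.type = type;
        this.pattern = pattern;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ResourceKey)) return false;
        ResourceKey other = (ResourceKey) o;
        return timestamp == other.timestamp
            && Objects.equals(remotePath, other.remotePath)
            && Objects.equals(type, other.type)
            && Objects.equals(pattern, other.pattern);
    }

    @Override
    public int hashCode() {
        return Objects.hash(remotePath, timestamp, type, pattern);
    }
}

// Hypothetical stand-in for the recovered request-to-resource map.
final class RecoveredCache {
    private final Map<ResourceKey, String> localPaths = new HashMap<>();

    // Called while replaying persisted state after a restart.
    void recover(ResourceKey key, String localPath) {
        localPaths.put(key, localPath);
    }

    // Returns the local path if this exact request was already localized,
    // or null if it must be localized as a new resource.
    String lookup(ResourceKey request) {
        return localPaths.get(request);
    }
}
```

With this shape, running containers keep using the recovered entry untouched, while a changed remote file simply shows up as a different key.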
bq. we store the resources inside the distributed cache in an hierarchical
manner (to avoid unix directory limit)... we may need to recover that too).
Yes, any LocalCacheDirectoryManager in use will have to recover its state
accordingly.
bq. checksum?
I would rather not tie a checksum to this. Corruption of the file isn't
related to whether the NM is restarting, and it seems odd to only check for
corruption on restart rather than every time the resource is requested. IMHO
we should treat checksums for localized resources as an orthogonal feature
request to this. (It would also significantly slow down the recovery time if
the NM had to checksum-compare everything in the distcache on startup.)
bq. Do we need to store the symlink we are creating?
I don't see a need to separately persist information on the symlinks to
resources in each container working directory. Either the container was
successfully localized or it wasn't when we restarted. If it was then we leave
it as-is, and the symlink will be reaped when the container directory is
removed after the container completes. If the container was still localizing
when we restarted then the simplest thing to do in the short-term is to fail
the localization (and possibly retry).
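The short-term policy above could be sketched as follows. This is only an illustration under assumptions: the class, enum names, and method are hypothetical and do not correspond to existing NM classes.

```java
// Hypothetical sketch of the per-container decision made on NM restart.
final class ContainerRecoverySketch {
    // Assumed states a container could be in when the NM went down.
    enum LocalizationState { LOCALIZED, LOCALIZING }

    // Assumed actions the recovering NM could take.
    enum RecoveryAction { KEEP_AS_IS, FAIL_LOCALIZATION }

    // A fully localized container is left alone; its symlinks are reaped
    // later along with the container directory. A container that was still
    // localizing has its localization failed (a retry may follow).
    static RecoveryAction onRestart(LocalizationState state) {
        return state == LocalizationState.LOCALIZED
            ? RecoveryAction.KEEP_AS_IS
            : RecoveryAction.FAIL_LOCALIZATION;
    }
}
```

No symlink information appears in the recovered state here, which is the point: the container's localization state alone is enough to decide.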
bq. anyone working on this actively?
We have a very rough start on persisting the local cache state, and I plan on
working on this in earnest in the next few weeks.
> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>
> Key: YARN-1338
> URL: https://issues.apache.org/jira/browse/YARN-1338
> Project: Hadoop YARN
> Issue Type: Sub-task
> Components: nodemanager
> Affects Versions: 2.3.0
> Reporter: Jason Lowe
> Assignee: Ravi Prakash
>
> Today when node manager restarts we clean up all the distributed cache files
> from disk. This is definitely not ideal from 2 aspects.
> * For work preserving restart we definitely want them as running containers
> are using them
> * For even non work preserving restart this will be useful in the sense that
> we don't have to download them again if needed by future tasks.