[ 
https://issues.apache.org/jira/browse/YARN-1338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13820217#comment-13820217
 ] 

Jason Lowe commented on YARN-1338:
----------------------------------

bq. RemoteUrl (Here do we need to trust that the old and new url are 
identical..not changed)?

To resolve that we also need to persist the {{localrsrc}} map in 
LocalResourcesTrackerImpl, since that map is what lets us tell whether a 
resource being requested is one we already have.  The request contains the 
timestamp of the remote resource, which is combined with the remote path, 
resource type, and any pattern specified to identify the resource being 
requested.  So if we persist the LocalResourceRequest to LocalizedResource map 
then, after a recovery, we can tell whether we already have the requested 
resource when a new request arrives.

As for whether the remote resource has changed, arguably the point is moot with 
respect to recovery.  The containers already running don't want us to change 
the resource they're currently using even if it has changed remotely.  Any new 
request for the same path will be compared against the recovered resource 
requests.  If the remote resource has changed (e.g., it has a new timestamp) 
the request won't match the one already persisted, so it is treated as a 
different resource and localized separately.
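
To make the identity rule above concrete, here is a minimal sketch.  ResourceKey 
and RecoveredTracker are hypothetical stand-ins, not the real 
LocalResourceRequest or LocalResourcesTrackerImpl, and the paths are made up.  
It only shows that a key built from remote path, timestamp, type, and pattern 
is enough to answer "do we already have this?" after the map is reloaded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

/** Hypothetical stand-in for the request key; not the real YARN class. */
final class ResourceKey {
    final String remotePath;
    final long timestamp;
    final String type;     // illustrative values: "FILE", "ARCHIVE", "PATTERN"
    final String pattern;  // may be null when no pattern was specified

    ResourceKey(String remotePath, long timestamp, String type, String pattern) {
        this.remotePath = remotePath;
        this.timestamp = timestamp;
        this.type = type;
        this.pattern = pattern;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof ResourceKey)) return false;
        ResourceKey k = (ResourceKey) o;
        // All four fields take part in identity, as described in the comment.
        return timestamp == k.timestamp
            && remotePath.equals(k.remotePath)
            && type.equals(k.type)
            && Objects.equals(pattern, k.pattern);
    }

    @Override
    public int hashCode() {
        return Objects.hash(remotePath, timestamp, type, pattern);
    }
}

/** Hypothetical tracker rebuilt from persisted entries after a restart. */
final class RecoveredTracker {
    private final Map<ResourceKey, String> localPaths = new HashMap<>();

    /** Called once per persisted entry during recovery. */
    void recover(ResourceKey key, String localPath) {
        localPaths.put(key, localPath);
    }

    /** Returns the local path if the resource is already localized, else null. */
    String lookup(ResourceKey request) {
        return localPaths.get(request);
    }
}
```

With this shape, a request for the same path with a newer timestamp simply 
misses in the map, so it would be localized as a separate resource while the 
recovered entry stays untouched for the containers still using it.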

bq. we store the resources inside the distributed cache in an hierarchical 
manner (to avoid unix directory limit)... we may need to recover that too).

Yes, any LocalCacheDirectoryManager in use will have to recover its state 
accordingly.
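
As a rough illustration of what "recover its state" could mean, the sketch 
below rebuilds per-directory occupancy from the relative directories of the 
recovered resources.  DirectoryUsage and its per-directory limit are 
assumptions made for this example; the real LocalCacheDirectoryManager tracks 
its state differently in detail.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; not the real LocalCacheDirectoryManager. */
final class DirectoryUsage {
    private final int perDirLimit;
    private final Map<String, Integer> counts = new HashMap<>();

    DirectoryUsage(int perDirLimit) {
        this.perDirLimit = perDirLimit;
    }

    /** Record one recovered resource stored under relativeDir ("" is the cache root). */
    void recoverPath(String relativeDir) {
        counts.merge(relativeDir, 1, Integer::sum);
    }

    /** Number of recovered resources known to live in relativeDir. */
    int count(String relativeDir) {
        return counts.getOrDefault(relativeDir, 0);
    }

    /** True if relativeDir can take another resource without exceeding the limit. */
    boolean hasRoom(String relativeDir) {
        return count(relativeDir) < perDirLimit;
    }
}
```

The point is only that the counts have to be rebuilt before new resources are 
placed, otherwise a restarted NM could overfill a directory it believes is 
empty.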

bq. checksum?

I would rather not tie a checksum to this.  Corruption of the file isn't 
related to whether the NM is restarting, and it seems odd to only check for 
corruption on restart rather than every time the resource is requested.  IMHO 
we should treat checksums for localized resources as an orthogonal feature 
request to this.  (It would also significantly slow down the recovery time if 
the NM had to checksum-compare everything in the distcache on startup.)

bq. Do we need to store the symlink we are creating?

I don't see a need to separately persist information on the symlinks to 
resources in each container working directory.  Either the container was 
successfully localized or it wasn't when we restarted.  If it was then we leave 
it as-is, and the symlink will be reaped when the container directory is 
removed after the container completes.  If the container was still localizing 
when we restarted then the simplest thing to do in the short term is to fail 
the localization (and possibly retry).
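
The decision above is small enough to write down directly.  The enum and class 
names here are illustrative only and do not correspond to existing YARN 
container states.

```java
/** Hypothetical states observed for a container when the NM restarts. */
enum LocalizationState { LOCALIZED, LOCALIZING }

/** Hypothetical actions the recovery code could take. */
enum RecoveryAction { LEAVE_AS_IS, FAIL_LOCALIZATION }

final class ContainerRecovery {
    /** Maps the state found at restart to the short-term action described above. */
    static RecoveryAction onRestart(LocalizationState state) {
        switch (state) {
            case LOCALIZED:
                // Symlinks are left alone; they go away with the container directory.
                return RecoveryAction.LEAVE_AS_IS;
            case LOCALIZING:
                // Simplest short-term option: fail it, and let the caller retry if desired.
                return RecoveryAction.FAIL_LOCALIZATION;
            default:
                throw new IllegalArgumentException("unknown state: " + state);
        }
    }
}
```

Nothing about individual symlinks needs to be stored for either branch, which 
is why no separate persistence is proposed for them.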

bq. anyone working on this actively?

We have a very rough start on persisting the local cache state, and I plan on 
working on this in earnest in the next few weeks.

> Recover localized resource cache state upon nodemanager restart
> ---------------------------------------------------------------
>
>                 Key: YARN-1338
>                 URL: https://issues.apache.org/jira/browse/YARN-1338
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager
>    Affects Versions: 2.3.0
>            Reporter: Jason Lowe
>            Assignee: Ravi Prakash
>
> Today when the node manager restarts we clean up all the distributed cache 
> files from disk. This is not ideal in two respects:
> * For a work-preserving restart we definitely want to keep them, as running 
> containers are using them.
> * Even for a non-work-preserving restart they are useful, since we don't have 
> to download them again if future tasks need them.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
