[
https://issues.apache.org/jira/browse/YARN-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Trezzo updated YARN-5767:
-------------------------------
Attachment: YARN-5767-trunk-v4.patch
Attached is a v4 patch for trunk addressing all comments from the reviews.
Thanks!
> Fix the order that resources are cleaned up from the local Public/Private
> caches
> --------------------------------------------------------------------------------
>
> Key: YARN-5767
> URL: https://issues.apache.org/jira/browse/YARN-5767
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.7.0, 3.0.0-alpha1
> Reporter: Chris Trezzo
> Assignee: Chris Trezzo
> Attachments: YARN-5767-trunk-v1.patch, YARN-5767-trunk-v2.patch,
> YARN-5767-trunk-v3.patch, YARN-5767-trunk-v4.patch
>
>
> If you look at {{ResourceLocalizationService#handleCacheCleanup}}, you can
> see that public resources are added to the {{ResourceRetentionSet}} first
> followed by private resources:
> {code:java}
> private void handleCacheCleanup(LocalizationEvent event) {
> ResourceRetentionSet retain =
> new ResourceRetentionSet(delService, cacheTargetSize);
> retain.addResources(publicRsrc);
> if (LOG.isDebugEnabled()) {
> LOG.debug("Resource cleanup (public) " + retain);
> }
> for (LocalResourcesTracker t : privateRsrc.values()) {
> retain.addResources(t);
> if (LOG.isDebugEnabled()) {
> LOG.debug("Resource cleanup " + t.getUser() + ":" + retain);
> }
> }
> //TODO Check if appRsrcs should also be added to the retention set.
> }
> {code}
> Unfortunately, if we look at {{ResourceRetentionSet#addResources}} we see
> that this means public resources are deleted first until the target cache
> size is met:
> {code:java}
> public void addResources(LocalResourcesTracker newTracker) {
> for (LocalizedResource resource : newTracker) {
> currentSize += resource.getSize();
> if (resource.getRefCount() > 0) {
> // always retain resources in use
> continue;
> }
> retain.put(resource, newTracker);
> }
> for (Iterator<Map.Entry<LocalizedResource,LocalResourcesTracker>> i =
> retain.entrySet().iterator();
> currentSize - delSize > targetSize && i.hasNext();) {
> Map.Entry<LocalizedResource,LocalResourcesTracker> rsrc = i.next();
> LocalizedResource resource = rsrc.getKey();
> LocalResourcesTracker tracker = rsrc.getValue();
> if (tracker.remove(resource, delService)) {
> delSize += resource.getSize();
> i.remove();
> }
> }
> }
> {code}
> The result of this is that resources in the private cache are only deleted in
> the cases where:
> # The cache size is larger than the target cache size and the public cache is
> empty.
> # The cache size is larger than the target cache size and everything in the
> public cache is being used by a running container.
> For clusters that primarily use the public cache (i.e. make use of the shared
> cache), this means that the most commonly used resources can be deleted
> before old resources in the private cache. Furthermore, the private cache can
> continue to grow over time causing more and more churn in the public cache.
> Additionally, the same problem exists within the private cache. Since
> resources are added to the retention set on a user by user basis, resources
> will get cleaned up one user at a time in the order that privateRsrc.values()
> returns the LocalResourcesTracker. So if user1 has 10MB in their cache and
> user2 has 100MB in their cache and the target size of the cache is 50MB,
> user1 could potentially have their entire cache removed before anything is
> deleted from the user2 cache.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]