[ https://issues.apache.org/jira/browse/YARN-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Trezzo updated YARN-5767:
-------------------------------
Attachment: YARN-5767-trunk-v1.patch
Attached is a v1 patch for trunk.
In this initial patch I actually went with approach #1.
Here is a summary of the modifications this patch makes:
# Renamed the {{ResourceRetentionSet}} class to {{LocalCacheCleaner}}. In the
patch it looks like a delete/add.
# Modified {{LocalCacheCleaner#addResources}} so that it only adds resources to
the map and does not clean.
# Added a new method {{LocalCacheCleaner#cleanCache}} that is actually
responsible for cleaning the cache. The intended usage is to add all of the
resources to the cleaner first and then call {{cleanCache}} once. All resources
that the cleaner is aware of at that point are then cleaned up on an LRU basis.
# Added a new stats class to {{LocalCacheCleaner}} that keeps track of the same
stats {{ResourceRetentionSet}} did, plus an optional more detailed breakdown of
what was cleaned from private caches.
# Added a new test class {{TestLocalCacheCleanup}}. It tests a basic cleanup,
a cleanup where some resources have positive ref counts, that the cleaner
really does apply an LRU policy across both the private and public caches,
and finally that the cleanup stats are correct.
# Deleted the {{TestRetentionSet}} class because it is now redundant with
{{TestLocalCacheCleanup}}.
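The add-then-clean flow described in the list above can be sketched roughly as follows. This is only an illustration and not code from the patch: the class {{LruCleanerSketch}}, its {{Resource}} type, and the {{lastAccess}} field are hypothetical stand-ins for the real YARN classes, and deletion is simulated by collecting the names of the removed resources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch; these are NOT the real YARN classes or signatures.
class LruCleanerSketch {

  /** A cached resource: name, size, last-access time, and reference count. */
  static final class Resource {
    final String name;
    final long size;
    final long lastAccess;
    final int refCount;

    Resource(String name, long size, long lastAccess, int refCount) {
      this.name = name;
      this.size = size;
      this.lastAccess = lastAccess;
      this.refCount = refCount;
    }
  }

  private final long targetSize;
  private long currentSize = 0;
  private final List<Resource> candidates = new ArrayList<>();

  LruCleanerSketch(long targetSize) {
    this.targetSize = targetSize;
  }

  /** Step 1: record resources from one cache; nothing is deleted here. */
  void addResources(List<Resource> cache) {
    for (Resource r : cache) {
      currentSize += r.size;
      if (r.refCount > 0) {
        continue; // resources in use are never cleanup candidates
      }
      candidates.add(r);
    }
  }

  /** Step 2: delete least recently used candidates until the target is met. */
  List<String> cleanCache() {
    // Oldest access time first, regardless of which cache a resource came from.
    candidates.sort(Comparator.comparingLong((Resource r) -> r.lastAccess));
    List<String> deleted = new ArrayList<>();
    long delSize = 0;
    Iterator<Resource> it = candidates.iterator();
    while (currentSize - delSize > targetSize && it.hasNext()) {
      Resource r = it.next();
      delSize += r.size;
      deleted.add(r.name); // stands in for the real file deletion
      it.remove();
    }
    currentSize -= delSize;
    return deleted;
  }

  public static void main(String[] args) {
    LruCleanerSketch cleaner = new LruCleanerSketch(50);
    // "Public" cache: one recently used resource and one that is still in use.
    cleaner.addResources(Arrays.asList(
        new Resource("public-hot", 40, 300, 0),
        new Resource("public-in-use", 10, 100, 1)));
    // "Private" cache: two stale resources.
    cleaner.addResources(Arrays.asList(
        new Resource("private-old", 30, 100, 0),
        new Resource("private-older", 30, 50, 0)));
    System.out.println(cleaner.cleanCache()); // prints [private-older, private-old]
  }
}
```

Because nothing is deleted until every cache has been added, the order in which the caches are passed to {{addResources}} no longer decides what gets removed. In this example the stale private resources go first and the recently used public one survives.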
Please let me know your thoughts! If there is too much going on in the patch, I
can always break it down into smaller ones. Thanks.
/cc [~jlowe] [~sjlee0]
> Fix the order that resources are cleaned up from the local Public/Private
> caches
> --------------------------------------------------------------------------------
>
> Key: YARN-5767
> URL: https://issues.apache.org/jira/browse/YARN-5767
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.6.0, 2.7.0, 3.0.0-alpha1
> Reporter: Chris Trezzo
> Assignee: Chris Trezzo
> Attachments: YARN-5767-trunk-v1.patch
>
>
> If you look at {{ResourceLocalizationService#handleCacheCleanup}}, you can
> see that public resources are added to the {{ResourceRetentionSet}} first
> followed by private resources:
> {code:java}
>   private void handleCacheCleanup(LocalizationEvent event) {
>     ResourceRetentionSet retain =
>         new ResourceRetentionSet(delService, cacheTargetSize);
>     retain.addResources(publicRsrc);
>     if (LOG.isDebugEnabled()) {
>       LOG.debug("Resource cleanup (public) " + retain);
>     }
>     for (LocalResourcesTracker t : privateRsrc.values()) {
>       retain.addResources(t);
>       if (LOG.isDebugEnabled()) {
>         LOG.debug("Resource cleanup " + t.getUser() + ":" + retain);
>       }
>     }
>     //TODO Check if appRsrcs should also be added to the retention set.
>   }
> {code}
> Unfortunately, if we look at {{ResourceRetentionSet#addResources}} we see
> that this means public resources are deleted first until the target cache
> size is met:
> {code:java}
>   public void addResources(LocalResourcesTracker newTracker) {
>     for (LocalizedResource resource : newTracker) {
>       currentSize += resource.getSize();
>       if (resource.getRefCount() > 0) {
>         // always retain resources in use
>         continue;
>       }
>       retain.put(resource, newTracker);
>     }
>     for (Iterator<Map.Entry<LocalizedResource,LocalResourcesTracker>> i =
>            retain.entrySet().iterator();
>          currentSize - delSize > targetSize && i.hasNext();) {
>       Map.Entry<LocalizedResource,LocalResourcesTracker> rsrc = i.next();
>       LocalizedResource resource = rsrc.getKey();
>       LocalResourcesTracker tracker = rsrc.getValue();
>       if (tracker.remove(resource, delService)) {
>         delSize += resource.getSize();
>         i.remove();
>       }
>     }
>   }
> {code}
> The result of this is that resources in the private cache are only deleted in
> the cases where:
> # The cache size is larger than the target cache size and the public cache is
> empty.
> # The cache size is larger than the target cache size and everything in the
> public cache is being used by a running container.
> For clusters that primarily use the public cache (i.e. make use of the shared
> cache), this means that the most commonly used resources can be deleted
> before old resources in the private cache. Furthermore, the private cache can
> continue to grow over time causing more and more churn in the public cache.
> Additionally, the same problem exists within the private cache. Since
> resources are added to the retention set on a user by user basis, resources
> are cleaned up one user at a time, in the order that {{privateRsrc.values()}}
> returns the {{LocalResourcesTracker}} instances. So if user1 has 10MB in their cache and
> user2 has 100MB in their cache and the target size of the cache is 50MB,
> user1 could potentially have their entire cache removed before anything is
> deleted from the user2 cache.