[ https://issues.apache.org/jira/browse/YARN-5767?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris Trezzo updated YARN-5767:
-------------------------------
Description:
If you look at {{ResourceLocalizationService#handleCacheCleanup}}, you can see
that public resources are added to the {{ResourceRetentionSet}} first, followed
by private resources:
{code:java}
private void handleCacheCleanup(LocalizationEvent event) {
  ResourceRetentionSet retain =
      new ResourceRetentionSet(delService, cacheTargetSize);
  retain.addResources(publicRsrc);
  if (LOG.isDebugEnabled()) {
    LOG.debug("Resource cleanup (public) " + retain);
  }
  for (LocalResourcesTracker t : privateRsrc.values()) {
    retain.addResources(t);
    if (LOG.isDebugEnabled()) {
      LOG.debug("Resource cleanup " + t.getUser() + ":" + retain);
    }
  }
  //TODO Check if appRsrcs should also be added to the retention set.
}
{code}
Unfortunately, if we look at {{ResourceRetentionSet#addResources}}, we see that
it runs its deletion loop at the end of every call. Combined with the ordering
above, this means public resources are deleted first, until the target cache
size is met:
{code:java}
public void addResources(LocalResourcesTracker newTracker) {
  for (LocalizedResource resource : newTracker) {
    currentSize += resource.getSize();
    if (resource.getRefCount() > 0) {
      // always retain resources in use
      continue;
    }
    retain.put(resource, newTracker);
  }
  for (Iterator<Map.Entry<LocalizedResource,LocalResourcesTracker>> i =
         retain.entrySet().iterator();
       currentSize - delSize > targetSize && i.hasNext();) {
    Map.Entry<LocalizedResource,LocalResourcesTracker> rsrc = i.next();
    LocalizedResource resource = rsrc.getKey();
    LocalResourcesTracker tracker = rsrc.getValue();
    if (tracker.remove(resource, delService)) {
      delSize += resource.getSize();
      i.remove();
    }
  }
}
{code}
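To make the per-call behavior concrete, below is a simplified, runnable model of the loop above (Java 16+). {{Res}} and {{IncrementalRetentionSet}} are hypothetical stand-ins rather than Hadoop classes, sizes are in MB, and the model assumes the retention map is ordered oldest-first by last-use time. It only illustrates that eviction runs at the end of every {{addResources}} call, so whichever tracker is passed in first is trimmed against the target before later trackers are visible.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for LocalizedResource: name, size in MB,
// last-use timestamp, and reference count (> 0 means in use).
record Res(String name, long sizeMb, long lastUsed, int refCount) {}

// Hypothetical model of ResourceRetentionSet#addResources.
// Assumption: the retention map is ordered oldest-first by lastUsed.
class IncrementalRetentionSet {
  private final long targetMb;
  private long currentMb = 0;
  private long deletedMb = 0;
  private final TreeMap<Res, String> retain = new TreeMap<>(
      Comparator.comparingLong(Res::lastUsed).thenComparing(Res::name));
  // Records "tracker:resource" in the order resources were deleted.
  final List<String> deleted = new ArrayList<>();

  IncrementalRetentionSet(long targetMb) {
    this.targetMb = targetMb;
  }

  void addResources(String trackerName, List<Res> tracker) {
    // Every resource counts toward the current size...
    for (Res r : tracker) {
      currentMb += r.sizeMb();
      if (r.refCount() > 0) {
        continue; // always retain resources in use
      }
      retain.put(r, trackerName);
    }
    // ...but eviction happens right away, considering only the
    // resources added so far, oldest first.
    for (Iterator<Map.Entry<Res, String>> i = retain.entrySet().iterator();
         currentMb - deletedMb > targetMb && i.hasNext();) {
      Map.Entry<Res, String> e = i.next();
      deletedMb += e.getKey().sizeMb();
      deleted.add(e.getValue() + ":" + e.getKey().name());
      i.remove();
    }
  }
}
{code}
For example, with a 50MB target, adding a public tracker that holds P1 (30MB, last used at t=10) and P2 (30MB, t=20) deletes P1 during that first call, before any private tracker has been passed in.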
As a result, resources in the private cache are only deleted when the cache
size exceeds the target cache size and either the public cache is empty or
everything in the public cache is being used by a running container. For
clusters that primarily use the public cache (i.e., clusters that make use of
the shared cache), this means that the most commonly used resources can be
deleted before old resources in the private cache. Furthermore, the private
cache can continue to grow over time, causing more and more churn in the public
cache.
Additionally, the same problem exists within the private cache. Since resources
are added to the retention set on a user-by-user basis, resources will get
cleaned up one user at a time, in the order in which {{privateRsrc.values()}}
returns the {{LocalResourcesTracker}} instances. So if user1 has 10MB in their
cache, user2 has 100MB in theirs, and the target cache size is 50MB, user1
could potentially have their entire cache removed before anything is deleted
from user2's cache.
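One possible direction for the fix, sketched here under stated assumptions, is to collect the public tracker and every private tracker before deleting anything, and then evict unused resources across all of them in least-recently-used order. {{CachedResource}} and {{GlobalLruCleaner}} are hypothetical names, not Hadoop APIs; sizes are in MB and {{lastUsed}} stands in for a last-access timestamp.
{code:java}
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical resource record; owner is "public" or a user name.
record CachedResource(String owner, String name, long sizeMb,
                      long lastUsed, int refCount) {}

// Hypothetical cleaner: gather all trackers first, then evict globally.
class GlobalLruCleaner {
  private final long targetMb;
  private final List<CachedResource> all = new ArrayList<>();

  GlobalLruCleaner(long targetMb) {
    this.targetMb = targetMb;
  }

  // Only collects; nothing is deleted until cleanCache() runs.
  void addResources(List<CachedResource> tracker) {
    all.addAll(tracker);
  }

  // Deletes unused resources oldest-first until the cache fits the target.
  // Returns MB deleted per owner.
  Map<String, Long> cleanCache() {
    long currentMb = all.stream().mapToLong(CachedResource::sizeMb).sum();
    List<CachedResource> candidates = new ArrayList<>();
    for (CachedResource r : all) {
      if (r.refCount() == 0) { // resources in use are always retained
        candidates.add(r);
      }
    }
    candidates.sort(Comparator.comparingLong(CachedResource::lastUsed));
    Map<String, Long> deletedPerOwner = new LinkedHashMap<>();
    for (CachedResource r : candidates) {
      if (currentMb <= targetMb) {
        break;
      }
      currentMb -= r.sizeMb();
      deletedPerOwner.merge(r.owner(), r.sizeMb(), Long::sum);
    }
    return deletedPerOwner;
  }
}
{code}
With the numbers above, if user2's 100MB were ten 10MB resources all last used before user1's single 10MB resource, a 50MB target would delete 60MB from user2 and leave user1's cache untouched, regardless of which tracker was added first.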
> Fix the order that resources are cleaned up from the local Public/Private caches
> --------------------------------------------------------------------------------
>
> Key: YARN-5767
> URL: https://issues.apache.org/jira/browse/YARN-5767
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 2.7.3, 2.6.5, 3.0.0-alpha1
> Reporter: Chris Trezzo
> Assignee: Chris Trezzo
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)