[jira] [Updated] (YARN-6032) SharedCacheManager cleaner task should rm InMemorySCMStore some cachedResources which does not exists in hdfs fs

2016-12-27 Thread Zhaofei Meng (JIRA)

 [ 
https://issues.apache.org/jira/browse/YARN-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zhaofei Meng updated YARN-6032:
---
Summary:  SharedCacheManager cleaner task should rm InMemorySCMStore some 
cachedResources which does not exists in hdfs fs  (was:  scm cleaner task 
should rm InMemorySCMStore some cachedResources which does not exists in hdfs 
fs)

>  SharedCacheManager cleaner task should rm InMemorySCMStore some 
> cachedResources which does not exists in hdfs fs
> -
>
> Key: YARN-6032
> URL: https://issues.apache.org/jira/browse/YARN-6032
> Project: Hadoop YARN
>  Issue Type: Bug
>Affects Versions: 2.7.1
>Reporter: Zhaofei Meng
> Fix For: 2.7.1
>
>
> If cached resources exist in the SCM but no longer exist in HDFS, they are
> not removed from the SCM until the SCM is restarted. We should add a check
> to the cleaner task that removes cachedResources that do not exist in HDFS.
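
The check proposed above could look roughly like the following. This is a minimal sketch over a simplified in-memory map, not the real InMemorySCMStore API: `StaleEntrySweeper`, `sweep`, and the `existsInFs` predicate are hypothetical names, and in the actual cleaner the predicate would presumably be backed by a file system existence check on the resource path.

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Predicate;

/** Hypothetical stand-in for the SCM's in-memory store: resource key -> file path. */
class StaleEntrySweeper {
  private final Map<String, String> cachedResources = new ConcurrentHashMap<>();

  void addResource(String key, String path) {
    cachedResources.put(key, path);
  }

  boolean contains(String key) {
    return cachedResources.containsKey(key);
  }

  /**
   * Removes every entry whose backing file is reported missing.
   * existsInFs is an assumption: it stands in for an HDFS existence check.
   * Returns the number of entries removed.
   */
  int sweep(Predicate<String> existsInFs) {
    int removed = 0;
    Iterator<Map.Entry<String, String>> it = cachedResources.entrySet().iterator();
    while (it.hasNext()) {
      Map.Entry<String, String> entry = it.next();
      if (!existsInFs.test(entry.getValue())) {
        it.remove(); // drop the stale entry from the in-memory store
        removed++;
      }
    }
    return removed;
  }
}
```

Running such a sweep as part of each cleaner pass would let stale entries disappear without an SCM restart, at the cost of one existence check per cached resource.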



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Commented] (YARN-6032) scm cleaner task should rm InMemorySCMStore some cachedResources which does not exists in hdfs fs

2016-12-27 Thread Zhaofei Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-6032?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782180#comment-15782180
 ] 

Zhaofei Meng commented on YARN-6032:


We should modify the use() method in ClientProtocolService to verify whether
the file exists in HDFS.

  public UseSharedCacheResourceResponse use(
      UseSharedCacheResourceRequest request) throws YarnException,
      IOException {

    UseSharedCacheResourceResponse response =
        recordFactory.newRecordInstance(UseSharedCacheResourceResponse.class);

    UserGroupInformation callerUGI;
    try {
      callerUGI = UserGroupInformation.getCurrentUser();
    } catch (IOException ie) {
      LOG.info("Error getting UGI ", ie);
      throw RPCUtil.getRemoteException(ie);
    }

    // fileName is null if the resource is not in the store.
    String fileName =
        this.store.addResourceReference(request.getResourceKey(),
            new SharedCacheResourceReference(request.getAppId(),
                callerUGI.getShortUserName()));

    if (fileName != null) {
      // Proposed check: only report a hit if the file still exists in HDFS.
      if (fs.exists(new Path(fs.getHomeDirectory(), fileName))) {
        response.setPath(
            getCacheEntryFilePath(request.getResourceKey(), fileName));
        this.metrics.incCacheHitCount();
      } else {
        // Stale entry: the file is gone from HDFS, so drop it from the store.
        this.store.removeResource(request.getResourceKey());
      }
    } else {
      this.metrics.incCacheMissCount();
    }

    return response;
  }







[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2016-12-27 Thread Zhaofei Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15782173#comment-15782173
 ] 

Zhaofei Meng commented on YARN-1492:


Another related problem is described in YARN-6032.

> truly shared cache for jars (jobjar/libjar)
> ---
>
> Key: YARN-1492
> URL: https://issues.apache.org/jira/browse/YARN-1492
> Project: Hadoop YARN
>  Issue Type: New Feature
>Affects Versions: 2.0.4-alpha
>Reporter: Sangjin Lee
>Assignee: Chris Trezzo
> Attachments: YARN-1492-all-trunk-v1.patch, 
> YARN-1492-all-trunk-v2.patch, YARN-1492-all-trunk-v3.patch, 
> YARN-1492-all-trunk-v4.patch, YARN-1492-all-trunk-v5.patch, 
> shared_cache_design.pdf, shared_cache_design_v2.pdf, 
> shared_cache_design_v3.pdf, shared_cache_design_v4.pdf, 
> shared_cache_design_v5.pdf, shared_cache_design_v6.pdf
>
>
> Currently there is the distributed cache that enables you to cache jars and 
> files so that attempts from the same job can reuse them. However, sharing is 
> limited with the distributed cache because it is normally on a per-job basis. 
> On a large cluster, sometimes copying of jobjars and libjars becomes so 
> prevalent that it consumes a large portion of the network bandwidth, not to 
> speak of defeating the purpose of "bringing compute to where data is". This 
> is wasteful because in most cases code doesn't change much across many jobs.
> I'd like to propose and discuss feasibility of introducing a truly shared 
> cache so that multiple jobs from multiple users can share and cache jars. 
> This JIRA is to open the discussion.






[jira] [Created] (YARN-6032) scm cleaner task should rm InMemorySCMStore some cachedResources which does not exists in hdfs fs

2016-12-27 Thread Zhaofei Meng (JIRA)
Zhaofei Meng created YARN-6032:
--

 Summary:  scm cleaner task should rm InMemorySCMStore some 
cachedResources which does not exists in hdfs fs
 Key: YARN-6032
 URL: https://issues.apache.org/jira/browse/YARN-6032
 Project: Hadoop YARN
  Issue Type: Bug
Affects Versions: 2.7.1
Reporter: Zhaofei Meng
 Fix For: 2.7.1


If cached resources exist in the SCM but no longer exist in HDFS, they are not
removed from the SCM until the SCM is restarted. We should add a check to the
cleaner task that removes cachedResources that do not exist in HDFS.






[jira] [Commented] (YARN-2663) Race condintion in shared cache CleanerTask during deletion of resource

2016-12-27 Thread Zhaofei Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-2663?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15780152#comment-15780152
 ] 

Zhaofei Meng commented on YARN-2663:


The cleaner task removes the HDFS resource after it removes the SCM cache
entry, while the NM uploader task adds the SCM cache entry after it uploads
the HDFS resource. I suggest adding a lock shared by the cleaner task and the
NM uploader task to control the order of the SCM cache and HDFS resource
operations.
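
A minimal sketch of the locking idea, under strong assumptions: `LockedSharedCache` and its methods are hypothetical, `scmEntries` stands in for the SCM store and `fsFiles` for HDFS, and everything runs in one process. In the real system the uploader runs in the NodeManager and the cleaner in the SCM, so an in-process lock like this one would not be enough by itself; the sketch only illustrates the ordering that a shared lock would enforce.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical single-process model of the cleaner/uploader interaction. */
class LockedSharedCache {
  private final ReentrantLock lock = new ReentrantLock();
  private final Map<String, String> scmEntries = new HashMap<>(); // key -> file name
  private final Set<String> fsFiles = new HashSet<>();            // stand-in for HDFS

  /** Uploader path: write the file, then register it, as one critical section. */
  boolean uploadAndNotify(String key, String fileName) {
    lock.lock();
    try {
      if (scmEntries.containsKey(key)) {
        return false; // already cached, possibly under a different file name
      }
      fsFiles.add(fileName);
      scmEntries.put(key, fileName);
      return true;
    } finally {
      lock.unlock();
    }
  }

  /** Cleaner path: drop the entry, then the file, as one critical section. */
  boolean evict(String key) {
    lock.lock();
    try {
      String fileName = scmEntries.remove(key);
      if (fileName == null) {
        return false;
      }
      fsFiles.remove(fileName);
      return true;
    } finally {
      lock.unlock();
    }
  }

  /** True when every SCM entry has a file and every file has an entry. */
  boolean isConsistent() {
    lock.lock();
    try {
      return new HashSet<>(scmEntries.values()).equals(fsFiles);
    } finally {
      lock.unlock();
    }
  }
}
```

Because both two-step operations hold the same lock, an upload notification can no longer land between the cleaner's store removal and its file deletion.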

> Race condintion in shared cache CleanerTask during deletion of resource
> ---
>
> Key: YARN-2663
> URL: https://issues.apache.org/jira/browse/YARN-2663
> Project: Hadoop YARN
>  Issue Type: Sub-task
>Reporter: Chris Trezzo
>Priority: Blocker
>
> In CleanerTask, store.removeResource(key) and 
> removeResourceFromCacheFileSystem(path) do not happen together in atomic 
> fashion.
> Since resources could be uploaded with different file names, the SCM could 
> receive a notification to add a resource to the SCM between the two 
> operations. Thus, we have a scenario where the cleaner service deletes the 
> entry from the scm, receives a notification from the uploader (adding the 
> entry back into the scm) and then deletes the file from HDFS.
> Cleaner code that deletes resource:
> {code}
>   if (store.isResourceEvictable(key, resource)) {
>     try {
>       /*
>        * TODO: There is a race condition between store.removeResource(key)
>        * and removeResourceFromCacheFileSystem(path) operations because
>        * they do not happen atomically and resources can be uploaded with
>        * different file names by the node managers.
>        */
>       // remove the resource from scm (checks for appIds as well)
>       if (store.removeResource(key)) {
>         // remove the resource from the file system
>         boolean deleted = removeResourceFromCacheFileSystem(path);
>         if (deleted) {
>           resourceStatus = ResourceStatus.DELETED;
>         } else {
>           LOG.error("Failed to remove path from the file system."
>               + " Skipping this resource: " + path);
>           resourceStatus = ResourceStatus.ERROR;
>         }
>       } else {
>         // we did not delete the resource because it contained application
>         // ids
>         resourceStatus = ResourceStatus.PROCESSED;
>       }
>     } catch (IOException e) {
>       LOG.error(
>           "Failed to remove path from the file system. Skipping this resource: "
>               + path, e);
>       resourceStatus = ResourceStatus.ERROR;
>     }
>   } else {
>     resourceStatus = ResourceStatus.PROCESSED;
>   }
> {code}
> Uploader code that uploads resource:
> {code}
>   // create the temporary file
>   tempPath = new Path(directoryPath, getTemporaryFileName(actualPath));
>   if (!uploadFile(actualPath, tempPath)) {
>     LOG.warn("Could not copy the file to the shared cache at " + tempPath);
>     return false;
>   }
>   // set the permission so that it is readable but not writable
>   // TODO should I create the file with the right permission so I save the
>   // permission call?
>   fs.setPermission(tempPath, FILE_PERMISSION);
>   // rename it to the final filename
>   Path finalPath = new Path(directoryPath, actualPath.getName());
>   if (!fs.rename(tempPath, finalPath)) {
>     LOG.warn("The file already exists under " + finalPath +
>         ". Ignoring this attempt.");
>     deleteTempFile(tempPath);
>     return false;
>   }
>   // notify the SCM
>   if (!notifySharedCacheManager(checksumVal, actualPath.getName())) {
>     // the shared cache manager rejected the upload (as it is likely
>     // uploaded under a different name)
>     // clean up this file and exit
>     fs.delete(finalPath, false);
>     return false;
>   }
> {code}
> One solution is to have the UploaderService always rename the resource file 
> to the checksum of the resource plus the extension. With this fix we will 
> never receive a notify for the resource before the delete from the FS has 
> happened because the rename on the node manager will fail. If the node 
> manager uploads the file after it is deleted from the FS, we are ok and the 
> resource will simply get added back to the scm once a notification is 
> received.
> The classpath at the MapReduce layer is still usable because we leverage 
> links to preserve the original client file name.
> The downside is that now the shared cache files in HDFS are less readable. 
> This could be mitigated with an added admin command to the SCM that gives a 
> list of filenames associated with a checksum or vice versa.
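
The checksum-based naming described above can be sketched as follows. This is an illustration only: `ChecksumNaming` and both methods are hypothetical, and `java.nio.file` on a local file system stands in for the HDFS rename, which reports an existing destination by returning false rather than by throwing.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper that names cached files after their checksum. */
class ChecksumNaming {
  /** Builds "<checksum><extension>", keeping the original extension if there is one. */
  static String checksumFileName(String checksum, String originalName) {
    int dot = originalName.lastIndexOf('.');
    String extension = (dot > 0) ? originalName.substring(dot) : "";
    return checksum + extension;
  }

  /**
   * Moves tempFile to its checksum-derived name inside dir.
   * Returns false if a file with that name is already there, which is the
   * case where the uploader would give up without notifying the SCM.
   */
  static boolean renameToChecksumName(Path tempFile, Path dir, String checksum,
      String originalName) throws IOException {
    Path finalPath = dir.resolve(checksumFileName(checksum, originalName));
    try {
      Files.move(tempFile, finalPath); // no REPLACE_EXISTING: fails if target exists
      return true;
    } catch (FileAlreadyExistsException e) {
      Files.deleteIfExists(tempFile); // clean up the temporary copy
      return false;
    }
  }
}
```

Since every upload of the same content targets the same final name, a second uploader's rename fails while the first copy is still present, which is the property the proposed fix relies on.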




[jira] [Commented] (YARN-1492) truly shared cache for jars (jobjar/libjar)

2016-12-26 Thread Zhaofei Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15779423#comment-15779423
 ] 

Zhaofei Meng commented on YARN-1492:


When the SCM restarts, the appChecker task checks the state of the initialApps
periodically. I suggest shutting down the appChecker task once all initialApps
have completed, because the task is no longer useful after that point.
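
One way to picture the suggestion is a periodic task that stops its own scheduler once nothing is left to check. This is a minimal sketch, not the SCM's actual implementation: `SelfStoppingAppChecker`, `checkOnce`, and the `isAppActive` predicate are hypothetical names, and the predicate stands in for whatever liveness check the SCM really performs.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Predicate;

/** Hypothetical periodic checker that shuts itself down when its work is done. */
class SelfStoppingAppChecker {
  private final Set<String> initialApps = ConcurrentHashMap.newKeySet();
  private final Predicate<String> isAppActive; // assumption: stands in for the real check
  private final ScheduledExecutorService scheduler =
      Executors.newSingleThreadScheduledExecutor();

  SelfStoppingAppChecker(Set<String> apps, Predicate<String> isAppActive) {
    this.initialApps.addAll(apps);
    this.isAppActive = isAppActive;
  }

  /** One pass: forget finished apps; stop the scheduler once none remain. */
  boolean checkOnce() {
    initialApps.removeIf(app -> !isAppActive.test(app));
    if (initialApps.isEmpty()) {
      scheduler.shutdown(); // cancels further periodic runs
      return true;
    }
    return false;
  }

  /** Schedules checkOnce to run repeatedly until it shuts the scheduler down. */
  void start(long periodMillis) {
    scheduler.scheduleAtFixedRate(this::checkOnce, periodMillis, periodMillis,
        TimeUnit.MILLISECONDS);
  }

  boolean isStopped() {
    return scheduler.isShutdown();
  }
}
```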







[jira] [Created] (YARN-5971) All Events processed by one dispatcher in rm

2016-12-06 Thread Zhaofei Meng (JIRA)
Zhaofei Meng created YARN-5971:
--

 Summary: All Events processed by one dispatcher in rm
 Key: YARN-5971
 URL: https://issues.apache.org/jira/browse/YARN-5971
 Project: Hadoop YARN
  Issue Type: Improvement
Reporter: Zhaofei Meng


All events in the RM are processed by a single dispatcher. Is there a way to
divide the various events among multiple dispatchers?
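
To make the question concrete, here is a minimal sketch of routing events to per-type queues. It does not use YARN's dispatcher classes: `PartitionedDispatcher` and its methods are hypothetical, event types are plain strings, and each type gets its own single-threaded executor so that ordering is preserved within a type but not across types.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical dispatcher that gives every event type its own queue and thread. */
class PartitionedDispatcher {
  private final Map<String, ExecutorService> queues = new ConcurrentHashMap<>();
  private final Map<String, Consumer<String>> handlers = new ConcurrentHashMap<>();

  /** Registers a handler and creates a single-threaded queue for that event type. */
  void register(String eventType, Consumer<String> handler) {
    handlers.put(eventType, handler);
    queues.put(eventType, Executors.newSingleThreadExecutor());
  }

  /** Enqueues the event on the queue owned by its type; order is kept per type. */
  void dispatch(String eventType, String payload) {
    ExecutorService queue = queues.get(eventType);
    if (queue == null) {
      throw new IllegalArgumentException("No dispatcher for " + eventType);
    }
    Consumer<String> handler = handlers.get(eventType);
    queue.execute(() -> handler.accept(payload));
  }

  /** Drains all queues; returns true if every queue finished within the timeout. */
  boolean shutdownAndWait(long timeoutMillis) {
    for (ExecutorService queue : queues.values()) {
      queue.shutdown();
    }
    boolean allDone = true;
    for (ExecutorService queue : queues.values()) {
      try {
        allDone &= queue.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return false;
      }
    }
    return allDone;
  }
}
```

The trade-off is that handlers for different event types now run concurrently, so any state they share would need its own synchronization.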






[jira] [Commented] (YARN-5955) Use threadpool or multiple thread to recover app

2016-12-01 Thread Zhaofei Meng (JIRA)

[ 
https://issues.apache.org/jira/browse/YARN-5955?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15713671#comment-15713671
 ] 

Zhaofei Meng commented on YARN-5955:


OK.

> Use threadpool or multiple thread to recover app
> 
>
> Key: YARN-5955
> URL: https://issues.apache.org/jira/browse/YARN-5955
> Project: Hadoop YARN
>  Issue Type: Improvement
>  Components: resourcemanager
>Affects Versions: 2.7.1
>Reporter: Zhaofei Meng
>Assignee: Ajith S
> Fix For: 2.7.1
>
>
> Currently apps are recovered one by one; using a thread pool could make
> recovery faster.
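
A minimal sketch of the thread-pool idea, assuming that each app can be recovered independently of the others (which may not hold for the real RM recovery path). `ParallelRecovery`, `recoverAll`, and the `recoverOne` function are hypothetical names used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

/** Hypothetical helper that recovers apps on a fixed-size thread pool. */
class ParallelRecovery {
  /**
   * Applies recoverOne to every app id on a pool of the given size and returns
   * the results in input order. Throws IllegalStateException if any task fails.
   */
  static <R> List<R> recoverAll(List<String> appIds, int threads,
      Function<String, R> recoverOne) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    try {
      List<Callable<R>> tasks = new ArrayList<>();
      for (String appId : appIds) {
        tasks.add(() -> recoverOne.apply(appId));
      }
      List<R> results = new ArrayList<>();
      for (Future<R> future : pool.invokeAll(tasks)) { // blocks until all tasks finish
        results.add(future.get());
      }
      return results;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("Interrupted during recovery", e);
    } catch (ExecutionException e) {
      throw new IllegalStateException("Recovery failed", e.getCause());
    } finally {
      pool.shutdown();
    }
  }
}
```

For example, `ParallelRecovery.recoverAll(appIds, 8, id -> loadState(id))` would recover up to eight apps at a time, where `loadState` is a placeholder for the per-app recovery step.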


