[jira] [Commented] (YARN-3491) Improve the public resource localization to do both FSDownload submission to the thread pool and completed localization handling in one thread (PublicLocalizer).

zhihai xu (JIRA) Wed, 15 Apr 2015 22:08:47 -0700

    [ 
https://issues.apache.org/jira/browse/YARN-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14497565#comment-14497565
 ]


zhihai xu commented on YARN-3491:
---------------------------------

Hi [~jlowe] and [~sjlee0], I think I know what the bottleneck is in 
PublicLocalizer#addResource.
I checked old NM logs produced by the 2.3.0 release code. 
PublicLocalizer#addResource took less than one millisecond in the 2.3.0 release:
{code}
2014-10-21 18:11:10,956 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp-602532977/asm.jar, 1413914982330, 
FILE, null }
2014-10-21 18:11:10,956 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp-983952127/start.jar, 1413914978818, 
FILE, null }
2014-10-21 18:11:10,957 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp-700474448/jsch.jar, 1413914981670, 
FILE, null }
2014-10-21 18:11:10,957 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp-295789958/kfs.jar, 1413914974035, 
FILE, null }
2014-10-21 18:11:10,957 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp1832142372/datasvc-search.jar, 
1413914970738, FILE, null }
2014-10-21 18:11:10,957 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp-1244404847/args4j.jar, 
1413914982044, FILE, null }
2014-10-21 18:11:10,957 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp729860031/slf4j-log4j12.jar, 
1413914980407, FILE, null }
2014-10-21 18:11:10,957 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp-1748521227/jackson-mapper-asl.jar, 
1413914983142, FILE, null }
2014-10-21 18:11:10,957 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp-246818030/jasper-compiler.jar, 
1413914979243, FILE, null }
2014-10-21 18:11:10,958 INFO 
org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService:
 Downloading public rsrc:{ 
hdfs://nameservice1/tmp/temp-1620691366/tmp-1703279108/spiffy.jar, 
1413914974080, FILE, null }
{code}

Then I compared the public localization code. The difference is in 
LocalResourcesTrackerImpl#getPathForLocalization.
The following code was added after the 2.3.0 release:
{code}
    rPath = new Path(rPath,
        Long.toString(uniqueNumberGenerator.incrementAndGet()));
    Path localPath = new Path(rPath, req.getPath().getName());
    LocalizedResource rsrc = localrsrc.get(req);
    rsrc.setLocalPath(localPath);
    LocalResource lr = LocalResource.newInstance(req.getResource(),
        req.getType(), req.getVisibility(), req.getSize(),
        req.getTimestamp());
    try {
      stateStore.startResourceLocalization(user, appId,
          ((LocalResourcePBImpl) lr).getProto(), localPath);
    } catch (IOException e) {
      LOG.error("Unable to record localization start for " + rsrc, e);
    }
{code}

I think stateStore.startResourceLocalization is most likely the bottleneck.
startResourceLocalization stores the state in LevelDB, and the LevelDB 
operation is time consuming. It needs to go through the JNI interface.
{code}
  public void startResourceLocalization(String user, ApplicationId appId,
      LocalResourceProto proto, Path localPath) throws IOException {
    String key = getResourceStartedKey(user, appId, localPath.toString());
    try {
      db.put(bytes(key), proto.toByteArray());
    } catch (DBException e) {
      throw new IOException(e);
    }
  }
{code}
I think it would be better to do these LevelDB operations in a separate thread, 
using an AsyncDispatcher in NMLeveldbStateStoreService.
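
To illustrate the idea of moving the store write off the calling thread, here is a minimal sketch. It is not the actual NMLeveldbStateStoreService change: the class AsyncStateStoreSketch, the BlockingStore interface, and the method names are all hypothetical, and a plain single-thread ExecutorService stands in for AsyncDispatcher so the example runs without Hadoop on the classpath.

```java
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch: queue slow state-store writes onto one writer thread. */
public class AsyncStateStoreSketch {

  /** Hypothetical stand-in for the blocking db.put(...) call. */
  public interface BlockingStore {
    void put(String key, byte[] value) throws IOException;
  }

  private final BlockingStore store;
  // A single writer thread keeps the writes in submission order.
  private final ExecutorService writer = Executors.newSingleThreadExecutor();

  public AsyncStateStoreSketch(BlockingStore store) {
    this.store = store;
  }

  /** Returns immediately; the actual write happens later on the writer thread. */
  public void startResourceLocalizationAsync(String key, byte[] value) {
    writer.execute(() -> {
      try {
        store.put(key, value);
      } catch (IOException e) {
        System.err.println("Unable to record localization start for " + key + ": " + e);
      }
    });
  }

  /** Stops accepting writes and waits for the queued ones to finish. */
  public boolean shutdownAndWait(long timeoutMillis) throws InterruptedException {
    writer.shutdown();
    return writer.awaitTermination(timeoutMillis, TimeUnit.MILLISECONDS);
  }

  public static void main(String[] args) throws Exception {
    Map<String, byte[]> db = new ConcurrentHashMap<>();
    // Simulate a slow store: each put takes about 20 ms.
    BlockingStore slow = (k, v) -> {
      try {
        Thread.sleep(20);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      db.put(k, v);
    };
    AsyncStateStoreSketch sketch = new AsyncStateStoreSketch(slow);

    long start = System.nanoTime();
    for (int i = 0; i < 10; i++) {
      sketch.startResourceLocalizationAsync("rsrc-" + i, new byte[] {(byte) i});
    }
    long submitMillis = (System.nanoTime() - start) / 1_000_000;

    boolean drained = sketch.shutdownAndWait(5_000);
    System.out.println("submitted in ~" + submitMillis + " ms, drained=" + drained
        + ", stored=" + db.size());
  }
}
```

One trade-off to keep in mind with this kind of design: the caller no longer knows that the record has been persisted when the method returns, so a crash between submission and the write could lose the "localization started" entry. Whether that is acceptable for NM recovery would need to be checked.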

> Improve the public resource localization to do both FSDownload submission to 
> the thread pool and completed localization handling in one thread 
> (PublicLocalizer).
> -----------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-3491
>                 URL: https://issues.apache.org/jira/browse/YARN-3491
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>            Priority: Critical
>
> Improve the public resource localization to do both FSDownload submission to 
> the thread pool and completed localization handling in one thread 
> (PublicLocalizer).
> Currently FSDownload submission to the thread pool is done in 
> PublicLocalizer#addResource which is running in Dispatcher thread and 
> completed localization handling is done in PublicLocalizer#run which is 
> running in PublicLocalizer thread.
> Because PublicLocalizer#addResource is time consuming, the thread pool can't 
> be fully utilized. Instead of localizing public resources in parallel 
> (multithreaded), public resource localization is serialized most of the time.
> There are two more benefits to this change:
> 1. The Dispatcher thread won't be blocked by PublicLocalizer#addResource. 
> The Dispatcher thread handles most of the time-critical events in the NodeManager.
> 2. No synchronization is needed on the HashMap (pending), 
> because pending will only be accessed in the PublicLocalizer thread.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
