[
https://issues.apache.org/jira/browse/YARN-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250811#comment-15250811
]
Vinod Kumar Vavilapalli commented on YARN-4697:
-----------------------------------------------
bq. In the case that we have had a problem with log aggregation this could
cause a problem on restart. The number of threads created at that point could
be huge and will put a large load on the NameNode and in the worst case could
even bring it down due to file descriptor issues.
[~haibochen] / [~rkanter] / [~vvasudev] / [~leftnoteasy], while I agree with
the general notion that thread pool limits are good, I fail to see how this
problem arises. The number of log-aggregation threads should be limited by the
number of applications concurrently running and finishing in a cluster, which
should be on the order of thousands. Is there something special happening at
restart time?
My concern is that if we don't fix the root cause, then even though we've
protected ourselves from crashes, we'd just be queueing a lot of aggregation
processes and causing long waiting times.
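
To make the queueing concern concrete, here is a minimal sketch of a bounded
pool. It is not the patch's code: the class name, the pool size of 4, and the
100 blocking tasks are all hypothetical stand-ins. It only shows that once the
thread count is capped, the excess work sits in the queue until a thread frees
up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from LogAggregationService.
class BoundedAggregationPoolSketch {

    /**
     * Submits taskCount tasks that block until released, then reports
     * {largest number of threads created, number of tasks waiting in the queue}.
     */
    static int[] submitAndMeasure(int poolSize, int taskCount) throws InterruptedException {
        // Fixed-size pool: at most poolSize threads; everything else waits in the queue.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<Runnable>());
        CountDownLatch release = new CountDownLatch(1);
        CountDownLatch started = new CountDownLatch(Math.min(poolSize, taskCount));

        for (int i = 0; i < taskCount; i++) {
            pool.execute(() -> {
                started.countDown();
                try {
                    release.await(); // stand-in for a slow log upload
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        started.await(); // every worker thread is now busy
        int[] result = { pool.getLargestPoolSize(), pool.getQueue().size() };

        release.countDown(); // let the blocked tasks finish
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        int[] r = submitAndMeasure(4, 100);
        System.out.println("threads=" + r[0] + " queued=" + r[1]);
    }
}
```

The cap keeps the thread count flat, but the backlog only moves into the
queue; it does not go away unless the underlying cause is fixed.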
The other thing we should do with this patch is to have each thread identify
the application it is currently aggregating, so that we can debug issues
better.
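
One possible way to do that, sketched under assumptions: wrap each aggregation
task so that the worker thread's name carries the application ID while the task
runs, which makes it visible in thread dumps. The wrapper class and the name
format below are hypothetical, not something the patch defines.

```java
// Hypothetical wrapper; the class name and the name suffix format are assumptions.
class NamedAggregationTask implements Runnable {
    private final String appId;
    private final Runnable delegate;

    NamedAggregationTask(String appId, Runnable delegate) {
        this.appId = appId;
        this.delegate = delegate;
    }

    @Override
    public void run() {
        Thread current = Thread.currentThread();
        String originalName = current.getName();
        // Tag the pooled thread with the application it is working on.
        current.setName(originalName + " [aggregating " + appId + "]");
        try {
            delegate.run();
        } finally {
            // Restore the name so the pooled thread is not mislabeled later.
            current.setName(originalName);
        }
    }

    public static void main(String[] args) {
        Runnable work = () -> System.out.println(Thread.currentThread().getName());
        new NamedAggregationTask("application_0000000000000_0001", work).run();
    }
}
```

Restoring the name in the finally block matters because pool threads are
reused across applications.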
> NM aggregation thread pool is not bound by limits
> -------------------------------------------------
>
> Key: YARN-4697
> URL: https://issues.apache.org/jira/browse/YARN-4697
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Priority: Critical
> Fix For: 2.9.0
>
> Attachments: yarn4697.001.patch, yarn4697.002.patch,
> yarn4697.003.patch, yarn4697.004.patch
>
>
> In LogAggregationService.java we create a thread pool to upload logs from
> the NodeManager to HDFS if log aggregation is turned on. This is a cached
> thread pool, which according to the javadoc is an unlimited pool of threads.
> In the case that we have had a problem with log aggregation this could cause
> a problem on restart. The number of threads created at that point could be
> huge and will put a large load on the NameNode and in the worst case could
> even bring it down due to file descriptor issues.