[ 
https://issues.apache.org/jira/browse/YARN-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15250811#comment-15250811
 ] 

Vinod Kumar Vavilapalli commented on YARN-4697:
-----------------------------------------------

bq. In the case that we have had a problem with log aggregation, this could 
cause a problem on restart. The number of threads created at that point could 
be huge and will put a large load on the NameNode, and in the worst case could 
even bring it down due to file descriptor issues.
[~haibochen] / [~rkanter] / [~vvasudev] / [~leftnoteasy], while I agree with the 
general notion that thread pool limits are good, I fail to see how this problem 
is happening. The number of log-aggregation threads should be limited by the 
number of applications concurrently running and finishing in a cluster, which 
should be on the order of thousands. Is there something special happening at 
restart time?

My concern is that if we don't fix the root cause, then even though we've 
protected ourselves from crashes, we'd just be queueing a lot of aggregation 
processes and causing long waiting times.
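
To make the queueing concern concrete, here is a minimal sketch of what a 
bounded pool does when more aggregation tasks arrive than it has threads. It is 
not the code from the attached patches; the class and method names are 
hypothetical, and the blocked tasks merely stand in for stuck aggregations.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration only; not taken from LogAggregationService.
public class BoundedPoolSketch {

    /**
     * Submits `tasks` blocked tasks to a fixed-size pool of `poolSize` threads
     * and returns {threads created, tasks left waiting in the queue}.
     */
    public static int[] submitAndMeasure(int poolSize, int tasks) {
        // Same shape as Executors.newFixedThreadPool(poolSize), built directly
        // so that the pool size and the queue can be inspected.
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
            poolSize, poolSize, 0L, TimeUnit.MILLISECONDS,
            new LinkedBlockingQueue<Runnable>());

        // Every task blocks on this latch, standing in for a slow aggregation.
        CountDownLatch release = new CountDownLatch(1);
        for (int i = 0; i < tasks; i++) {
            pool.execute(() -> {
                try {
                    release.await();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        int threads = pool.getPoolSize();    // capped at poolSize
        int queued = pool.getQueue().size(); // everything else just waits

        // Unblock the tasks and shut the pool down cleanly.
        release.countDown();
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return new int[] {threads, queued};
    }
}
```

With 4 threads and 20 blocked tasks, the pool never grows past 4 threads and 
the remaining 16 tasks sit in the queue. The limit prevents the thread 
explosion, but the backlog is still there.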

The other thing we should do with this patch is have each thread identify the 
application it is currently aggregating, so that we can debug issues better.
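
One possible way to do this, sketched under assumptions: wrap each aggregation 
task so that the worker thread is renamed while it runs. The class name, the 
"LogAggregator-" prefix and the way the application ID is passed in are all 
hypothetical and not what the patch necessarily does.

```java
// Hypothetical illustration only; names and prefix are assumptions.
public class NamedAggregatorSketch {

    /**
     * Wraps `work` so that the executing thread carries the application ID in
     * its name while the task runs, then restores the previous name.
     */
    public static Runnable namedFor(String appId, Runnable work) {
        return () -> {
            Thread current = Thread.currentThread();
            String previousName = current.getName();
            current.setName("LogAggregator-" + appId);
            try {
                work.run();
            } finally {
                // Pool threads are reused, so put the old name back.
                current.setName(previousName);
            }
        };
    }
}
```

A thread dump taken while aggregation is stuck would then show which 
application each pool thread is working on.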

> NM aggregation thread pool is not bound by limits
> -------------------------------------------------
>
>                 Key: YARN-4697
>                 URL: https://issues.apache.org/jira/browse/YARN-4697
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: nodemanager
>            Reporter: Haibo Chen
>            Assignee: Haibo Chen
>            Priority: Critical
>             Fix For: 2.9.0
>
>         Attachments: yarn4697.001.patch, yarn4697.002.patch, 
> yarn4697.003.patch, yarn4697.004.patch
>
>
> In LogAggregationService.java we create a thread pool to upload logs from 
> the NodeManager to HDFS if log aggregation is turned on. This is a cached 
> thread pool which, according to the Javadoc, is an unbounded pool of threads.
> In the case that we have had a problem with log aggregation, this could cause 
> a problem on restart. The number of threads created at that point could be 
> huge and will put a large load on the NameNode, and in the worst case could 
> even bring it down due to file descriptor issues.


