[ https://issues.apache.org/jira/browse/YARN-4697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15155015#comment-15155015 ]
Robert Kanter commented on YARN-4697:
-------------------------------------
Besides [~wilfreds]'s comments, I have some feedback on the unit test:
- We should use more than 1 thread in the thread pool, because having only 1 of
something can sometimes hide problems. Something like 3 would be better.
- In case something goes wrong, it would be good to:
-- add a timeout to the test: {{@Test(timeout=30000)}}
-- make the threads not block indefinitely. That can be done by using
[{{tryAcquire(long timeout, TimeUnit
unit)}}|https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/Semaphore.html#tryAcquire(long,%20java.util.concurrent.TimeUnit)]
instead of just {{acquire()}}. If you make the timeout for the threads longer
than the timeout for the test itself, you won't have to worry about timing
problems from a thread exiting early, and the threads still can't hang
forever.
- The way you're searching for threads is okay, but it would be better if we
could get them directly from the thread pool. I see that
{{LogAggregationService}} only exposes an {{ExecutorService}} for the thread
pool, but looking at how it's made, I believe it's actually a
{{ThreadPoolExecutor}} underneath. Can you try casting to
{{ThreadPoolExecutor}} and see if that works? {{ThreadPoolExecutor}} has
methods to check how many threads are running etc. If that doesn't work, then
I'm okay with the current approach.
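To illustrate the suggestions above, here is a minimal sketch. It does not use the patch's code: the class name {{BoundedPoolSketch}}, the helper {{observedPoolSize}}, and the constants are hypothetical, and a plain {{Executors.newFixedThreadPool}} stands in for the pool that {{LogAggregationService}} creates. The cast to {{ThreadPoolExecutor}} relies on how the JDK's {{Executors}} factory happens to build the pool, so treat it as an assumption that should be verified against the real service.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Semaphore;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; names and values are illustrative, not from the patch.
class BoundedPoolSketch {
    // Assumed worker timeout: longer than a 30 s test timeout, so a worker
    // never gives up before the test itself would, yet cannot hang forever.
    static final long WORKER_TIMEOUT_MS = 35_000;

    /** Submits blocking tasks to a fixed pool and returns the pool's thread count. */
    static int observedPoolSize(int poolSize, int taskCount) throws InterruptedException {
        ExecutorService service = Executors.newFixedThreadPool(poolSize);
        Semaphore gate = new Semaphore(0); // no permits: every worker blocks
        try {
            for (int i = 0; i < taskCount; i++) {
                service.submit(() -> {
                    try {
                        // Bounded wait instead of acquire(), so the thread
                        // cannot block indefinitely if the test goes wrong.
                        gate.tryAcquire(WORKER_TIMEOUT_MS, TimeUnit.MILLISECONDS);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Assumption: the ExecutorService is a ThreadPoolExecutor underneath.
            ThreadPoolExecutor pool = (ThreadPoolExecutor) service;
            return pool.getPoolSize();
        } finally {
            gate.release(taskCount); // unblock every worker
            service.shutdown();
            service.awaitTermination(5, TimeUnit.SECONDS);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        // 5 tasks on a pool of 3: the pool should not grow beyond 3 threads.
        System.out.println("pool size = " + observedPoolSize(3, 5));
    }
}
```

In the actual unit test, the returned value would be compared against the configured thread pool limit, with {{@Test(timeout=30000)}} on the test method as a backstop.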
> NM aggregation thread pool is not bound by limits
> -------------------------------------------------
>
> Key: YARN-4697
> URL: https://issues.apache.org/jira/browse/YARN-4697
> Project: Hadoop YARN
> Issue Type: Improvement
> Components: nodemanager
> Reporter: Haibo Chen
> Assignee: Haibo Chen
> Attachments: yarn4697.001.patch, yarn4697.002.patch
>
>
> In the LogAggregationService.java we create a threadpool to upload logs from
> the nodemanager to HDFS if log aggregation is turned on. This is a cached
> threadpool which, based on the javadoc, is an unlimited pool of threads.
> In the case that we have had a problem with log aggregation this could cause
> a problem on restart. The number of threads created at that point could be
> huge and will put a large load on the NameNode and in the worst case could even
> bring it down due to file descriptor issues.