liyakun created YARN-9480:
-----------------------------
Summary: createAppDir() in LogAggregationService shouldn't block
dispatcher thread of ContainerManagerImpl
Key: YARN-9480
URL: https://issues.apache.org/jira/browse/YARN-9480
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Reporter: liyakun
Assignee: liyakun
Currently, when startContainers() is called and the NM does not yet know about
the application, it goes through the INIT_APPLICATION step. During application
initialization, createAppDir() is executed, and it is a blocking operation.
createAppDir() has to interact with an external file system, so its latency is
bound by that file system's SLA. Once the external file system becomes slow, the
NM dispatcher thread of ContainerManagerImpl gets stuck. (In fact, I have seen
an NM stuck here for more than an hour.)
I think it would be more reasonable to defer createAppDir() until logs are
actually uploaded (on other threads). Moreover, depending on the
logRetentionPolicy, many containers may never reach this step, which would save
a lot of interactions with the external file system.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)