Lin Yiqun created YARN-4381:
-------------------------------
Summary: Add container launchEvent and container localizeFailed
metrics in container
Key: YARN-4381
URL: https://issues.apache.org/jira/browse/YARN-4381
Project: Hadoop YARN
Issue Type: Improvement
Components: nodemanager
Affects Versions: 2.7.1
Reporter: Lin Yiqun
Recently, I found an issue with the NodeManager metrics:
{{NodeManagerMetrics#containersLaunched}} does not actually count the number of
containers that were launched successfully. A container can still fail after its
start request has been accepted, for example when it receives a kill command or
when its localization fails (container-localizationFailed), and it then ends up
as a failed container. Currently, however, the counter is incremented in the
following code regardless of whether the container later starts successfully or fails:
{code}
Credentials credentials = parseCredentials(launchContext);
Container container =
    new ContainerImpl(getConfig(), this.dispatcher,
        context.getNMStateStore(), launchContext,
        credentials, metrics, containerTokenIdentifier);
ApplicationId applicationID =
    containerId.getApplicationAttemptId().getApplicationId();
if (context.getContainers().putIfAbsent(containerId, container) != null) {
  NMAuditLogger.logFailure(user, AuditConstants.START_CONTAINER,
      "ContainerManagerImpl", "Container already running on this node!",
      applicationID, containerId);
  throw RPCUtil.getRemoteException("Container " + containerIdStr
      + " already is running on this node!!");
}
this.readLock.lock();
try {
  if (!serviceStopped) {
    // Create the application
    Application application =
        new ApplicationImpl(dispatcher, user, applicationID, credentials,
            context);
    if (null == context.getApplications().putIfAbsent(applicationID,
        application)) {
      LOG.info("Creating a new application reference for app " +
          applicationID);
      LogAggregationContext logAggregationContext =
          containerTokenIdentifier.getLogAggregationContext();
      Map<ApplicationAccessType, String> appAcls =
          container.getLaunchContext().getApplicationACLs();
      context.getNMStateStore().storeApplication(applicationID,
          buildAppProto(applicationID, user, credentials, appAcls,
              logAggregationContext));
      dispatcher.getEventHandler().handle(
          new ApplicationInitEvent(applicationID, appAcls,
              logAggregationContext));
    }
    this.context.getNMStateStore().storeContainer(containerId, request);
    dispatcher.getEventHandler().handle(
        new ApplicationContainerInitEvent(container));
    this.context.getContainerTokenSecretManager().startContainerSuccessful(
        containerTokenIdentifier);
    NMAuditLogger.logSuccess(user, AuditConstants.START_CONTAINER,
        "ContainerManageImpl", applicationID, containerId);
    // TODO launchedContainer misplaced -> doesn't necessarily mean a container
    // launch. A finished Application will not launch containers.
    metrics.launchedContainer();
    metrics.allocateContainer(containerTokenIdentifier.getResource());
  } else {
    throw new YarnException(
        "Container start failed as the NodeManager is " +
        "in the process of shutting down");
  }
} finally {
  this.readLock.unlock();
}
{code}
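To make the gap concrete, here is a minimal, self-contained Java model. It is not Hadoop code: {{LifecycleEvent}}, {{LaunchCounterSketch}} and {{replay}} are hypothetical names. It compares a counter that is bumped when the start request is accepted (where {{metrics.launchedContainer()}} is called today) with one that is bumped only once the container is actually launched, which is roughly what a launchEvent-based metric would measure.

{code:java}
// Hypothetical container lifecycle events; a simplified model, not Hadoop's API.
enum LifecycleEvent { START_REQUESTED, LOCALIZATION_FAILED, KILLED_BEFORE_LAUNCH, LAUNCHED }

class LaunchCounterSketch {
  // Current behaviour: counted as soon as the start request is accepted.
  final java.util.concurrent.atomic.AtomicInteger countedAtStart =
      new java.util.concurrent.atomic.AtomicInteger();
  // Proposed behaviour (assumption): counted only when the container is actually launched.
  final java.util.concurrent.atomic.AtomicInteger countedAtLaunch =
      new java.util.concurrent.atomic.AtomicInteger();

  void handle(LifecycleEvent event) {
    switch (event) {
      case START_REQUESTED:
        countedAtStart.incrementAndGet();
        break;
      case LAUNCHED:
        countedAtLaunch.incrementAndGet();
        break;
      default:
        // Failures after the start request (localization failure, kill) change neither counter.
        break;
    }
  }

  // Replays a sequence of events against a fresh instance and returns it.
  static LaunchCounterSketch replay(LifecycleEvent... events) {
    LaunchCounterSketch sketch = new LaunchCounterSketch();
    for (LifecycleEvent event : events) {
      sketch.handle(event);
    }
    return sketch;
  }
}
{code}

Replaying three start requests, where one container launches, one fails localization and one is killed before launch, leaves {{countedAtStart}} at 3 but {{countedAtLaunch}} at 1; the first number is what {{containersLaunched}} reports today.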
In addition, the container metrics lack a localizationFailed metric, i.e. a count of containers whose resource localization failed.
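A localization-failure metric could be a simple counter incremented when a container transitions into the LOCALIZATION_FAILED state. The sketch below assumes such a hook exists; {{LocalizationFailureTracker}}, {{onLocalizationFailed}} and {{getContainersLocalizationFailed}} are hypothetical names, not part of {{NodeManagerMetrics}}. It also de-duplicates by container ID so that a repeated failure notification for the same container is counted once. That is a design choice of this sketch, not something the current code does.

{code:java}
// Hypothetical per-NodeManager tracker of containers whose localization failed.
// All names here are illustrative assumptions, not Hadoop's API.
class LocalizationFailureTracker {
  // Container IDs already counted, so each container contributes at most once.
  private final java.util.Set<String> failedContainers =
      java.util.concurrent.ConcurrentHashMap.newKeySet();
  private final java.util.concurrent.atomic.AtomicInteger containersLocalizationFailed =
      new java.util.concurrent.atomic.AtomicInteger();

  // Assumed to be called when a container moves from LOCALIZING to LOCALIZATION_FAILED.
  void onLocalizationFailed(String containerId) {
    // Set.add returns false if the ID was already present, which skips double counting.
    if (failedContainers.add(containerId)) {
      containersLocalizationFailed.incrementAndGet();
    }
  }

  int getContainersLocalizationFailed() {
    return containersLocalizationFailed.get();
  }
}
{code}

Reporting a failure twice for one container and once for another yields a count of 2.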
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)