[
https://issues.apache.org/jira/browse/YARN-8346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487175#comment-16487175
]
Rohith Sharma K S commented on YARN-8346:
-----------------------------------------
In ContainerScheduler#enqueueContainer, the execution type is not set for a
container recovered from 2.8.4, so the check falls through to the else branch,
where the maximum opportunistic queue length is zero. A kill event is then sent
for the container, which results in running containers being killed.
{code}
private boolean enqueueContainer(Container container) {
  // A recovered 2.8.4 container has no execution type set, so this is false.
  boolean isGuaranteedContainer = container.getContainerTokenIdentifier().
      getExecutionType() == ExecutionType.GUARANTEED;
  boolean isQueued;
  if (isGuaranteedContainer) {
    queuedGuaranteedContainers.put(container.getContainerId(), container);
    isQueued = true;
  } else {
    // With maxOppQueueLength == 0 this condition can never hold.
    if (queuedOpportunisticContainers.size() < maxOppQueueLength) {
      LOG.info("Opportunistic container {} will be queued at the NM.",
          container.getContainerId());
      queuedOpportunisticContainers.put(
          container.getContainerId(), container);
      isQueued = true;
    } else {
      LOG.info("Opportunistic container [{}] will not be queued at the NM" +
          "since max queue length [{}] has been reached",
          container.getContainerId(), maxOppQueueLength);
      container.sendKillEvent(
          ContainerExitStatus.KILLED_BY_CONTAINER_SCHEDULER,
          "Opportunistic container queue is full.");
      isQueued = false;
    }
  }
  // Remainder of the method is not included in this excerpt.
{code}
Since the opportunistic container feature exists in 2.9, I think this would
also be an issue when upgrading to 2.9.
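To make the fall-through concrete, here is a minimal, self-contained sketch of the decision described above. It is not the NodeManager code: the class EnqueueSketch, its enums, the treatMissingAsGuaranteed flag, and the container id are all hypothetical, and modelling the unset execution type as null is an assumption. The guard shown is only one possible way to avoid the kill, not necessarily the fix for this issue.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical model of the enqueue decision; not the real ContainerScheduler.
class EnqueueSketch {
  // Stand-ins for YARN's ExecutionType and for the result of an enqueue attempt.
  enum ExecutionType { GUARANTEED, OPPORTUNISTIC }
  enum Outcome { QUEUED_GUARANTEED, QUEUED_OPPORTUNISTIC, KILLED }

  private final Set<String> queuedGuaranteed = new LinkedHashSet<>();
  private final Set<String> queuedOpportunistic = new LinkedHashSet<>();
  private final int maxOppQueueLength;
  // Assumption: a switch that enables the illustrative guard below.
  private final boolean treatMissingAsGuaranteed;

  EnqueueSketch(int maxOppQueueLength, boolean treatMissingAsGuaranteed) {
    this.maxOppQueueLength = maxOppQueueLength;
    this.treatMissingAsGuaranteed = treatMissingAsGuaranteed;
  }

  // 'type' is null when the recovered container carries no execution type
  // (assumed representation of "not set").
  Outcome enqueue(String containerId, ExecutionType type) {
    ExecutionType effective = type;
    if (effective == null && treatMissingAsGuaranteed) {
      effective = ExecutionType.GUARANTEED;
    }
    if (effective == ExecutionType.GUARANTEED) {
      queuedGuaranteed.add(containerId);
      return Outcome.QUEUED_GUARANTEED;
    }
    // Anything that is not GUARANTEED, including null, lands here.
    if (queuedOpportunistic.size() < maxOppQueueLength) {
      queuedOpportunistic.add(containerId);
      return Outcome.QUEUED_OPPORTUNISTIC;
    }
    // Queue is full (always the case when the limit is 0): container is killed.
    return Outcome.KILLED;
  }

  public static void main(String[] args) {
    String id = "container_recovered_from_2_8_4"; // hypothetical id
    // Without the guard, a missing type with queue length 0 is killed.
    System.out.println(new EnqueueSketch(0, false).enqueue(id, null)); // KILLED
    // With the guard, the same container is queued as guaranteed.
    System.out.println(new EnqueueSketch(0, true).enqueue(id, null)); // QUEUED_GUARANTEED
  }
}
```

With a queue length of zero, every container whose type is not GUARANTEED is killed, which matches the "max queue length [0]" message in the NM log.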
cc:/ [~jlowe] [[email protected]]
> Upgrading to 3.1 kills running containers with error "Opportunistic container
> queue is full"
> --------------------------------------------------------------------------------------------
>
> Key: YARN-8346
> URL: https://issues.apache.org/jira/browse/YARN-8346
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Rohith Sharma K S
> Priority: Major
>
> During a rolling upgrade from 2.8.4 to the 3.1 release, all running
> containers are killed and a second attempt is launched for the
> application. The diagnostics message is "Opportunistic container queue is
> full", which is the reason given for killing the containers.
> In the NM log, I see the following after the container is recovered.
> {noformat}
> 2018-05-23 17:18:50,655 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.scheduler.ContainerScheduler:
> Opportunistic container [container_e06_1527075664705_0001_01_000001] will
> not be queued at the NMsince max queue length [0] has been reached
> {noformat}
> The following steps were executed for the rolling upgrade:
> # Install a 2.8.4 cluster and launch an MR job with distributed cache enabled.
> # Stop the 2.8.4 RM. Start the 3.1.0 RM with the same configuration.
> # Stop the 2.8.4 NMs batch by batch. Start the 3.1.0 NMs batch by batch.