kyungwan nam created YARN-9691:
----------------------------------
Summary: canceling upgrade does not work if upgrade failed container is existing
Key: YARN-9691
URL: https://issues.apache.org/jira/browse/YARN-9691
Project: Hadoop YARN
Issue Type: Bug
Reporter: kyungwan nam
Assignee: kyungwan nam
If a container fails to upgrade during a YARN service upgrade, the container is
released and transitions to the FAILED_UPGRADE state.
After that, I expected to be able to roll it back to the previous version using
cancel-upgrade, but it didn’t work.
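To make the expected behavior concrete, here is a minimal Java sketch of the per-instance spec-state flow, using the state names that appear in the AM log. The `UpgradeStateSketch` class, its transition table, and the assumed rollback path (FAILED_UPGRADE back through NEEDS_UPGRADE and UPGRADING to READY) are hypothetical simplifications, not the actual state machine in Hadoop's `ComponentInstance`.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

/**
 * Hypothetical, simplified model of the per-instance spec-state flow seen in
 * the AM log. It is NOT the state machine in Hadoop's ComponentInstance class.
 */
public class UpgradeStateSketch {

    enum SpecState { READY, NEEDS_UPGRADE, UPGRADING, RUNNING_BUT_UNREADY, FAILED_UPGRADE }

    // Allowed next states; this table is an illustrative assumption.
    static final Map<SpecState, EnumSet<SpecState>> NEXT = new EnumMap<>(SpecState.class);

    static {
        // an upgrade (or cancel-upgrade) is requested for a ready instance
        NEXT.put(SpecState.READY, EnumSet.of(SpecState.NEEDS_UPGRADE));
        // start upgrading, or go straight back to READY when there is "nothing to cancel"
        NEXT.put(SpecState.NEEDS_UPGRADE, EnumSet.of(SpecState.UPGRADING, SpecState.READY));
        // container re-initialized but readiness not yet confirmed
        NEXT.put(SpecState.UPGRADING, EnumSet.of(SpecState.RUNNING_BUT_UNREADY));
        // readiness probe either passes or the upgrade is declared failed
        NEXT.put(SpecState.RUNNING_BUT_UNREADY,
                 EnumSet.of(SpecState.READY, SpecState.FAILED_UPGRADE));
        // cancel-upgrade marks the failed instance as needing a rollback
        NEXT.put(SpecState.FAILED_UPGRADE, EnumSet.of(SpecState.NEEDS_UPGRADE));
    }

    /** Returns true if every consecutive pair in the path is an allowed transition. */
    static boolean isValidPath(SpecState... path) {
        for (int i = 1; i < path.length; i++) {
            if (!NEXT.get(path[i - 1]).contains(path[i])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Path logged for sleep-0, followed by the rollback the reporter expected (assumed).
        SpecState[] expected = {
            SpecState.NEEDS_UPGRADE, SpecState.UPGRADING, SpecState.RUNNING_BUT_UNREADY,
            SpecState.FAILED_UPGRADE, SpecState.NEEDS_UPGRADE, SpecState.UPGRADING,
            SpecState.RUNNING_BUT_UNREADY, SpecState.READY
        };
        System.out.println("expected rollback path valid: " + isValidPath(expected));
    }
}
```

Under this simplified model the rollback path is valid, which is the outcome the reporter expected cancel-upgrade to produce for a container in FAILED_UPGRADE.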
The AM log covering the failed upgrade and the cancel-upgrade request is as follows:
{code}
# failed to upgrade container_e62_1563179597798_0006_01_000008
2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO service.ClientAMService - Upgrade container container_e62_1563179597798_0006_01_000008
2019-07-16 18:21:55,153 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from NEEDS_UPGRADE -> UPGRADING
2019-07-16 18:21:55,154 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] Transitioned from READY to UPGRADING on UPGRADE event
2019-07-16 18:21:55,154 [pool-5-thread-4] INFO registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008]: Deleting registry path /users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-000008
2019-07-16 18:21:55,156 [pool-6-thread-6] INFO provider.ProviderUtils - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] version 1.0.1 : Creating dir on hdfs: hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO containerlaunch.ContainerLaunchService - reInitializing container container_e62_1563179597798_0006_01_000008 with version 1.0.1
2019-07-16 18:21:55,157 [pool-6-thread-6] INFO containerlaunch.AbstractLauncher - yarn docker env var has been set {LANGUAGE=en_US.UTF-8, HADOOP_USER_NAME=test, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM, WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker, YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=<LOG_DIR>}
2019-07-16 18:21:55,158 [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER for Container container_e62_1563179597798_0006_01_000008
2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from UPGRADING -> RUNNING_BUT_UNREADY
2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] retrieve status after 30
2019-07-16 18:21:55,167 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] Transitioned from UPGRADING to REINITIALIZED on START event
2019-07-16 18:22:07,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:22:37,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:07,797 [pool-7-thread-1] INFO monitor.ServiceMonitor - Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 KST 2019", outcome="failure", message="Failure in Default probe: IP presence", exception="java.io.IOException: sleep-0: IP is not available yet"
2019-07-16 18:23:08,225 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from RUNNING_BUT_UNREADY -> FAILED_UPGRADE
# request canceling upgrade
2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_000004 true
2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_000003 true
2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - Upgrade container container_e62_1563179597798_0006_01_000008 true
2019-07-16 18:28:22,713 [Component dispatcher] INFO service.ServiceManager - [SERVICE] spec state changed from UPGRADING -> CANCEL_UPGRADING
2019-07-16 18:28:22,713 [Component dispatcher] INFO component.Component - [COMPONENT sleep]: need upgrade to 1.0.0
2019-07-16 18:28:22,713 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] spec state state changed from FAILED_UPGRADE -> NEEDS_UPGRADE
2019-07-16 18:28:22,713 [Component dispatcher] INFO component.Component - [COMPONENT sleep] Transitioned from UPGRADING to CANCEL_UPGRADING on CANCEL_UPGRADE event.
2019-07-16 18:28:22,713 [Component dispatcher] INFO component.Component - [COMPONENT sleep1]: need upgrade to 1.0.0
2019-07-16 18:28:22,714 [Component dispatcher] INFO component.Component - [COMPONENT sleep1] Transitioned from UPGRADING to CANCEL_UPGRADING on CANCEL_UPGRADE event.
2019-07-16 18:28:22,714 [Component dispatcher] INFO instance.ComponentInstance - container_e62_1563179597798_0006_01_000004 nothing to cancel
2019-07-16 18:28:22,714 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-2 : container_e62_1563179597798_0006_01_000004] spec state state changed from NEEDS_UPGRADE -> READY
2019-07-16 18:28:22,714 [Component dispatcher] INFO instance.ComponentInstance - container_e62_1563179597798_0006_01_000003 nothing to cancel
2019-07-16 18:28:22,714 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE sleep-1 : container_e62_1563179597798_0006_01_000003] spec state state changed from NEEDS_UPGRADE -> READY
2019-07-16 18:28:22,714 [Component dispatcher] ERROR service.ServiceScheduler - No component instance exists for container_e62_1563179597798_0006_01_000008
{code}
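The final ERROR line suggests that cancel-upgrade still includes container_e62_1563179597798_0006_01_000008 (it appears in the "Upgrade container ... true" entries), while the container-to-instance lookup no longer finds it after it was released. The Java sketch below reproduces that failure mode with a plain `HashMap`. `StaleContainerLookupSketch`, `register`, `release`, and `cancelUpgrade` are hypothetical names chosen for illustration and do not reflect the real `ServiceScheduler` code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical sketch: cancel-upgrade still refers to a container id that was
 * released after FAILED_UPGRADE, so the id -> instance lookup finds nothing.
 * Names are illustrative, not Hadoop's ServiceScheduler API.
 */
public class StaleContainerLookupSketch {

    // container id -> component instance name (stand-in for the scheduler's live-container bookkeeping)
    private final Map<String, String> liveContainers = new HashMap<>();

    void register(String containerId, String instanceName) {
        liveContainers.put(containerId, instanceName);
    }

    // Assumption taken from the report: a container that fails to upgrade is released.
    void release(String containerId) {
        liveContainers.remove(containerId);
    }

    /** Returns the container ids in the upgrade that no longer map to a component instance. */
    List<String> cancelUpgrade(List<String> containersInUpgrade) {
        List<String> missing = new ArrayList<>();
        for (String id : containersInUpgrade) {
            if (liveContainers.get(id) == null) {
                // Corresponds to "No component instance exists for <id>" in the AM log.
                missing.add(id);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        StaleContainerLookupSketch scheduler = new StaleContainerLookupSketch();
        String c3 = "container_e62_1563179597798_0006_01_000003";
        String c4 = "container_e62_1563179597798_0006_01_000004";
        String c8 = "container_e62_1563179597798_0006_01_000008";
        scheduler.register(c3, "sleep-1");
        scheduler.register(c4, "sleep-2");
        scheduler.register(c8, "sleep-0");
        scheduler.release(c8); // sleep-0's container failed to upgrade and was released
        // Prints [container_e62_1563179597798_0006_01_000008]
        System.out.println(scheduler.cancelUpgrade(List.of(c4, c3, c8)));
    }
}
```

In this model sleep-1 and sleep-2 are still found, matching their "nothing to cancel" entries, while the released sleep-0 container is the only one reported as missing.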
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)