[ 
https://issues.apache.org/jira/browse/YARN-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

kyungwan nam updated YARN-9691:
-------------------------------
    Attachment: YARN-9691.002.patch

> canceling upgrade does not work if upgrade failed container is existing
> -----------------------------------------------------------------------
>
>                 Key: YARN-9691
>                 URL: https://issues.apache.org/jira/browse/YARN-9691
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: kyungwan nam
>            Assignee: kyungwan nam
>            Priority: Major
>         Attachments: YARN-9691.001.patch, YARN-9691.002.patch
>
>
> If a container fails to upgrade during a YARN service upgrade, the container 
> is released and the instance transitions to the FAILED_UPGRADE state.
> After that, I expected to be able to go back to the previous version using 
> cancel-upgrade, but it didn't work.
> At that time, the AM log was as follows:
> {code}
> # failed to upgrade container_e62_1563179597798_0006_01_000008
> 2019-07-16 18:21:55,152 [IPC Server handler 0 on 39483] INFO  
> service.ClientAMService - Upgrade container 
> container_e62_1563179597798_0006_01_000008
> 2019-07-16 18:21:55,153 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_000008] spec state state changed from 
> NEEDS_UPGRADE -> UPGRADING
> 2019-07-16 18:21:55,154 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_000008] Transitioned from READY to 
> UPGRADING on UPGRADE event
> 2019-07-16 18:21:55,154 [pool-5-thread-4] INFO  
> registry.YarnRegistryViewForProviders - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_000008]: Deleting registry path 
> /users/test/services/yarn-service/sleeptest/components/ctr-e62-1563179597798-0006-01-000008
> 2019-07-16 18:21:55,156 [pool-6-thread-6] INFO  provider.ProviderUtils - 
> [COMPINSTANCE sleep-0 : container_e62_1563179597798_0006_01_000008] version 
> 1.0.1 : Creating dir on hdfs: 
> hdfs://test1.com:8020/user/test/.yarn/services/sleeptest/components/1.0.1/sleep/sleep-0
> 2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
> containerlaunch.ContainerLaunchService - reInitializing container 
> container_e62_1563179597798_0006_01_000008 with version 1.0.1
> 2019-07-16 18:21:55,157 [pool-6-thread-6] INFO  
> containerlaunch.AbstractLauncher - yarn docker env var has been set 
> {LANGUAGE=en_US.UTF-8, HADOOP_USER_NAME=test, 
> YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_HOSTNAME=sleep-0.sleeptest.test.EXAMPLE.COM,
>  WORK_DIR=$PWD, LC_ALL=en_US.UTF-8, YARN_CONTAINER_RUNTIME_TYPE=docker, 
> YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=registry.test.com/test/sleep1:latest, 
> LANG=en_US.UTF-8, YARN_CONTAINER_RUNTIME_DOCKER_CONTAINER_NETWORK=bridge, 
> YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true, LOG_DIR=<LOG_DIR>}
> 2019-07-16 18:21:55,158 
> [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #7] INFO  
> impl.NMClientAsyncImpl - Processing Event EventType: REINITIALIZE_CONTAINER 
> for Container container_e62_1563179597798_0006_01_000008
> 2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_000008] spec state state changed from 
> UPGRADING -> RUNNING_BUT_UNREADY
> 2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_000008] retrieve status after 30
> 2019-07-16 18:21:55,167 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_000008] Transitioned from UPGRADING to 
> REINITIALIZED on START event
> 2019-07-16 18:22:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
> Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:07 
> KST 2019", outcome="failure", message="Failure in Default probe: IP 
> presence", exception="java.io.IOException: sleep-0: IP is not available yet"
> 2019-07-16 18:22:37,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
> Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:22:37 
> KST 2019", outcome="failure", message="Failure in Default probe: IP 
> presence", exception="java.io.IOException: sleep-0: IP is not available yet"
> 2019-07-16 18:23:07,797 [pool-7-thread-1] INFO  monitor.ServiceMonitor - 
> Readiness check failed for sleep-0: Probe Status, time="Tue Jul 16 18:23:07 
> KST 2019", outcome="failure", message="Failure in Default probe: IP 
> presence", exception="java.io.IOException: sleep-0: IP is not available yet"
> 2019-07-16 18:23:08,225 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_000008] spec state state changed from 
> RUNNING_BUT_UNREADY -> FAILED_UPGRADE
> # request canceling upgrade 
> 2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager 
> - Upgrade container container_e62_1563179597798_0006_01_000004 true
> 2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager 
> - Upgrade container container_e62_1563179597798_0006_01_000003 true
> 2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager 
> - Upgrade container container_e62_1563179597798_0006_01_000008 true
> 2019-07-16 18:28:22,713 [Component  dispatcher] INFO  service.ServiceManager 
> - [SERVICE] spec state changed from UPGRADING -> CANCEL_UPGRADING
> 2019-07-16 18:28:22,713 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT sleep]: need upgrade to 1.0.0
> 2019-07-16 18:28:22,713 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-0 : 
> container_e62_1563179597798_0006_01_000008] spec state state changed from 
> FAILED_UPGRADE -> NEEDS_UPGRADE
> 2019-07-16 18:28:22,713 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT sleep] Transitioned from UPGRADING to CANCEL_UPGRADING on 
> CANCEL_UPGRADE event.
> 2019-07-16 18:28:22,713 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT sleep1]: need upgrade to 1.0.0
> 2019-07-16 18:28:22,714 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT sleep1] Transitioned from UPGRADING to CANCEL_UPGRADING on 
> CANCEL_UPGRADE event.
> 2019-07-16 18:28:22,714 [Component  dispatcher] INFO  
> instance.ComponentInstance - container_e62_1563179597798_0006_01_000004 
> nothing to cancel
> 2019-07-16 18:28:22,714 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-2 : 
> container_e62_1563179597798_0006_01_000004] spec state state changed from 
> NEEDS_UPGRADE -> READY
> 2019-07-16 18:28:22,714 [Component  dispatcher] INFO  
> instance.ComponentInstance - container_e62_1563179597798_0006_01_000003 
> nothing to cancel
> 2019-07-16 18:28:22,714 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE sleep-1 : 
> container_e62_1563179597798_0006_01_000003] spec state state changed from 
> NEEDS_UPGRADE -> READY
> 2019-07-16 18:28:22,714 [Component  dispatcher] ERROR 
> service.ServiceScheduler - No component instance exists for 
> container_e62_1563179597798_0006_01_000008
> {code}
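
The spec-state history of the failed container is easier to follow when it is pulled out of the wrapped log. The sketch below is one possible way to do that, not part of the patch. The function name spec_state_changes and the regular expression are assumptions based only on the message layout visible in this log, not on any documented log format.

```python
import re

# Assumed message layout, taken from the log excerpt in this issue:
#   [COMPINSTANCE <name> : <container id>] spec state state changed from A -> B
_CHANGE = re.compile(
    r"\[COMPINSTANCE (\S+) : (container_\w+)\] "
    r"spec state state changed from (\w+) -> (\w+)"
)


def spec_state_changes(log_text, container_id):
    """Return the (from_state, to_state) pairs logged for one container."""
    # Drop mail quote markers ("> ") and undo the hard line wrapping,
    # so that each log message becomes contiguous text again.
    lines = [re.sub(r"^>\s?", "", line) for line in log_text.splitlines()]
    flat = " ".join(" ".join(lines).split())
    return [
        (src, dst)
        for _name, cid, src, dst in _CHANGE.findall(flat)
        if cid == container_id
    ]
```

Run against the log above for container_e62_1563179597798_0006_01_000008, this should list the path NEEDS_UPGRADE -> UPGRADING -> RUNNING_BUT_UNREADY -> FAILED_UPGRADE -> NEEDS_UPGRADE. The final step comes just before the "No component instance exists" error.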



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
