[ https://issues.apache.org/jira/browse/YARN-7939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16437696#comment-16437696 ]

Eric Yang commented on YARN-7939:
---------------------------------

[~csingh] With a tarball application, patch 007 shows this diagnostics info when 
triggering instances to upgrade:
{code}
AM Container for appattempt_1523637126365_0003_000001 exited with exitCode: -100
Failing this attempt.
Diagnostics: Container released on a *lost* node
For more detailed output, check the application tracking page:
http://eyang-2.openstacklocal:8088/cluster/app/application_1523637126365_0003
Then click on links to logs of each attempt.
{code}
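
For context, the sequence I ran to get here looks roughly like the following. The spec path is illustrative, and the exact -initiate/-instances syntax is my reading of this patch and YARN-7512, so treat it as an assumption:

{code}
# Initiate the upgrade with the v2 service spec (spec path is illustrative)
$ ./bin/yarn app -upgrade abc -initiate /tmp/abc-v2.json

# Trigger the in-place upgrade of an individual component instance
$ ./bin/yarn app -upgrade abc -instances ping-0
{code}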

The AM log file shows:
{code}
2018-04-13 17:04:53,019 [AMRM Callback Handler Thread] WARN  
service.ServiceScheduler - Nodes updated info: 
eyang-3.openstacklocal:49921, state = UNHEALTHY, healthDiagnostics = Linux 
Container Executor reached unrecoverable exception

2018-04-13 17:04:53,022 [AMRM Callback Handler Thread] WARN  
service.ServiceScheduler - Container container_1523637126365_0003_01_000002 
Completed. No component instance exists. exitStatus=-100. diagnostics=Container 
released on a *lost* node 
2018-04-13 17:04:53,041 [main] INFO  registry.YarnRegistryViewForProviders - 
Resolving path 
/users/hbase/services/yarn-service/abc/components/ctr-1523637126365-0003-01-000002
2018-04-13 17:04:53,043 [main] INFO  service.ServiceScheduler - Handling 
container_1523637126365_0003_01_000003 from previous attempt
2018-04-13 17:04:53,045 [main] INFO  component.Component - [COMPONENT ping]: 
Recovered container_1523637126365_0003_01_000003 for component instance ping-1 
on host eyang-4.openstacklocal:49039, num pending component instances reduced 
to 1 
2018-04-13 17:04:53,047 [main] INFO  service.ServiceScheduler - Triggering 
initial evaluation of component ping
2018-04-13 17:04:53,047 [main] INFO  component.Component - [INIT COMPONENT 
ping]: 2 instances.
2018-04-13 17:04:53,047 [main] INFO  component.Component - [COMPONENT ping] 
Requesting for 0 container(s)
2018-04-13 17:04:53,048 [main] INFO  component.Component - [COMPONENT ping] 
Transitioned from INIT to FLEXING on FLEX event.
2018-04-13 17:04:53,049 [pool-5-thread-1] INFO  service.ServiceScheduler - 
Registered service under /users/hbase/services/yarn-service/abc; absolute path 
/registry/users/hbase/services/yarn-service/abc
2018-04-13 17:04:53,062 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE ping-1 : 
container_1523637126365_0003_01_000003] Transitioned from INIT to STARTED on 
START event
2018-04-13 17:04:53,127 [pool-5-thread-1] INFO  instance.ComponentInstance - 
[COMPINSTANCE ping-1 : container_1523637126365_0003_01_000003] IP = 
[172.26.111.21], host = eyang-4.openstacklocal, cancel container status 
retriever
2018-04-13 17:05:23,055 [pool-7-thread-1] INFO  instance.ComponentInstance - 
[COMPINSTANCE ping-1 : container_1523637126365_0003_01_000003] Transitioned 
from STARTED to READY on BECOME_READY event
2018-04-13 17:06:53,055 [pool-5-thread-3] INFO  service.ServiceScheduler - 
[COMPINSTANCE ping-0], wait on container container_1523637126365_0003_01_000002 
expired
2018-04-13 17:06:53,056 [pool-5-thread-3] INFO  
registry.YarnRegistryViewForProviders - [COMPINSTANCE ping-0]: Deleting 
registry path 
/users/hbase/services/yarn-service/abc/components/ctr-1523637126365-0003-01-000002
2018-04-13 17:06:53,088 [pool-5-thread-3] INFO  component.Component - 
[COMPONENT ping] Requesting for 1 container(s)
2018-04-13 17:06:53,099 [pool-5-thread-3] INFO  component.Component - 
[COMPONENT ping] Submitting container request : Capability[<memory:256, 
vCores:1>]Priority[0]AllocationRequestId[0]ExecutionTypeRequest[{Execution 
Type: GUARANTEED, Enforce Execution Type: false}]Resource Profile[null]
2018-04-13 17:06:54,507 [AMRM Callback Handler Thread] INFO  
service.ServiceScheduler - 1 containers allocated. 
2018-04-13 17:06:54,508 [Component  dispatcher] INFO  component.Component - 
[COMPONENT ping]: container_1523637126365_0003_02_000002 allocated, num pending 
component instances reduced to 0
2018-04-13 17:06:54,508 [Component  dispatcher] INFO  component.Component - 
[COMPONENT ping]: Assigned container_1523637126365_0003_02_000002 to component 
instance ping-0 and launch on host eyang-4.openstacklocal:49039 
2018-04-13 17:06:54,509 [AMRM Callback Handler Thread] INFO  
service.ServiceScheduler - [COMPONENT ping]: remove 1 outstanding container 
requests for allocateId 0
2018-04-13 17:06:54,546 [pool-6-thread-1] INFO  provider.ProviderUtils - 
Component instance conf dir already exists: 
hdfs://eyang-1.openstacklocal:9000/user/hbase/.yarn/services/abc/components/v1/ping/ping-0
2018-04-13 17:06:54,551 [pool-6-thread-1] INFO  
containerlaunch.ContainerLaunchService - launching container 
container_1523637126365_0003_02_000002
2018-04-13 17:06:54,561 
[org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #0] INFO  
impl.NMClientAsyncImpl - Processing Event EventType: START_CONTAINER for 
Container container_1523637126365_0003_02_000002
2018-04-13 17:06:54,600 [Component  dispatcher] INFO  
instance.ComponentInstance - [COMPINSTANCE ping-0 : 
container_1523637126365_0003_02_000002] Transitioned from INIT to STARTED on 
START event
2018-04-13 17:06:56,647 [pool-5-thread-3] INFO  instance.ComponentInstance - 
[COMPINSTANCE ping-0 : container_1523637126365_0003_02_000002] IP = 
[172.26.111.21], host = eyang-4.openstacklocal, cancel container status 
retriever
2018-04-13 17:07:06,138 [Socket Reader #1 for port 44836] INFO  ipc.Server - 
Auth successful for rm/eyang-2.openstacklo...@example.com (auth:KERBEROS)
2018-04-13 17:07:06,171 [Socket Reader #1 for port 44836] INFO  
authorize.ServiceAuthorizationManager - Authorization successful for hbase 
(auth:PROXY) via rm/eyang-2.openstacklo...@example.com (auth:KERBEROS) for 
protocol=interface org.apache.hadoop.yarn.service.ClientAMProtocol
2018-04-13 17:07:23,049 [pool-7-thread-1] INFO  component.Component - 
[COMPONENT ping] state changed from FLEXING -> STABLE
2018-04-13 17:07:23,050 [pool-7-thread-1] INFO  service.ServiceMaster - Service 
state changed from STARTED -> STABLE
2018-04-13 17:07:23,050 [pool-7-thread-1] INFO  instance.ComponentInstance - 
[COMPINSTANCE ping-0 : container_1523637126365_0003_02_000002] Transitioned 
from STARTED to READY on BECOME_READY event
2018-04-13 17:07:23,050 [Component  dispatcher] INFO  component.Component - 
[COMPONENT ping] Transitioned from FLEXING to STABLE on CHECK_STABLE event.
{code}

The JSON stored in HDFS at .yarn/services/abc/upgrade/v2/abc.json has the wrong 
component state, FLEXING, even though the AM has updated the state from FLEXING 
to STABLE. I am unable to finalize the upgrade; it fails with this error message:

{code}
$ ./bin/yarn app -upgrade abc -finalize
2018-04-13 17:23:08,321 ERROR client.ApiServiceClient: Failed to start service 
abc, because it already exists.
{code}
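
The stale state can be confirmed directly against the persisted spec. A quick check (the fully qualified HDFS URI is assumed from the conf-dir lines in the AM log above):

{code}
# The component state in this file still reads FLEXING.
$ hdfs dfs -cat hdfs://eyang-1.openstacklocal:9000/user/hbase/.yarn/services/abc/upgrade/v2/abc.json
{code}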

I think the state in HDFS is out of sync with the AM because the AM logic does 
not automatically update the copy in .yarn/services/abc/abc.json under some 
conditions. This looks like a bug to me. The upgrade logic persisted the spec 
file to .yarn/services/abc/upgrade/v2/abc.json using the copy in HDFS instead 
of the AM's in-memory state, and this can cause problems: the component state 
shows "FLEXING" in .yarn/services/abc/upgrade/v2/abc.json because the stale 
HDFS data is merged with the upgrade spec and then persisted to disk. The AM 
should instead write its in-memory state, merged with the upgrade request, to 
HDFS, to ensure that the latest state is persisted.
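
To make the merge direction concrete, here is a minimal, self-contained Java sketch. It is not the yarn-service code, and every name in it is illustrative; it only uses Jackson to show that overlaying the upgrade request's definition changes on top of the AM's in-memory document preserves runtime fields such as the component state:

{code}
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

// Hypothetical illustration, not yarn-service code: merge the upgrade
// request onto the AM's in-memory spec so runtime state is preserved.
public class SpecMergeSketch {
  public static void main(String[] args) throws Exception {
    ObjectMapper mapper = new ObjectMapper();

    // The AM's in-memory view: the component has already reached STABLE.
    ObjectNode live = (ObjectNode) mapper.readTree(
        "{\"name\":\"abc\",\"version\":\"v1\",\"state\":\"STABLE\"}");

    // The upgrade request: definition changes only (here, the new version).
    ObjectNode upgrade = (ObjectNode) mapper.readTree(
        "{\"version\":\"v2\"}");

    // Overlay the definition changes onto the live view; the runtime
    // state held in memory survives the merge.
    live.setAll(upgrade);

    // Persist this merged document, not the stale abc.json read from HDFS.
    System.out.println(mapper.writeValueAsString(live));
    // => {"name":"abc","version":"v2","state":"STABLE"}
  }
}
{code}

Persisting from the live view rather than re-reading abc.json keeps the upgrade spec consistent with the AM, and -finalize should then succeed once the component reaches STABLE.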

> Yarn Service Upgrade: add support to upgrade a component instance 
> ------------------------------------------------------------------
>
>                 Key: YARN-7939
>                 URL: https://issues.apache.org/jira/browse/YARN-7939
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>         Attachments: YARN-7939.001.patch, YARN-7939.002.patch, 
> YARN-7939.003.patch, YARN-7939.004.patch, YARN-7939.005.patch, 
> YARN-7939.006.patch, YARN-7939.007.patch
>
>
> Yarn core supports in-place upgrade of containers. A yarn service can 
> leverage that to provide in-place upgrade of component instances. Please see 
> YARN-7512 for details.
> Will add support to upgrade a single component instance first and then 
> iteratively add other APIs and features.
>  


