[ 
https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16467675#comment-16467675
 ] 

Billie Rinaldi commented on YARN-8243:
--------------------------------------

bq. Isn't that what we want?
Yes, I'm saying that removing the highest ID first is what we want, but that is 
not what the code is doing now. The compareTo method is checking the container 
start time, when it should only be checking the component instance ID. The way 
it is coded now, we have a bug (an additional bug beyond the one reported 
here). If the container start time order is comp-0, comp-2, comp-1, and we flex 
down by one, comp-1 would get removed and the instanceIdCounter would be 
decremented from 3 to 2. If we then flexed up by one, we would end up with 
comp-0, comp-2, comp-2. We need to fix this; it's a small change in compareTo. 
I am saying this fix would also address the specific issue you were seeing, but 
it would not address all possible cases of running instances being removed 
before pending instances.
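
To illustrate the ordering problem described above, here is a minimal sketch 
that replays the comp-0, comp-2, comp-1 scenario with both orderings. The 
Instance class, the comparator names, and the start-time values are 
hypothetical stand-ins chosen for the example; this is not the actual 
ComponentInstance code, only a simulation under those assumptions.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical simulation; none of these names come from the YARN code base.
class FlexDownSketch {
  static final class Instance {
    final int id;          // component instance ID, e.g. 2 for comp-2
    final long startTime;  // container start time (arbitrary units)

    Instance(int id, long startTime) {
      this.id = id;
      this.startTime = startTime;
    }
  }

  // Ordering the comment describes as the current behaviour: by start time.
  static final Comparator<Instance> BY_START_TIME =
      Comparator.comparingLong((Instance i) -> i.startTime);

  // Ordering the comment proposes: by component instance ID only.
  static final Comparator<Instance> BY_ID =
      Comparator.comparingInt((Instance i) -> i.id);

  // Flex down by one (drop the last instance under the given ordering),
  // then flex up by one. Returns the resulting instance IDs, ascending.
  static List<Integer> flexDownThenUp(Comparator<Instance> order) {
    List<Instance> instances = new ArrayList<>();
    instances.add(new Instance(0, 100L));
    instances.add(new Instance(2, 200L));
    instances.add(new Instance(1, 300L)); // comp-1 started last
    int instanceIdCounter = 3;

    instances.sort(order);
    instances.remove(instances.size() - 1); // flex down by one
    instanceIdCounter--;                    // decremented from 3 to 2

    instances.add(new Instance(instanceIdCounter++, 400L)); // flex up by one

    List<Integer> ids = new ArrayList<>();
    for (Instance i : instances) {
      ids.add(i.id);
    }
    ids.sort(Comparator.naturalOrder());
    return ids;
  }

  public static void main(String[] args) {
    System.out.println(flexDownThenUp(BY_START_TIME)); // [0, 2, 2]
    System.out.println(flexDownThenUp(BY_ID));         // [0, 1, 2]
  }
}
```

With the start-time ordering the sketch ends up with a duplicate ID 2, while 
the ID-only ordering keeps the IDs contiguous and unique.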

I do not think we should remove pending instances, as proposed in this patch, 
because it will cause "holes" in the component ID list. If we have comp-0, 
comp-1, comp-2 and comp-1 is pending, when we flex down comp-1 would be removed 
and we would be left with comp-0 and comp-2. If we flexed up, we would then 
have comp-0, comp-2, comp-3. I think we should always remove the instance with 
the highest ID.
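
The "holes" scenario can be sketched the same way. This is a minimal 
simulation, not the service AM code: the class and method names are 
hypothetical, and it assumes (as the outcome described above implies) that 
the ID counter is left untouched when a pending instance is removed.

```java
import java.util.TreeSet;

// Hypothetical simulation; none of these names come from the YARN code base.
class PendingRemovalSketch {
  // Start with comp-0, comp-1, comp-2 where comp-1 is pending, flex down
  // by one, then flex up by one. Returns the resulting instance IDs.
  static TreeSet<Integer> flexDownThenUp(boolean removePendingFirst) {
    TreeSet<Integer> ids = new TreeSet<>();
    ids.add(0);
    ids.add(1);
    ids.add(2);
    int pendingId = 1; // comp-1 has no container yet
    int nextId = 3;    // next ID to hand out on flex up

    if (removePendingFirst) {
      // Patch behaviour as described in the comment: drop the pending one.
      // Assumption: the counter is not decremented in this case.
      ids.remove(pendingId);
    } else {
      // Suggested behaviour: always drop the highest ID and reuse it later.
      ids.remove(ids.last());
      nextId--;
    }

    ids.add(nextId); // flex up by one
    return ids;
  }

  public static void main(String[] args) {
    System.out.println(flexDownThenUp(true));  // [0, 2, 3]
    System.out.println(flexDownThenUp(false)); // [0, 1, 2]
  }
}
```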
 

> Flex down should first remove pending container requests (if any) and then 
> kill running containers
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8243
>                 URL: https://issues.apache.org/jira/browse/YARN-8243
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn-native-services
>    Affects Versions: 3.1.0
>            Reporter: Gour Saha
>            Assignee: Gour Saha
>            Priority: Major
>         Attachments: YARN-8243.01.patch
>
>
> This is easy to test on a service with an anti-affinity component, to simulate 
> pending container requests. It can also be simulated by other means (no 
> resources left in the cluster, etc.).
> Service yarnfile used to test this -
> {code:java}
> {
>   "name": "sleeper-service",
>   "version": "1",
>   "components" :
>   [
>     {
>       "name": "ping",
>       "number_of_containers": 2,
>       "resource": {
>         "cpus": 1,
>         "memory": "256"
>       },
>       "launch_command": "sleep 9000",
>       "placement_policy": {
>         "constraints": [
>           {
>             "type": "ANTI_AFFINITY",
>             "scope": "NODE",
>             "target_tags": [
>               "ping"
>             ]
>           }
>         ]
>       }
>     }
>   ]
> }
> {code}
> Launch a service with the above yarnfile as below -
> {code:java}
> yarn app -launch simple-aa-1 simple_AA.json
> {code}
> Let's assume there are only 5 nodes in this cluster. Now, flex the above 
> service to one more container than the number of nodes (6 in my case).
> {code:java}
> yarn app -flex simple-aa-1 -component ping 6
> {code}
> Only 5 containers will be allocated and running for simple-aa-1. At this 
> point, flex it down to 5 containers -
> {code:java}
> yarn app -flex simple-aa-1 -component ping 5
> {code}
> This is what is seen in the serviceam log at this point -
> {noformat}
> 2018-05-03 20:17:38,469 [IPC Server handler 0 on 38124] INFO  
> service.ClientAMService - Flexing component ping to 5
> 2018-05-03 20:17:38,469 [Component  dispatcher] INFO  component.Component - 
> [FLEX DOWN COMPONENT ping]: scaling down from 6 to 5
> 2018-05-03 20:17:38,470 [Component  dispatcher] INFO  
> instance.ComponentInstance - [COMPINSTANCE ping-4 : 
> container_1525297086734_0013_01_000006]: Flexed down by user, destroying.
> 2018-05-03 20:17:38,473 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT ping] Transitioned from FLEXING to STABLE on FLEX event.
> 2018-05-03 20:17:38,474 [pool-5-thread-8] INFO  
> registry.YarnRegistryViewForProviders - [COMPINSTANCE ping-4 : 
> container_1525297086734_0013_01_000006]: Deleting registry path 
> /users/root/services/yarn-service/simple-aa-1/components/ctr-1525297086734-0013-01-000006
> 2018-05-03 20:17:38,476 [Component  dispatcher] ERROR component.Component - 
> [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> CHECK_STABLE at STABLE
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>       at 
> org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>       at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,480 [Component  dispatcher] ERROR component.Component - 
> [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> CHECK_STABLE at STABLE
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>       at 
> org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>       at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,578 [pool-5-thread-8] INFO  instance.ComponentInstance - 
> [COMPINSTANCE ping-4 : container_1525297086734_0013_01_000006]: Deleted 
> component instance dir: 
> hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-4
> 2018-05-03 20:17:39,268 [AMRM Callback Handler Thread] WARN  
> service.ServiceScheduler - Container container_1525297086734_0013_01_000006 
> Completed. No component instance exists. exitStatus=-100. 
> diagnostics=Container released by application 
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO  
> service.ServiceScheduler - 1 containers allocated. 
> 2018-05-03 20:17:40,273 [AMRM Callback Handler Thread] INFO  
> service.ServiceScheduler - [COMPONENT ping]: remove 0 outstanding container 
> requests for allocateId 0
> 2018-05-03 20:17:40,274 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT ping]: container_1525297086734_0013_01_000007 allocated, num 
> pending component instances reduced to 0
> 2018-05-03 20:17:40,274 [Component  dispatcher] INFO  component.Component - 
> [COMPONENT ping]: Assigned container_1525297086734_0013_01_000007 to 
> component instance ping-5 and launch on host 
> ctr-e138-1518143905142-280820-01-000008.example.site:25454 
> 2018-05-03 20:17:40,277 [pool-6-thread-6] INFO  provider.ProviderUtils - 
> [COMPINSTANCE ping-5 : container_1525297086734_0013_01_000007]: Creating dir 
> on hdfs: 
> hdfs://ctr-e138-1518143905142-280820-01-000003.example.site:8020/user/root/.yarn/services/simple-aa-1/components/1/ping/ping-5
> 2018-05-03 20:17:40,316 [pool-6-thread-6] INFO  
> containerlaunch.ContainerLaunchService - launching container 
> container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,318 
> [org.apache.hadoop.yarn.client.api.async.impl.NMClientAsyncImpl #5] INFO  
> impl.NMClientAsyncImpl - Processing Event EventType: START_CONTAINER for 
> Container container_1525297086734_0013_01_000007
> 2018-05-03 20:17:40,338 [Component  dispatcher] ERROR component.Component - 
> [COMPONENT ping]: Invalid event CONTAINER_STARTED at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: 
> CONTAINER_STARTED at STABLE
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>       at 
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>       at 
> org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>       at 
> org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>       at 
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>       at java.lang.Thread.run(Thread.java:745)
> {noformat}
> Status response shows that only 4 containers are running and the service is 
> not in STABLE state -
> {code:java}
> yarn app -status simple-aa-1
> {code}
> output -
> {code:java}
> {
>     "components": [
>         {
>             "configuration": {
>                 "env": {},
>                 "files": [],
>                 "properties": {}
>             },
>             "containers": [
>                 {
>                     "bare_host": 
> "ctr-e138-1518143905142-280820-01-000007.example.site",
>                     "component_instance_name": "ping-1",
>                     "hostname": 
> "ctr-e138-1518143905142-280820-01-000007.example.site",
>                     "id": "container_1525297086734_0013_01_000003",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378141535,
>                     "state": "READY"
>                 },
>                 {
>                     "bare_host": 
> "ctr-e138-1518143905142-280820-01-000006.example.site",
>                     "component_instance_name": "ping-0",
>                     "hostname": 
> "ctr-e138-1518143905142-280820-01-000006.example.site",
>                     "id": "container_1525297086734_0013_01_000002",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378141513,
>                     "state": "READY"
>                 },
>                 {
>                     "bare_host": 
> "ctr-e138-1518143905142-280820-01-000005.example.site",
>                     "component_instance_name": "ping-3",
>                     "hostname": 
> "ctr-e138-1518143905142-280820-01-000005.example.site",
>                     "id": "container_1525297086734_0013_01_000005",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378303429,
>                     "state": "READY"
>                 },
>                 {
>                     "bare_host": 
> "ctr-e138-1518143905142-280820-01-000004.example.site",
>                     "component_instance_name": "ping-2",
>                     "hostname": 
> "ctr-e138-1518143905142-280820-01-000004.example.site",
>                     "id": "container_1525297086734_0013_01_000004",
>                     "ip": "x.x.x.x",
>                     "launch_time": 1525378303425,
>                     "state": "READY"
>                 }
>             ],
>             "dependencies": [],
>             "launch_command": "sleep 9000",
>             "name": "ping",
>             "number_of_containers": 5,
>             "placement_policy": {
>                 "constraints": [
>                     {
>                         "node_attributes": {},
>                         "node_partitions": [],
>                         "scope": "NODE",
>                         "target_tags": [
>                             "ping"
>                         ],
>                         "type": "ANTI_AFFINITY"
>                     }
>                 ]
>             },
>             "quicklinks": [],
>             "resource": {
>                 "additional": {},
>                 "cpus": 1,
>                 "memory": "256"
>             },
>             "run_privileged_container": false,
>             "state": "FLEXING"
>         }
>     ],
>     "configuration": {
>         "env": {},
>         "files": [],
>         "properties": {}
>     },
>     "id": "application_1525297086734_0013",
>     "kerberos_principal": {},
>     "lifetime": -1,
>     "name": "simple-aa-1",
>     "quicklinks": {},
>     "state": "STARTED",
>     "version": "1"
> }
> {code}



