[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471269#comment-16471269 ]

genericqa commented on YARN-8243:
---------------------------------

+1 overall

|| Vote || Subsystem || Runtime || Comment ||
| 0 | reexec | 0m 27s | Docker mode activated. |
|| Prechecks ||
| +1 | @author | 0m 0s | The patch does not contain any @author tags. |
| +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
|| trunk Compile Tests ||
| +1 | mvninstall | 22m 58s | trunk passed |
| +1 | compile | 0m 26s | trunk passed |
| +1 | checkstyle | 0m 18s | trunk passed |
| +1 | mvnsite | 0m 28s | trunk passed |
| +1 | shadedclient | 10m 3s | branch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 36s | trunk passed |
| +1 | javadoc | 0m 17s | trunk passed |
|| Patch Compile Tests ||
| +1 | mvninstall | 0m 26s | the patch passed |
| +1 | compile | 0m 21s | the patch passed |
| +1 | javac | 0m 21s | the patch passed |
| +1 | checkstyle | 0m 14s | the patch passed |
| +1 | mvnsite | 0m 23s | the patch passed |
| +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
| +1 | shadedclient | 10m 27s | patch has no errors when building and testing our client artifacts. |
| +1 | findbugs | 0m 42s | the patch passed |
| +1 | javadoc | 0m 15s | the patch passed |
|| Other Tests ||
| +1 | unit | 10m 55s | hadoop-yarn-services-core in the patch passed. |
| +1 | asflicense | 0m 23s | The patch does not generate ASF License warnings. |
| | | 59m 56s | |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8243 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12922924/YARN-8243.02.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 589f60b31c85 4.4.0-89-generic #112-Ubuntu SMP Mon Jul 31 19:38:41 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 7369f41 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_162 |
| findbugs | v3.1.0-RC1 |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/20692/testReport/ |
| Max. process+thread count | 890 (vs. ulimit of 1) |
| modules | C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core |
| Console output | https://builds.apache.org/job/PreCommit-YARN-Build/20692/console |
| Powered by | Apache Yetus 0.8.0-SNAPSHOT http://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16471196#comment-16471196 ]

Gour Saha commented on YARN-8243:
---------------------------------

Ah, I know what you are saying [~billie.rinaldi]. I modified the patch accordingly. Please review.

> Flex down should first remove pending container requests (if any) and then
> kill running containers
> --------------------------------------------------------------------------
>
>                 Key: YARN-8243
>                 URL: https://issues.apache.org/jira/browse/YARN-8243
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn-native-services
>    Affects Versions: 3.1.0
>            Reporter: Gour Saha
>            Assignee: Gour Saha
>            Priority: Major
>         Attachments: YARN-8243.01.patch
>
> This is easy to test on a service with anti-affinity component, to simulate
> pending container requests. It can be simulated by other means also (no
> resource left in cluster, etc.).
> Service yarnfile used to test this -
> {code:java}
> {
>   "name": "sleeper-service",
>   "version": "1",
>   "components": [
>     {
>       "name": "ping",
>       "number_of_containers": 2,
>       "resource": {
>         "cpus": 1,
>         "memory": "256"
>       },
>       "launch_command": "sleep 9000",
>       "placement_policy": {
>         "constraints": [
>           {
>             "type": "ANTI_AFFINITY",
>             "scope": "NODE",
>             "target_tags": [ "ping" ]
>           }
>         ]
>       }
>     }
>   ]
> }
> {code}
> Launch a service with the above yarnfile as below -
> {code:java}
> yarn app -launch simple-aa-1 simple_AA.json
> {code}
> Let's assume there are only 5 nodes in this cluster. Now, flex the above
> service to 1 extra container than the number of nodes (6 in my case).
> {code:java}
> yarn app -flex simple-aa-1 -component ping 6
> {code}
> Only 5 containers will be allocated and running for simple-aa-1. At this
> point, flex it down to 5 containers -
> {code:java}
> yarn app -flex simple-aa-1 -component ping 5
> {code}
> This is what is seen in the serviceam log at this point -
> {noformat}
> 2018-05-03 20:17:38,469 [IPC Server handler 0 on 38124] INFO service.ClientAMService - Flexing component ping to 5
> 2018-05-03 20:17:38,469 [Component dispatcher] INFO component.Component - [FLEX DOWN COMPONENT ping]: scaling down from 6 to 5
> 2018-05-03 20:17:38,470 [Component dispatcher] INFO instance.ComponentInstance - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_06]: Flexed down by user, destroying.
> 2018-05-03 20:17:38,473 [Component dispatcher] INFO component.Component - [COMPONENT ping] Transitioned from FLEXING to STABLE on FLEX event.
> 2018-05-03 20:17:38,474 [pool-5-thread-8] INFO registry.YarnRegistryViewForProviders - [COMPINSTANCE ping-4 : container_1525297086734_0013_01_06]: Deleting registry path /users/root/services/yarn-service/simple-aa-1/components/ctr-1525297086734-0013-01-06
> 2018-05-03 20:17:38,476 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>         at org.apache.hadoop.yarn.service.component.Component.handle(Component.java:913)
>         at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:574)
>         at org.apache.hadoop.yarn.service.ServiceScheduler$ComponentEventHandler.handle(ServiceScheduler.java:563)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>         at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>         at java.lang.Thread.run(Thread.java:745)
> 2018-05-03 20:17:38,480 [Component dispatcher] ERROR component.Component - [COMPONENT ping]: Invalid event CHECK_STABLE at STABLE
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CHECK_STABLE at STABLE
>         at org.apache.hadoop.yarn.state.StateMachineFactory$MultipleInternalArc.doTransition(StateMachineFactory.java:388)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> {noformat}
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469368#comment-16469368 ]

Billie Rinaldi commented on YARN-8243:
--------------------------------------

I strongly feel we should not make the changes to FlexComponentTransition suggested in this patch, because it will create holes. We should only make the compareTo fix, which will still allow the new unit test to pass.
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469360#comment-16469360 ]

Gour Saha commented on YARN-8243:
---------------------------------

bq. The only situation the compareTo fix would not address is when comp0 and comp2 have containers assigned and comp1 is pending. A flex down in that case would kill comp2 first. I believe this is the behavior we want in order to prevent holes.

I don't think this can happen after this patch and the compareTo fix. Am I missing something?
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469120#comment-16469120 ]

Billie Rinaldi commented on YARN-8243:
--------------------------------------

bq. Let's say a service owner flexes up her service by 2 extra containers. The containers don't get allocated for a long time (because of no resource left in the cluster or something else). Meanwhile, the service owner decides to go down back to the original number of instances because she does not need those 2 additional containers. At this point, she expects no change to her service but then sees 2 containers killed. It will not be a desirable experience.

That won't happen after we fix the compareTo method. The 2 new instances will be the ones that are removed, because they will have the highest IDs. The highest IDs will be removed first regardless of whether they are pending or running. The only situation the compareTo fix would not address is when comp0 and comp2 have containers assigned and comp1 is pending. A flex down in that case would kill comp2 first. I believe this is the behavior we want in order to prevent holes.
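The ordering the compareTo fix is aiming for can be sketched as follows. This is a minimal illustration only, not the actual ComponentInstance source: the class and method names here are hypothetical stand-ins. The point is that instances sort strictly by instance ID rather than by container start time, so a flex down always peels off the highest IDs first, whether those instances are pending or running.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class InstanceOrdering {
  // Hypothetical stand-in for a component instance; not the Hadoop class.
  static class Instance implements Comparable<Instance> {
    final int id;                 // component instance ID, e.g. ping-4 -> 4
    final long containerStartTime; // retained, but no longer used for ordering

    Instance(int id, long containerStartTime) {
      this.id = id;
      this.containerStartTime = containerStartTime;
    }

    @Override
    public int compareTo(Instance other) {
      // The fix discussed above: compare only the instance ID, so an
      // earlier-started container with a higher ID still sorts last.
      return Integer.compare(this.id, other.id);
    }
  }

  // On flex down, select the instances with the highest IDs for removal.
  static List<Integer> idsToRemove(List<Instance> instances, int removeCount) {
    List<Instance> sorted = new ArrayList<>(instances);
    sorted.sort(Collections.reverseOrder()); // highest instance ID first
    List<Integer> ids = new ArrayList<>();
    for (int i = 0; i < removeCount && i < sorted.size(); i++) {
      ids.add(sorted.get(i).id);
    }
    return ids;
  }
}
```

Under this ordering, flexing up by 2 and immediately flexing back down removes exactly the 2 newest (highest-ID) instances, matching the behavior described in the comment above.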
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469022#comment-16469022 ]

Gour Saha commented on YARN-8243:
---------------------------------

bq. component.pendingInstances.remove(instance); on line 358 may not be necessary since either delta will be 0 or there will be no more pending instances?

[~suma.shivaprasad], good point. I will make the change in the next patch.
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16469020#comment-16469020 ]

Gour Saha commented on YARN-8243:
---------------------------------

bq. The compareTo method is checking the container start time, when it should only be checking the component instance ID.

[~billie.rinaldi], I agree with this. We should make compareTo explicitly check the instance ID. Fortunately, given the way component instance objects are created and allocated containers are then assigned, it looks like the start-time order is the same as the instance-ID order.

bq. I do not think we should remove pending instances as is proposed in this patch because it will cause "holes" in the component ID list. If we have comp-0, comp-1, comp-2 and comp-1 is pending, when we flex down comp-1 would be removed and we would be left with comp-0 and comp-2. If we flexed up, we would then have comp-0, comp-2, comp-3. I think we should always remove the instance with the highest ID.

Once we make the compareTo change as suggested above, this situation will not occur. Nevertheless, even if we couldn't programmatically fix the holes issue, we should not remove a running instance when there is a pending instance. Let's say a service owner flexes up her service by 2 extra containers. The containers don't get allocated for a long time (because no resource is left in the cluster, or something else). Meanwhile, the service owner decides to go back down to the original number of instances because she does not need those 2 additional containers. At this point, she expects no change to her service, but then sees 2 containers killed. That would not be a desirable experience.
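The "holes" scenario quoted in the comment above (comp-0, comp-1, comp-2 with comp-1 pending) can be shown with a tiny, purely hypothetical simulation in which integer IDs stand in for component instances. None of these methods exist in the Hadoop source; they only model the two removal policies being debated.

```java
import java.util.Collection;
import java.util.TreeSet;

public class HoleDemo {
  // Policy A (the patch's proposal): remove a specific (pending) instance ID.
  static TreeSet<Integer> removeId(Collection<Integer> ids, int id) {
    TreeSet<Integer> out = new TreeSet<>(ids);
    out.remove(id);
    return out;
  }

  // Policy B (always remove the highest ID, as the compareTo fix enforces).
  static TreeSet<Integer> removeHighest(Collection<Integer> ids) {
    TreeSet<Integer> out = new TreeSet<>(ids);
    out.pollLast();
    return out;
  }

  // Flex up: new instances take the next IDs after the current maximum.
  static TreeSet<Integer> flexUp(Collection<Integer> ids, int count) {
    TreeSet<Integer> out = new TreeSet<>(ids);
    int next = out.isEmpty() ? 0 : out.last() + 1;
    for (int i = 0; i < count; i++) {
      out.add(next + i);
    }
    return out;
  }
}
```

Starting from {0, 1, 2} with instance 1 pending: removing the pending instance and then flexing up by 1 yields {0, 2, 3}, leaving a permanent gap at ID 1, whereas removing the highest ID and flexing up yields {0, 1, 2} again, keeping the sequence contiguous. That is the hole Billie's comment is guarding against.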
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16468104#comment-16468104 ] Suma Shivaprasad commented on YARN-8243:

[~gsaha] component.pendingInstances.remove(instance); on line 358 may not be necessary since either delta will be 0 or there will be no more pending instances?
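The ordering this ticket's title asks for — cancel pending container requests first, and only then kill running containers — can be sketched with plain Java collections. `FlexDownSketch` and its fields are hypothetical stand-ins for the service AM's pending/running instance lists, not the actual Hadoop classes:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Minimal sketch of the flex-down ordering under discussion: cancel pending
// (never-allocated) container requests first, and only then destroy running
// containers. Names are illustrative, not the real service AM code.
class FlexDownSketch {
    final Deque<String> pendingInstances = new ArrayDeque<>();
    final List<String> runningInstances = new ArrayList<>();

    /** Flex down by delta instances; returns the running instances destroyed. */
    List<String> flexDown(int delta) {
        List<String> destroyed = new ArrayList<>();
        // 1. Cancel outstanding container requests first; nothing is killed here.
        while (delta > 0 && !pendingInstances.isEmpty()) {
            pendingInstances.removeLast(); // drop the most recently requested instance
            delta--;
        }
        // 2. Only if more instances must still go, kill running containers,
        //    removing the most recently added (highest-ID) instance first.
        while (delta > 0 && !runningInstances.isEmpty()) {
            destroyed.add(runningInstances.remove(runningInstances.size() - 1));
            delta--;
        }
        return destroyed;
    }
}
```

With 5 running instances and 1 still pending, as in the description's example, flexing down by 1 would then cancel only the pending request and destroy nothing.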
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467675#comment-16467675 ] Billie Rinaldi commented on YARN-8243:

bq. Isn't that what we want?

Yes, I'm saying that removing the highest ID first is what we want, but that is not what the code is doing now. The compareTo method is checking the container start time, when it should only be checking the component instance ID. The way it is coded now, we have a bug (an additional bug beyond the one reported here). If the container start time order is comp-0, comp-2, comp-1, and we flex down by one, comp-1 would get removed and the instanceIdCounter would be decremented from 3 to 2. If we then flexed up by one, we would end up with comp-0, comp-2, comp-2. We need to fix this; it's a small change in compareTo. I am saying this fix would also address the specific issue you were seeing, but it would not address all possible cases of running instances being removed before pending instances.

I do not think we should remove pending instances as is proposed in this patch because it will cause "holes" in the component ID list. If we have comp-0, comp-1, comp-2 and comp-1 is pending, when we flex down comp-1 would be removed and we would be left with comp-0 and comp-2. If we flexed up, we would then have comp-0, comp-2, comp-3. I think we should always remove the instance with the highest ID.
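The small compareTo change described here can be sketched with a simplified stand-in class (the real ComponentInstance carries much more state; `Instance` and `pickForRemoval` are illustrative only): order instances by component instance ID alone, ignoring container start time, so flex down always selects the highest-ID instance.

```java
import java.util.TreeSet;

// Simplified stand-in illustrating the compareTo fix under discussion:
// order by component instance ID only, never by container start time.
class InstanceOrdering {
    static class Instance implements Comparable<Instance> {
        final int id;               // component instance ID (the n in ping-n)
        final long containerStart;  // irrelevant to ordering after the fix

        Instance(int id, long containerStart) {
            this.id = id;
            this.containerStart = containerStart;
        }

        @Override
        public int compareTo(Instance other) {
            // Fixed ordering: compare instance IDs only.
            return Integer.compare(id, other.id);
        }
    }

    /** The instance flex down should destroy first: the one with the highest ID. */
    static Instance pickForRemoval(TreeSet<Instance> instances) {
        return instances.last();
    }
}
```

With start-time order comp-0, comp-2, comp-1, this ordering still removes comp-2 first, so decrementing the instanceIdCounter from 3 to 2 stays safe.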
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16467033#comment-16467033 ] genericqa commented on YARN-8243:

| (/) *{color:green}+1 overall{color}* |

|| Vote || Subsystem || Runtime || Comment ||
| {color:blue}0{color} | {color:blue} reexec {color} | {color:blue} 0m 23s{color} | {color:blue} Docker mode activated. {color} |
|| || || || {color:brown} Prechecks {color} ||
| {color:green}+1{color} | {color:green} @author {color} | {color:green} 0m 0s{color} | {color:green} The patch does not contain any @author tags. {color} |
| {color:green}+1{color} | {color:green} test4tests {color} | {color:green} 0m 0s{color} | {color:green} The patch appears to include 1 new or modified test files. {color} |
|| || || || {color:brown} trunk Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 25m 8s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 29s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} checkstyle {color} | {color:green} 0m 19s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 33s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 11m 38s{color} | {color:green} branch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 38s{color} | {color:green} trunk passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 15s{color} | {color:green} trunk passed {color} |
|| || || || {color:brown} Patch Compile Tests {color} ||
| {color:green}+1{color} | {color:green} mvninstall {color} | {color:green} 0m 27s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} compile {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javac {color} | {color:green} 0m 23s{color} | {color:green} the patch passed {color} |
| {color:orange}-0{color} | {color:orange} checkstyle {color} | {color:orange} 0m 13s{color} | {color:orange} hadoop-yarn-project/hadoop-yarn/hadoop-yarn-applications/hadoop-yarn-services/hadoop-yarn-services-core: The patch generated 1 new + 31 unchanged - 0 fixed = 32 total (was 31) {color} |
| {color:green}+1{color} | {color:green} mvnsite {color} | {color:green} 0m 24s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} whitespace {color} | {color:green} 0m 0s{color} | {color:green} The patch has no whitespace issues. {color} |
| {color:green}+1{color} | {color:green} shadedclient {color} | {color:green} 12m 2s{color} | {color:green} patch has no errors when building and testing our client artifacts. {color} |
| {color:green}+1{color} | {color:green} findbugs {color} | {color:green} 0m 49s{color} | {color:green} the patch passed {color} |
| {color:green}+1{color} | {color:green} javadoc {color} | {color:green} 0m 14s{color} | {color:green} the patch passed {color} |
|| || || || {color:brown} Other Tests {color} ||
| {color:green}+1{color} | {color:green} unit {color} | {color:green} 10m 44s{color} | {color:green} hadoop-yarn-services-core in the patch passed. {color} |
| {color:green}+1{color} | {color:green} asflicense {color} | {color:green} 0m 23s{color} | {color:green} The patch does not generate ASF License warnings. {color} |
| {color:black}{color} | {color:black} {color} | {color:black} 65m 11s{color} | {color:black} {color} |

|| Subsystem || Report/Notes ||
| Docker | Client=17.05.0-ce Server=17.05.0-ce Image:yetus/hadoop:abb62dd |
| JIRA Issue | YARN-8243 |
| JIRA Patch URL | https://issues.apache.org/jira/secure/attachment/12922399/YARN-8243.01.patch |
| Optional Tests | asflicense compile javac javadoc mvninstall mvnsite unit shadedclient findbugs checkstyle |
| uname | Linux 1e08a92c808e 3.13.0-137-generic #186-Ubuntu SMP Mon Dec 4 19:09:19 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | /testptch/patchprocess/precommit/personality/provided.sh |
| git revision | trunk / 2916c90 |
| maven | version: Apache Maven 3.3.9 |
| Default Java | 1.8.0_162 |
| findbugs | v3.1.0-RC1 |
| checkstyle | https://builds.apache.org/job/PreCommit-YARN-Build/20629/artifact/out/diff-checkstyle-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-applications_hadoop-yarn-services_hadoop-yarn-services-core.txt |
| Test Results | https://builds.apache.org/job/PreCommit-YARN-Build/20629/testReport/ |
| Max. process+thread count | 762 (vs. ulimit of 1) |
| modules | C:
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466924#comment-16466924 ] Gour Saha commented on YARN-8243:

bq. It seems that, based on the component instance ID assignment, we will always have to remove the instance with the highest ID first.

Isn't that what we want? I put up a 01 patch which fixes the issues. For a flex down when it actually has to destroy a container, it always chooses the highest value of n for a component of name "ping" and component instance of the format "ping-n".

[~billie.rinaldi], are you thinking of exposing the sort order to the service owner as a configurable comparator instead of the dev chosen default?
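The selection rule described in this comment — for a component named "ping", always destroy the instance whose name "ping-n" carries the highest n — can be sketched as follows; `highestInstance` is a hypothetical helper, not the actual patch code:

```java
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Sketch of the flex-down selection rule under discussion: among instance
// names of the form "<component>-<n>", pick the one with the largest n.
class HighestInstance {
    static Optional<String> highestInstance(String component, List<String> names) {
        String prefix = component + "-";
        return names.stream()
            .filter(n -> n.startsWith(prefix))
            .max(Comparator.comparingInt(
                n -> Integer.parseInt(n.substring(prefix.length()))));
    }
}
```

For instances ping-0, ping-4, ping-2, this picks ping-4 for destruction, matching the COMPINSTANCE ping-4 line in the log above.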
[jira] [Commented] (YARN-8243) Flex down should first remove pending container requests (if any) and then kill running containers
[ https://issues.apache.org/jira/browse/YARN-8243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16466259#comment-16466259 ] Billie Rinaldi commented on YARN-8243:

I think one problem here is that the ComponentInstance compareTo method is not implementing the sorting we would like. It seems that, based on the component instance ID assignment, we will always have to remove the instance with the highest ID first. This would not solve the general problem of removing running containers before pending instances, but it would solve the issue seen in the specific example provided in the description.
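Why removing the highest ID first matters can be seen in a small simulation of the instance-ID bookkeeping the comments refer to; the class and field names here are illustrative, not the actual Component code:

```java
import java.util.ArrayList;
import java.util.List;

// Small simulation of instance-ID bookkeeping: if flex down always removes
// the highest ID, the ID counter can be rolled back in step, and a later
// flex up reuses contiguous IDs with no holes and no duplicates.
class IdCounterSim {
    final List<Integer> ids = new ArrayList<>();
    int instanceIdCounter = 0;

    void flexUp(int delta) {
        for (int i = 0; i < delta; i++) {
            ids.add(instanceIdCounter++); // assign the next sequential ID
        }
    }

    void flexDown(int delta) {
        for (int i = 0; i < delta; i++) {
            // Remove the highest ID, then decrement the counter to match.
            ids.remove(Integer.valueOf(--instanceIdCounter));
        }
    }
}
```

Flexing up by 3 yields IDs 0, 1, 2; flexing down by 1 removes 2 and rolls the counter back, so a later flex up reuses ID 2 instead of producing a duplicate comp-2 or a hole in the list.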