[jira] [Updated] (YARN-4502) Sometimes Two AM containers get launched
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-4502: -- Attachment: YARN-4502-20160114.txt Tx for the review, [~leftnoteasy] bq. Do you think is it better to rename PREEMPT_CONTAINER/preemptContainer in CapacityScheduler to something like add_preemption_candidate or mark-container-for-preemption? Preempt-container looks very similar to kill-preempted-container. Similarily, FiCaSchedulerApp#preemptContainer. Makes sense, that is what I was thinking before too but didn't make the change. Uploading a new patch with that comment addressed. Also fixed whitespace and checkstyle issue. Cannot address the too-long-file warning. > Sometimes Two AM containers get launched > > > Key: YARN-4502 > URL: https://issues.apache.org/jira/browse/YARN-4502 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > Attachments: YARN-4502-20160114.txt, YARN-4502-20160212.txt > > > Scenario : > * set yarn.resourcemanager.am.max-attempts = 2 > * start dshell application > {code} > yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar > hadoop-yarn-applications-distributedshell-*.jar > -attempt_failures_validity_interval 6 -shell_command "sleep 150" > -num_containers 16 > {code} > * Kill AM pid > * Print container list for 2nd attempt > {code} > yarn container -list appattempt_1450825622869_0001_02 > INFO impl.TimelineClientImpl: Timeline service address: > http://xxx:port/ws/v1/timeline/ > INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10: > Total number of containers :2 > Container-Id Start Time Finish Time > StateHost Node Http Address >LOG-URL > container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa > container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa > {code} > * look for new AM pid > Here, 2nd AM container was suppose to be started on > container_e12_1450825622869_0001_02_01. But AM was not launched on > container_e12_1450825622869_0001_02_01. It was in AQUIRED state. > On other hand, container_e12_1450825622869_0001_02_02 got the AM running. > Expected behavior: RM should not start 2 containers for starting AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4502) Sometimes Two AM containers get launched
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kumar Vavilapalli updated YARN-4502: -- Attachment: YARN-4502-20160212.txt Here's a patch fixing this. The core fix is relatively concentrated. We now perform kill-container and recover-requests in one shot without all the events (which was broken anyways). Much of the remaining changes are for test-cases and a bit of renames and refactors. Summary of the changes: - AbstractYarnScheduler: Renamed existing completedContainer() method to be {{completedContainerInternal()}} and added a new {{completedContainer()}} which wraps around completedContainerInternal(), does common null-checks and also recovers ResourceRequests as needed - SchedulerEventType.KILL_CONTAINER ->SchedulerEventType.KILL_PREEMPTED_CONTAINER and DROP_RESERVATION -> KILL_RESERVED_CONTAINER - Got rid of ContainerRescheduledTransition completely. No need for container to send an event to the scheduler. Once this is removed, got rid of ContainerRescheduledEvent too. - Moved ContainerPreemptEvent from org.apache.hadoop.yarn.server.resourcemanager.scheduler package into org.apache.hadoop.yarn.server.resourcemanager.scheduler.event. That is where it belongs. - PreemptableResourceScheduler: Renamed APIs dropContainerReservation -> killReservedContainer and killContainer -> killPreemptedContainer - FiCaSchedulerApp.addPreemptContainer -> preemptContainer > Sometimes Two AM containers get launched > > > Key: YARN-4502 > URL: https://issues.apache.org/jira/browse/YARN-4502 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Vinod Kumar Vavilapalli >Priority: Critical > Attachments: YARN-4502-20160212.txt > > > Scenario : > * set yarn.resourcemanager.am.max-attempts = 2 > * start dshell application > {code} > yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar > hadoop-yarn-applications-distributedshell-*.jar > -attempt_failures_validity_interval 6 -shell_command "sleep 150" > -num_containers 16 > {code} > * Kill AM pid > * Print container list for 2nd attempt > {code} > yarn container -list appattempt_1450825622869_0001_02 > INFO impl.TimelineClientImpl: Timeline service address: > http://xxx:port/ws/v1/timeline/ > INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10: > Total number of containers :2 > Container-Id Start Time Finish Time > StateHost Node Http Address >LOG-URL > container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa > container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa > {code} > * look for new AM pid > Here, 2nd AM container was suppose to be started on > container_e12_1450825622869_0001_02_01. But AM was not launched on > container_e12_1450825622869_0001_02_01. It was in AQUIRED state. > On other hand, container_e12_1450825622869_0001_02_02 got the AM running. > Expected behavior: RM should not start 2 containers for starting AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4502) Sometimes Two AM containers get launched
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4502: - Labels: 2.6.4-candidate (was: ) > Sometimes Two AM containers get launched > > > Key: YARN-4502 > URL: https://issues.apache.org/jira/browse/YARN-4502 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Wangda Tan >Priority: Critical > Labels: 2.6.4-candidate > > Scenario : > * set yarn.resourcemanager.am.max-attempts = 2 > * start dshell application > {code} > yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar > hadoop-yarn-applications-distributedshell-*.jar > -attempt_failures_validity_interval 6 -shell_command "sleep 150" > -num_containers 16 > {code} > * Kill AM pid > * Print container list for 2nd attempt > {code} > yarn container -list appattempt_1450825622869_0001_02 > INFO impl.TimelineClientImpl: Timeline service address: > http://xxx:port/ws/v1/timeline/ > INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10: > Total number of containers :2 > Container-Id Start Time Finish Time > StateHost Node Http Address >LOG-URL > container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa > container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa > {code} > * look for new AM pid > Here, 2nd AM container was suppose to be started on > container_e12_1450825622869_0001_02_01. But AM was not launched on > container_e12_1450825622869_0001_02_01. It was in AQUIRED state. > On other hand, container_e12_1450825622869_0001_02_02 got the AM running. > Expected behavior: RM should not start 2 containers for starting AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4502) Sometimes Two AM containers get launched
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4502: - Labels: (was: 2.6.4-candidate) > Sometimes Two AM containers get launched > > > Key: YARN-4502 > URL: https://issues.apache.org/jira/browse/YARN-4502 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Wangda Tan >Priority: Critical > > Scenario : > * set yarn.resourcemanager.am.max-attempts = 2 > * start dshell application > {code} > yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar > hadoop-yarn-applications-distributedshell-*.jar > -attempt_failures_validity_interval 6 -shell_command "sleep 150" > -num_containers 16 > {code} > * Kill AM pid > * Print container list for 2nd attempt > {code} > yarn container -list appattempt_1450825622869_0001_02 > INFO impl.TimelineClientImpl: Timeline service address: > http://xxx:port/ws/v1/timeline/ > INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10: > Total number of containers :2 > Container-Id Start Time Finish Time > StateHost Node Http Address >LOG-URL > container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa > container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa > {code} > * look for new AM pid > Here, 2nd AM container was suppose to be started on > container_e12_1450825622869_0001_02_01. But AM was not launched on > container_e12_1450825622869_0001_02_01. It was in AQUIRED state. > On other hand, container_e12_1450825622869_0001_02_02 got the AM running. > Expected behavior: RM should not start 2 containers for starting AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (YARN-4502) Sometimes Two AM containers get launched
[ https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Junping Du updated YARN-4502: - Target Version/s: 2.6.4 > Sometimes Two AM containers get launched > > > Key: YARN-4502 > URL: https://issues.apache.org/jira/browse/YARN-4502 > Project: Hadoop YARN > Issue Type: Bug >Reporter: Yesha Vora >Assignee: Wangda Tan >Priority: Critical > > Scenario : > * set yarn.resourcemanager.am.max-attempts = 2 > * start dshell application > {code} > yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar > hadoop-yarn-applications-distributedshell-*.jar > -attempt_failures_validity_interval 6 -shell_command "sleep 150" > -num_containers 16 > {code} > * Kill AM pid > * Print container list for 2nd attempt > {code} > yarn container -list appattempt_1450825622869_0001_02 > INFO impl.TimelineClientImpl: Timeline service address: > http://xxx:port/ws/v1/timeline/ > INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10: > Total number of containers :2 > Container-Id Start Time Finish Time > StateHost Node Http Address >LOG-URL > container_e12_1450825622869_0001_02_02 Tue Dec 22 23:07:35 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_02/hrt_qa > container_e12_1450825622869_0001_02_01 Tue Dec 22 23:07:34 + 2015 > N/A RUNNINGxxx:25454 http://xxx:8042 > http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_01/hrt_qa > {code} > * look for new AM pid > Here, 2nd AM container was suppose to be started on > container_e12_1450825622869_0001_02_01. But AM was not launched on > container_e12_1450825622869_0001_02_01. It was in AQUIRED state. > On other hand, container_e12_1450825622869_0001_02_02 got the AM running. > Expected behavior: RM should not start 2 containers for starting AM -- This message was sent by Atlassian JIRA (v6.3.4#6332)