[
https://issues.apache.org/jira/browse/YARN-4502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15092612#comment-15092612
]
Vinod Kumar Vavilapalli commented on YARN-4502:
-----------------------------------------------
[~leftnoteasy], I started writing a test for this assuming the previous
hypothesis, and it doesn't add up.
bq. After YARN-3535, all containers transition from ALLOCATED to KILLED state
will be re-added to scheduler. And such resource request will be added to
current scheduler application attempt.
Two cases here:
# If the container (in ALLOCATED state) got killed before the AM crash, it
will get added to app-attempt #1, so this bug won't happen.
# An allocated container simply doesn't survive an AM crash (whether
keepContainerAcrossApplicationAttempt is on or off): the scheduler itself kills
all allocated containers right after the AM crashes, as part of
{{doneApplicationAttempt()}}. These killed containers also get added to
app-attempt #1, because the current app-attempt is not switched until
{{addApplicationAttempt()}} comes in for the new app-attempt.
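To make the ordering argument concrete, here is a toy model of the two cases. This is only a sketch and not the scheduler code: {{SchedulerApp}}, {{simulate()}} and the container name {{c2}} are hypothetical, and the only behaviour it assumes is what is described above, namely that a killed ALLOCATED container is re-added against whatever the current app-attempt is, and that the current app-attempt only changes in {{addApplicationAttempt()}}.

```java
import java.util.ArrayList;
import java.util.List;

public class AttemptSwitchSketch {

    // Hypothetical stand-in for the scheduler's per-application state.
    static class SchedulerApp {
        int currentAttempt = 1;
        final List<String> allocated = new ArrayList<>();
        // Records which attempt each killed container was re-added to.
        final List<String> reAdded = new ArrayList<>();

        void allocate(String containerId) {
            allocated.add(containerId);
        }

        // ALLOCATED -> KILLED: re-add against the *current* attempt.
        void killAllocated(String containerId) {
            if (allocated.remove(containerId)) {
                reAdded.add(containerId + "->attempt#" + currentAttempt);
            }
        }

        // On AM crash the scheduler kills every still-allocated container.
        void doneApplicationAttempt() {
            for (String containerId : new ArrayList<>(allocated)) {
                killAllocated(containerId);
            }
        }

        // Only here does the current attempt switch to the new one.
        void addApplicationAttempt() {
            currentAttempt++;
        }
    }

    // Runs one of the two cases and returns where the container was re-added.
    static List<String> simulate(boolean killedBeforeAmCrash) {
        SchedulerApp app = new SchedulerApp();
        app.allocate("c2");
        if (killedBeforeAmCrash) {
            app.killAllocated("c2");   // case 1: killed while attempt #1 is alive
        }
        app.doneApplicationAttempt();  // AM crash; case 2 kills c2 here
        app.addApplicationAttempt();   // attempt #2 becomes current
        return app.reAdded;
    }

    public static void main(String[] args) {
        System.out.println("case 1: " + simulate(true));
        System.out.println("case 2: " + simulate(false));
    }
}
```

In this model both cases print {{[c2->attempt#1]}}: nothing is ever re-added against attempt #2, which is the part that does not fit the earlier hypothesis.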
So, it doesn't look like our previous analysis is right. /cc [~jianhe]
[~yeshavora], do you have the RM logs?
> Sometimes Two AM containers get launched
> ----------------------------------------
>
> Key: YARN-4502
> URL: https://issues.apache.org/jira/browse/YARN-4502
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: Yesha Vora
> Assignee: Wangda Tan
> Priority: Critical
>
> Scenario :
> * set yarn.resourcemanager.am.max-attempts = 2
> * start dshell application
> {code}
> yarn org.apache.hadoop.yarn.applications.distributedshell.Client -jar
> hadoop-yarn-applications-distributedshell-*.jar
> -attempt_failures_validity_interval 60000 -shell_command "sleep 150"
> -num_containers 16
> {code}
> * Kill AM pid
> * Print container list for 2nd attempt
> {code}
> yarn container -list appattempt_1450825622869_0001_000002
> INFO impl.TimelineClientImpl: Timeline service address:
> http://xxx:port/ws/v1/timeline/
> INFO client.RMProxy: Connecting to ResourceManager at xxx/10.10.10.10:<port>
> Total number of containers :2
> Container-Id Start Time Finish Time
> State Host Node Http Address
> LOG-URL
> container_e12_1450825622869_0001_02_000002 Tue Dec 22 23:07:35 +0000 2015
> N/A RUNNING xxx:25454 http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000002/hrt_qa
> container_e12_1450825622869_0001_02_000001 Tue Dec 22 23:07:34 +0000 2015
> N/A RUNNING xxx:25454 http://xxx:8042
> http://xxx:8042/node/containerlogs/container_e12_1450825622869_0001_02_000001/hrt_qa
> {code}
> * look for new AM pid
> Here, the 2nd AM container was supposed to be started on
> container_e12_1450825622869_0001_02_000001. But the AM was not launched on
> container_e12_1450825622869_0001_02_000001. It was in ACQUIRED state.
> On the other hand, container_e12_1450825622869_0001_02_000002 got the AM running.
> Expected behavior: RM should not start 2 containers for starting AM