[
https://issues.apache.org/jira/browse/YARN-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13977886#comment-13977886
]
Wangda Tan commented on YARN-1842:
----------------------------------
Took a look at this; I'm wondering if it's caused by the following case:
1) The client asked to kill the application.
2) After the RM transitioned the application's state to KILLED, but before the AM container was actually killed by the NM, the AM asked to finish the application.
Since the RMAppAttempt has already called AMS.unregisterAttempt, the attempt is cleaned from the cache, so the InvalidApplicationMasterRequestException is raised.
I came to this guess after reading the log uploaded by [~keyki].
Everything still looks fine in the following log:
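The race above can be sketched as a toy simulation. Note this is illustrative only: the class and method names merely mirror ApplicationMasterService, and IllegalStateException stands in for the real InvalidApplicationMasterRequestException.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Toy model of the AMS attempt cache; names mirror the real
// ApplicationMasterService but this is NOT the actual RM code.
class ToyAMService {
    private final Map<String, Object> responseMap = new ConcurrentHashMap<>();

    void registerAttempt(String attemptId) {
        responseMap.put(attemptId, new Object());
    }

    // Kill path: RMAppAttempt calls this, removing the attempt from the cache.
    void unregisterAttempt(String attemptId) {
        responseMap.remove(attemptId);
    }

    // AM's finish call: fails once the attempt has left the cache,
    // which is exactly the window between the kill and the NM killing
    // the AM container.
    void finishApplicationMaster(String attemptId) {
        if (!responseMap.containsKey(attemptId)) {
            // IllegalStateException stands in for
            // InvalidApplicationMasterRequestException here.
            throw new IllegalStateException(
                "Application doesn't exist in cache " + attemptId);
        }
        responseMap.remove(attemptId);
    }
}
```

Run register, then the kill-path unregisterAttempt, then the AM's finish call: the finish call throws, reproducing the ordering in the logs below.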
{code}
2014-03-18 19:36:50,802 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1395167286771_0002 State change from ACCEPTED to RUNNING
2014-03-18 19:36:52,534 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1395167286771_0002_01_000002 Container Transitioned from NEW to
ALLOCATED
2014-03-18 19:36:52,534 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=keyki
OPERATION=AM Allocated Container TARGET=SchedulerApp RESULT=SUCCESS
APPID=application_1395167286771_0002
CONTAINERID=container_1395167286771_0002_01_000002
2014-03-18 19:36:52,534 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode:
Assigned container container_1395167286771_0002_01_000002 of capacity
<memory:1024, vCores:1> on host localhost:56214, which currently has 2
containers, <memory:2048, vCores:2> used and <memory:6144, vCores:6> available
2014-03-18 19:36:52,534 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
assignedContainer application=application_1395167286771_0002
container=Container: [ContainerId: container_1395167286771_0002_01_000002,
NodeId: localhost:56214, NodeHttpAddress: localhost:8042, Resource:
<memory:1024, vCores:1>, Priority: 1, Token: Token { kind: ContainerToken,
service: 127.0.0.1:56214 }, ]
containerId=container_1395167286771_0002_01_000002 queue=default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:1024, vCores:1>usedCapacity=0.125,
absoluteUsedCapacity=0.125, numApps=1, numContainers=1 usedCapacity=0.125
absoluteUsedCapacity=0.125 used=<memory:1024, vCores:1> cluster=<memory:8192,
vCores:8>
2014-03-18 19:36:52,534 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting assigned queue: root.default stats: default: capacity=1.0,
absoluteCapacity=1.0, usedResources=<memory:2048, vCores:2>usedCapacity=0.25,
absoluteUsedCapacity=0.25, numApps=1, numContainers=2
2014-03-18 19:36:52,535 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=root usedCapacity=0.25 absoluteUsedCapacity=0.25
used=<memory:2048, vCores:2> cluster=<memory:8192, vCores:8>
2014-03-18 19:36:52,961 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1395167286771_0002_01_000002 Container Transitioned from ALLOCATED to
ACQUIRED
2014-03-18 19:36:53,536 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1395167286771_0002_01_000002 Container Transitioned from ACQUIRED to
RUNNING
{code}
The client asked to kill the application, AMS.unregisterAttempt was called, and the attempt was removed from the AMS cache:
{code}
2014-03-18 19:38:50,427 INFO
org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=keyki
IP=37.139.29.192 OPERATION=Kill Application Request
TARGET=ClientRMService RESULT=SUCCESS APPID=application_1395167286771_0002
2014-03-18 19:38:50,427 INFO
org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore: Removing
info for app: application_1395167286771_0002
2014-03-18 19:38:50,427 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1395167286771_0002 State change from RUNNING to KILLED
2014-03-18 19:38:50,428 INFO
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Unregistering app attempt : appattempt_1395167286771_0002_000001
{code}
After that, the AM asked to finish the application, but unfortunately the attempt had already been removed from the cache:
{code}
2014-03-18 19:38:51,397 ERROR
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
AppAttemptId doesnt exist in cache appattempt_1395167286771_0002_000001
2014-03-18 19:38:52,415 ERROR
org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Application doesn't exist in cache appattempt_1395167286771_0002_000001
{code}
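One hedged way for an AM to cope with this on shutdown (a sketch only, not the real YARN API: `Unregister` stands in for the actual AMRMClient unregister call, and the message match is a toy stand-in for catching InvalidApplicationMasterRequestException) is to treat the "doesn't exist in cache" failure as benign, since it means the RM already tore the attempt down after the kill:

```java
// Hedged sketch: swallow the "attempt not in cache" failure during
// shutdown, since it indicates the RM has already unregistered the
// attempt in response to a client-side kill.
class ShutdownHelper {
    interface Unregister {
        // Stands in for the real AMRMClient unregister call.
        void run() throws Exception;
    }

    // Returns true if unregistration succeeded or the RM had already
    // done it; false for any other failure.
    static boolean safeUnregister(Unregister call) {
        try {
            call.run();
            return true;
        } catch (Exception e) {
            // In the real API this would be a catch of
            // InvalidApplicationMasterRequestException; modeled here
            // as a message check on a generic exception.
            return e.getMessage() != null
                && e.getMessage().contains("doesn't exist in cache");
        }
    }
}
```

With this, an AM racing a client kill exits cleanly instead of propagating the stack trace reported in the issue; any other unregister failure still surfaces as an error.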
I'm not sure whether this is possible in the current Hoya design; please correct me if I'm wrong.
> InvalidApplicationMasterRequestException raised during AM-requested shutdown
> ----------------------------------------------------------------------------
>
> Key: YARN-1842
> URL: https://issues.apache.org/jira/browse/YARN-1842
> Project: Hadoop YARN
> Issue Type: Bug
> Components: resourcemanager
> Affects Versions: 2.3.0
> Reporter: Steve Loughran
> Priority: Minor
> Attachments: hoyalogs.tar.gz
>
>
> Report of the RM raising a stack trace
> [https://gist.github.com/matyix/9596735] during AM-initiated shutdown. The AM
> could just swallow this and exit, but it could be a sign of a race condition
> YARN-side, or maybe just in the RM client code/AM dual signalling the
> shutdown.
> I haven't replicated this myself; maybe the stack will help track down the
> problem. Otherwise: what is the policy YARN apps should adopt for AM's
> handling errors on shutdown? go straight to an exit(-1)?
--
This message was sent by Atlassian JIRA
(v6.2#6252)