[
https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742061#comment-16742061
]
Wilfred Spiegelenburg commented on YARN-9194:
---------------------------------------------
Thank you for logging this jira [~xiaoheipangzi].
When I look at the logs I get the impression that the issue is in the way we
lock and/or track the node:
{code}
2019-01-13 08:52:11,249 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: hadoop15:43450
Node Transitioned from RUNNING to SHUTDOWN
2019-01-13 08:52:15,221 INFO
org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1547340702286_0001_01_000001 Container Transitioned from NEW to
ALLOCATED
2019-01-13 08:52:15,224 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode:
Assigned container container_1547340702286_0001_01_000001 of capacity
<memory:2048, vCores:1> on host hadoop15:43450, which has 1 containers,
<memory:2048, vCores:1> used and <memory:6144, vCores:7> available after
allocation
...
2019-01-13 08:52:15,227 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=root usedCapacity=0.125 absoluteUsedCapacity=0.125
used=<memory:2048, vCores:1> cluster=<memory:16384, vCores:16>
2019-01-13 08:52:15,234 INFO
org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Allocation proposal accepted
{code}
Based on this log there is a 4 second gap between the node removal and the
acceptance of the allocation. The node was removed *4 seconds* before the
allocation on the {{FiCaSchedulerNode}} was finished and the scheduler confirmed
the allocation. That looks strange: assigning a container on a node that has
already been removed. Based on this we should probably check the proposal and
make sure that it is declined when the node has been removed.
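To make that idea concrete, here is a very rough sketch of such a guard. It is
only an illustration, not the CapacityScheduler code: the {{liveNodes}} map, the
{{AllocationProposal}} class and the {{accept()}} method are simplified stand-ins
for the real proposal/commit path.
{code:java}
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ProposalGuardSketch {

  // Simplified stand-in for the scheduler's view of currently registered nodes.
  private final Map<String, Object> liveNodes = new ConcurrentHashMap<>();

  // Simplified stand-in for an allocation proposal produced asynchronously.
  static final class AllocationProposal {
    final String nodeId;
    final String containerId;

    AllocationProposal(String nodeId, String containerId) {
      this.nodeId = nodeId;
      this.containerId = containerId;
    }
  }

  /**
   * Decline the proposal outright when the node it was built against has since
   * been removed (for example transitioned from RUNNING to SHUTDOWN), instead
   * of accepting the allocation and letting it fail later inside the scheduler.
   */
  boolean accept(AllocationProposal proposal) {
    if (!liveNodes.containsKey(proposal.nodeId)) {
      System.out.println("Declining " + proposal.containerId
          + ": node " + proposal.nodeId + " is no longer registered");
      return false;
    }
    // ... normal commit path: re-check queue and application limits, then
    // confirm the allocation on the node.
    return true;
  }
}
{code}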
I also don't think it is a good idea to fail the application in this case. The
container is never started and the failure is inside the scheduler. Failing the
application when that happens does not seem like the correct action to me.
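On the state machine side, the usual way these transition tables cope with an
event that can legitimately arrive late is an explicit no-op arc in the terminal
state, so the event is recorded instead of blowing up the dispatcher. The snippet
below is only a self-contained toy example of that pattern built on
{{StateMachineFactory}}; the state and event names are made up and this is not
the attached patch.
{code:java}
import org.apache.hadoop.yarn.state.StateMachine;
import org.apache.hadoop.yarn.state.StateMachineFactory;

public class LateEventDemo {

  enum AttemptState { RUNNING, FAILED }
  enum AttemptEventType { FAIL, REGISTERED }

  static final class AttemptEvent {
    final AttemptEventType type;
    AttemptEvent(AttemptEventType type) { this.type = type; }
  }

  // Toy transition table: FAIL moves RUNNING -> FAILED; REGISTERED is a no-op
  // in RUNNING and, crucially, also in FAILED, so a late registration is
  // tolerated instead of raising InvalidStateTransitionException.
  private static final StateMachineFactory<LateEventDemo, AttemptState,
      AttemptEventType, AttemptEvent> factory =
        new StateMachineFactory<LateEventDemo, AttemptState,
            AttemptEventType, AttemptEvent>(AttemptState.RUNNING)
          .addTransition(AttemptState.RUNNING, AttemptState.FAILED,
              AttemptEventType.FAIL)
          .addTransition(AttemptState.RUNNING, AttemptState.RUNNING,
              AttemptEventType.REGISTERED)
          .addTransition(AttemptState.FAILED, AttemptState.FAILED,
              AttemptEventType.REGISTERED)
          .installTopology();

  public static void main(String[] args) {
    StateMachine<AttemptState, AttemptEventType, AttemptEvent> sm =
        factory.make(new LateEventDemo());
    sm.doTransition(AttemptEventType.FAIL,
        new AttemptEvent(AttemptEventType.FAIL));
    // Without the FAILED -> FAILED arc above, this late REGISTERED would
    // throw, just like in the attached resourcemanager log.
    sm.doTransition(AttemptEventType.REGISTERED,
        new AttemptEvent(AttemptEventType.REGISTERED));
    System.out.println("final state: " + sm.getCurrentState());
  }
}
{code}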
> Invalid event: REGISTERED at FAILED
> -----------------------------------
>
> Key: YARN-9194
> URL: https://issues.apache.org/jira/browse/YARN-9194
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Assignee: lujie
> Priority: Major
> Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch,
> hadoop-hires-resourcemanager-hadoop11.log
>
>
> While the attempt is failing, the REGISTERED event arrives, hence the
> InvalidStateTransitionException happens.
>
> {code:java}
> 2019-01-13 00:41:57,127 ERROR
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
> App attempt: appattempt_1547311267249_0001_000002 can't handle this event at
> current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event:
> REGISTERED at FAILED
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at
> org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at
> org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>