[
https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742175#comment-16742175
]
lujie commented on YARN-9194:
-----------------------------
Hi [~wilfreds],
Thanks for your kind review!
{code:java}
That looks strange: assigning a container on a node that has already been
removed.
{code}
I think this is the root cause of YARN-9193: it is what causes the NPE. When I
fixed it, I only wanted to prevent the error from happening. I think it is hard to:
{code:java}
check the proposal and make sure that it is declined when the node is removed
{code}
because even if we add a check of the node status, the node can shut down after
the check passes. It is hard for us to control the timing of the node shutdown.
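To illustrate the timing problem, here is a minimal sketch of the check-then-act race. It is a simplified model, not the scheduler code: NodeRaceSketch, its methods, and the Runnable hook that stands in for a node shutdown are all hypothetical names introduced for illustration.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical, simplified model; none of these names are real YARN classes.
final class NodeRaceSketch {
    // nodeId -> host name of a live node
    private final Map<String, String> nodes = new ConcurrentHashMap<>();

    void addNode(String nodeId, String host) {
        nodes.put(nodeId, host);
    }

    void removeNode(String nodeId) {
        nodes.remove(nodeId);
    }

    // Check-then-act: the status check passes, but the node can still be
    // removed before it is used. The hook simulates a shutdown at that moment.
    String allocateWithCheckOnly(String nodeId, Runnable betweenCheckAndUse) {
        if (!nodes.containsKey(nodeId)) {
            return "DECLINED";
        }
        betweenCheckAndUse.run();
        String host = nodes.get(nodeId);             // may be null by now
        return "ALLOCATED on " + host.toUpperCase(); // NPE if the node is gone
    }

    // Single lookup plus a null guard: a node removed at any earlier point
    // leads to a declined allocation instead of an NPE.
    String allocateWithNullGuard(String nodeId, Runnable beforeLookup) {
        beforeLookup.run();
        String host = nodes.get(nodeId);
        if (host == null) {
            return "DECLINED";
        }
        return "ALLOCATED on " + host.toUpperCase();
    }
}
```

In this model the extra status check does not close the window; only the guard at the point of use does, which is the reason for preventing the NPE rather than trying to control when the node shuts down.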
{code:java}
The container is never started and the failure is inside the scheduler. Failing
an application when that happens is I don't think the correct action.
{code}
Yes, I think the FAILED state is improper here. Meanwhile, from the log I found
that CONTAINER_FINISHED was sent to the application later and changed the
application state from SCHEDULED to FINAL_SAVING. So I think we should keep the
state as SCHEDULED, just like in this case:
{code:java}
// Note that YarnScheduler#allocate is not guaranteed to be able to
// fetch it since container may not be fetchable for some reason like
// DNS unavailable causing container token not generated. As such, we
// return to the previous state and keep retry until am container is
// fetched.
if (amContainerAllocation.getContainers().size() == 0) {
  appAttempt.retryFetchingAMContainer(appAttempt);
  return RMAppAttemptState.SCHEDULED;
}
{code}
So I think that preventing the NPE and keeping the SCHEDULED state are OK
here. It is too hard to fix the root cause.
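As a rough sketch of what "keep the SCHEDULED state" could look like, the transition below stays in SCHEDULED when no AM container could be fetched, whether the list is empty or missing. This is only an illustration under assumptions: AmAllocationSketch, AttemptState, and onContainerAllocated are hypothetical names, and the ALLOCATED_SAVING target is assumed for the success path.

```java
import java.util.List;

// Hypothetical, simplified model of the attempt transition; not the real
// RMAppAttemptImpl code.
final class AmAllocationSketch {
    enum AttemptState { SCHEDULED, ALLOCATED_SAVING, FAILED }

    // Returns the next attempt state after an allocation round.
    static AttemptState onContainerAllocated(List<String> allocatedContainers) {
        // Nothing fetched (for example because the node was removed):
        // stay in SCHEDULED so the fetch is retried, instead of throwing
        // an NPE or moving the attempt to FAILED.
        if (allocatedContainers == null || allocatedContainers.isEmpty()) {
            return AttemptState.SCHEDULED;
        }
        return AttemptState.ALLOCATED_SAVING;
    }
}
```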
> Invalid event: REGISTERED at FAILED
> -----------------------------------
>
> Key: YARN-9194
> URL: https://issues.apache.org/jira/browse/YARN-9194
> Project: Hadoop YARN
> Issue Type: Bug
> Reporter: lujie
> Assignee: lujie
> Priority: Major
> Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch,
> hadoop-hires-resourcemanager-hadoop11.log
>
>
> When the attempt has failed and the REGISTERED event then arrives, an
> InvalidStateTransitionException is thrown.
>
> {code:java}
> 2019-01-13 00:41:57,127 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547311267249_0001_000002 can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: REGISTERED at FAILED
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)