[ 
https://issues.apache.org/jira/browse/YARN-9194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16742175#comment-16742175
 ] 

lujie commented on YARN-9194:
-----------------------------

Hi [~wilfreds],

Thanks for your kind review!
{code:java}
That looks strange: assigning a container on a node that has already been 
removed. 
{code}
I think this is the root cause of bug YARN-9193: it causes the NPE. When I 
fixed it, I only wanted to prevent the error from happening. I think it is 
hard to:
{code:java}
check the proposal and make sure that it is declined when the node is removed
{code}
because even if we add a check of the node status, the node can shut down 
after the check has passed. It is hard for us to control the timing of a node 
shutdown.
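
To illustrate the timing problem described above, here is a minimal sketch of a check-then-act race. It is not YARN code: the class NodeLookupSketch, its methods, and the use of a plain map as a node registry are all hypothetical stand-ins, and the Runnable parameter only simulates a node shutdown arriving at an unlucky moment.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical stand-in for a scheduler's node registry (not YARN code).
class NodeLookupSketch {
    private final ConcurrentMap<String, String> nodes = new ConcurrentHashMap<>();

    void addNode(String nodeId, String host) {
        nodes.put(nodeId, host);
    }

    void removeNode(String nodeId) {
        nodes.remove(nodeId);
    }

    // Check-then-act: the node can disappear between the check and the use,
    // so the status check alone does not prevent the NPE.
    String hostCheckThenAct(String nodeId, Runnable betweenCheckAndUse) {
        if (!nodes.containsKey(nodeId)) {
            return "declined";
        }
        betweenCheckAndUse.run();        // simulated node shutdown
        String host = nodes.get(nodeId); // may now be null
        return host.toUpperCase();       // throws NullPointerException if removed
    }

    // Defensive variant: one lookup and a null check at the point of use.
    String hostDefensive(String nodeId, Runnable beforeUse) {
        beforeUse.run();                 // simulated node shutdown
        String host = nodes.get(nodeId);
        if (host == null) {
            return "declined";
        }
        return host.toUpperCase();
    }

    public static void main(String[] args) {
        NodeLookupSketch sketch = new NodeLookupSketch();
        sketch.addNode("n1", "host1");
        System.out.println(sketch.hostDefensive("n1", () -> sketch.removeNode("n1")));
    }
}
```

The defensive variant does not remove the race; it only makes the late removal harmless at the point where the node is used, which is the kind of fix argued for here.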
{code:java}
The container is never started and the failure is inside the scheduler. Failing 
an application when that happens is I don't think the correct action.
{code}
Yes, I think the FAILED state is improper here. Meanwhile, from the log I 
found that CONTAINER_FINISHED was sent to the application later and changed 
the application state from SCHEDULED to FINAL_SAVING. So I think we should 
keep the state as SCHEDULED, just like in this case:
{code:java}
// Note that YarnScheduler#allocate is not guaranteed to be able to
// fetch it since container may not be fetchable for some reason like
// DNS unavailable causing container token not generated. As such, we
// return to the previous state and keep retry until am container is
// fetched.
if (amContainerAllocation.getContainers().size() == 0) {
   appAttempt.retryFetchingAMContainer(appAttempt);
   return RMAppAttemptState.SCHEDULED;
}
{code}
So I think that preventing the NPE and keeping the SCHEDULED state are OK 
here. It is too hard to fix the root cause.
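
As a rough sketch of that idea (not the attached patch), the guard could treat a missing allocation the same way as an empty one and stay in SCHEDULED. The class AmAllocationSketch, the method onContainerAllocated, and the simplified two-value enum are hypothetical names used only for illustration.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical, simplified model of the attempt state decision (not YARN code).
class AmAllocationSketch {
    enum AttemptState { SCHEDULED, ALLOCATED_SAVING }

    // Assumption: a null list models the case where the allocation could not
    // be fetched at all, for example because the node was already removed.
    static AttemptState onContainerAllocated(List<String> containers) {
        if (containers == null || containers.isEmpty()) {
            // Nothing usable yet: stay in SCHEDULED and retry later
            // instead of failing the attempt.
            return AttemptState.SCHEDULED;
        }
        return AttemptState.ALLOCATED_SAVING;
    }

    public static void main(String[] args) {
        System.out.println(onContainerAllocated(null));
        System.out.println(onContainerAllocated(Collections.singletonList("container_1")));
    }
}
```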

> Invalid event: REGISTERED at FAILED
> -----------------------------------
>
>                 Key: YARN-9194
>                 URL: https://issues.apache.org/jira/browse/YARN-9194
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: lujie
>            Assignee: lujie
>            Priority: Major
>         Attachments: YARN-9194_1.patch, YARN-9194_2.patch, YARN-9194_3.patch, 
> hadoop-hires-resourcemanager-hadoop11.log
>
>
> When a REGISTERED event arrives after the attempt has failed, an 
> InvalidStateTransitionException is thrown.
>  
> {code:java}
> 2019-01-13 00:41:57,127 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: App attempt: appattempt_1547311267249_0001_000002 can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: REGISTERED at FAILED
> at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:913)
> at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:121)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1073)
> at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:1054)
> at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:745)
> {code}
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
