[ 
https://issues.apache.org/jira/browse/YARN-214?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13498941#comment-13498941
 ] 

Robert Joseph Evans commented on YARN-214:
------------------------------------------

Now that I understand the code better, I think that ignoring the EXPIRE at the 
RUNNING state makes sense.  The EXPIRE event only happens when a container has 
been waiting in allocated for more than 10 min (default config).  This really 
would only happen when an App has gotten a container and forgotten about it, or 
when the RM is running very slowly and has not processed the transition events 
by the time the EXPIRE event is sent.

We register for the EXPIRE event in the AcquiredTransition going to the ACQUIRED 
state, so we need to handle the EXPIRE event in every state that is reachable 
from the ACQUIRED state and has not already processed the EXPIRE event.  This 
means we need to handle it in the KILLED, RUNNING, COMPLETED, and RELEASED 
states.  We need to add this to KILLED and RELEASED too.
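
To make the reachability argument concrete, here is a rough, self-contained 
sketch of a transition table in which a late EXPIRE is a no-op in every state 
reachable from ACQUIRED.  It is a toy model only: the class ExpireSketch, its 
enums, and the handle() helper are hypothetical and do not use the real 
StateMachineFactory or RMContainerImpl code.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

// Toy model: hypothetical names, not the real RMContainerImpl state machine.
class ExpireSketch {
    enum State { ACQUIRED, RUNNING, COMPLETED, KILLED, RELEASED, EXPIRED }
    enum Event { LAUNCHED, FINISHED, KILL, RELEASED, EXPIRE }

    // Transition table: current state -> (event -> next state).
    private static final Map<State, Map<Event, State>> TABLE =
        new EnumMap<State, Map<Event, State>>(State.class);

    private static void add(State from, State to, Event on) {
        TABLE.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
             .put(on, to);
    }

    static {
        // Transitions out of ACQUIRED, where the expiry timer is registered.
        add(State.ACQUIRED, State.RUNNING, Event.LAUNCHED);
        add(State.ACQUIRED, State.COMPLETED, Event.FINISHED);
        add(State.ACQUIRED, State.KILLED, Event.KILL);
        add(State.ACQUIRED, State.RELEASED, Event.RELEASED);
        add(State.ACQUIRED, State.EXPIRED, Event.EXPIRE);

        add(State.RUNNING, State.COMPLETED, Event.FINISHED);
        add(State.RUNNING, State.KILLED, Event.KILL);

        // A late EXPIRE can arrive in any state reachable from ACQUIRED,
        // so each of them gets a self-transition that ignores it.
        for (State s : EnumSet.of(State.RUNNING, State.COMPLETED,
                                  State.KILLED, State.RELEASED)) {
            add(s, s, Event.EXPIRE);
        }
    }

    // Returns the next state, or throws if no transition is registered.
    static State handle(State current, Event event) {
        Map<Event, State> row = TABLE.get(current);
        if (row == null || !row.containsKey(event)) {
            throw new IllegalStateException(
                "Invalid event: " + event + " at " + current);
        }
        return row.get(event);
    }
}
```

Without the self-transitions in the loop, handle(State.RUNNING, Event.EXPIRE) 
would throw, which is the toy equivalent of the error in the report below.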
                
> RMContainerImpl does not handle event EXPIRE at state RUNNING
> -------------------------------------------------------------
>
>                 Key: YARN-214
>                 URL: https://issues.apache.org/jira/browse/YARN-214
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 0.23.3, 2.0.1-alpha
>            Reporter: Jason Lowe
>            Assignee: Jonathan Eagles
>         Attachments: YARN-214.patch, YARN-214.patch, YARN-214.patch, 
> YARN-214.patch
>
>
> RMContainerImpl has a race condition where a container can enter the RUNNING 
> state just as the container expires.  This results in an invalid event 
> transition error:
> {noformat}
> 2012-11-11 05:31:38,954 [ResourceManager Event Processor] ERROR org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Can't handle this event at current state
> org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: EXPIRE at RUNNING
>         at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:301)
>         at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:43)
>         at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:443)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:205)
>         at org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl.handle(RMContainerImpl.java:44)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApp.containerCompleted(SchedulerApp.java:203)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1337)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainer(CapacityScheduler.java:739)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:659)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:80)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:340)
>         at java.lang.Thread.run(Thread.java:619)
> {noformat}
> EXPIRE needs to be handled (or at least ignored) in the RUNNING state to 
> account for this race condition.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
