[ 
https://issues.apache.org/jira/browse/YARN-8331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573432#comment-16573432
 ] 

Jason Lowe commented on YARN-8331:
----------------------------------

Thanks for the patch, [~pradeepambati]!

In ContainersLauncher the exit code of a container that never launched should 
not be SUCCESS.  It should be handled the same way 
ContainerLaunch#validateContainerState handles a container that is killed 
before it is launched, and the error message should match as well since it is 
the same type of scenario.
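
A minimal sketch of that suggestion, using stand-in types rather than the real 
NodeManager classes.  The names KilledBeforeLaunchSketch, ExitEvent and 
killedBeforeLaunch are hypothetical, and the exit code values and diagnostic 
text are assumptions that should be checked against what 
ContainerLaunch#validateContainerState actually sends.

```java
// Stand-in types only; none of the real NodeManager classes are used.
final class KilledBeforeLaunchSketch {
    enum EventType { CONTAINER_EXITED_WITH_SUCCESS, CONTAINER_KILLED_ON_REQUEST }

    // Assumed exit codes following the 128 + signal number convention.
    static final int SUCCESS = 0;
    static final int FORCE_KILLED = 137; // 128 + SIGKILL (9)
    static final int TERMINATED = 143;   // 128 + SIGTERM (15)

    // Assumed diagnostic text; use whatever validateContainerState reports.
    static final String KILLED_BEFORE_LAUNCH = "Container terminated before launch.";

    static final class ExitEvent {
        final String containerId;
        final EventType type;
        final int exitCode;
        final String diagnostics;

        ExitEvent(String containerId, EventType type, int exitCode, String diagnostics) {
            this.containerId = containerId;
            this.type = type;
            this.exitCode = exitCode;
            this.diagnostics = diagnostics;
        }
    }

    // Hypothetical helper: one place builds the "killed before launch" event so
    // every caller reports the same non-success exit code and the same message.
    static ExitEvent killedBeforeLaunch(String containerId, boolean windows) {
        return new ExitEvent(containerId, EventType.CONTAINER_KILLED_ON_REQUEST,
            windows ? FORCE_KILLED : TERMINATED, KILLED_BEFORE_LAUNCH);
    }

    public static void main(String[] args) {
        ExitEvent e = killedBeforeLaunch("container_0_0000_01_000000", false);
        System.out.println(e.type + " exitCode=" + e.exitCode + " (" + e.diagnostics + ")");
    }
}
```

Building the event in one place is only one way to keep the two code paths 
consistent; the point is that neither path reports SUCCESS.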

Now that ContainersLauncher sends the killed-on-request event even when it 
doesn't think the container is being launched or running at all, do we need to 
add some event handling for CONTAINER_KILLED_ON_REQUEST to more states 
following KILLING?  I think it is now possible for ContainerImpl to receive 
more than one of these events, which the state machine does not expect.  For 
example, consider the following scenario:
# Container is in the SCHEDULED state waiting for the LAUNCHED event.
# Container receives the KILL event and sends a kill request to the launcher.
# In the meantime the container launch proceeds, starting a very quick process, 
and LAUNCHED is sent to the container.  The event is ignored because the 
container is in the KILLING state.
# The process runs extremely fast, so the launcher sends the 
CONTAINER_EXITED_WITH_SUCCESS event to the container and clears the container 
from ContainersLauncher.
# Container receives the exit success event in the KILLING state and moves to 
the EXITED_WITH_SUCCESS state.
# Launcher now finally receives the request to kill the container, but the 
container is no longer being tracked.  KILLED_ON_REQUEST is sent to the 
container as part of this patch.
# Container receives the KILLED_ON_REQUEST event in the EXITED_WITH_SUCCESS 
state, which is an invalid transition.  Similarly, if it had progressed to the 
DONE state, that would also be an invalid state transition.

I think we need to add no-op transitions for KILLED_ON_REQUEST in the states 
that can follow KILLING, to cover these race conditions.
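
The race and the proposed fix can be sketched with a toy transition table.  
This is a simplification and not the real ContainerImpl state machine: the 
class KillRaceSketch, its tolerateLateKill flag, and the reduced set of states 
and events are all assumptions made for illustration.

```java
import java.util.EnumMap;
import java.util.Map;

// Toy model of the scenario above; not the real ContainerImpl state machine.
final class KillRaceSketch {
    enum State { SCHEDULED, KILLING, EXITED_WITH_SUCCESS, DONE }

    enum Event {
        KILL_CONTAINER, CONTAINER_LAUNCHED, CONTAINER_EXITED_WITH_SUCCESS,
        CONTAINER_KILLED_ON_REQUEST, CONTAINER_RESOURCES_CLEANEDUP
    }

    private final Map<State, Map<Event, State>> table = new EnumMap<>(State.class);
    private State current = State.SCHEDULED;

    KillRaceSketch(boolean tolerateLateKill) {
        add(State.SCHEDULED, Event.KILL_CONTAINER, State.KILLING);
        // LAUNCHED arriving while KILLING is ignored (self-transition).
        add(State.KILLING, Event.CONTAINER_LAUNCHED, State.KILLING);
        add(State.KILLING, Event.CONTAINER_EXITED_WITH_SUCCESS, State.EXITED_WITH_SUCCESS);
        add(State.EXITED_WITH_SUCCESS, Event.CONTAINER_RESOURCES_CLEANEDUP, State.DONE);
        if (tolerateLateKill) {
            // Proposed no-op transitions: a late kill acknowledgement changes nothing.
            add(State.EXITED_WITH_SUCCESS, Event.CONTAINER_KILLED_ON_REQUEST,
                State.EXITED_WITH_SUCCESS);
            add(State.DONE, Event.CONTAINER_KILLED_ON_REQUEST, State.DONE);
        }
    }

    private void add(State from, Event on, State to) {
        table.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class)).put(on, to);
    }

    // Applies one event, rejecting any (state, event) pair missing from the table.
    State handle(Event event) {
        Map<Event, State> row = table.get(current);
        State next = (row == null) ? null : row.get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event: " + event + " at " + current);
        }
        current = next;
        return current;
    }

    // Replays the events the container receives in the scenario, in order.
    static State replay(boolean tolerateLateKill) {
        KillRaceSketch container = new KillRaceSketch(tolerateLateKill);
        container.handle(Event.KILL_CONTAINER);
        container.handle(Event.CONTAINER_LAUNCHED);
        container.handle(Event.CONTAINER_EXITED_WITH_SUCCESS);
        return container.handle(Event.CONTAINER_KILLED_ON_REQUEST);
    }

    public static void main(String[] args) {
        System.out.println("with no-op transitions: " + replay(true));
        try {
            replay(false);
        } catch (IllegalStateException e) {
            System.out.println("without them: " + e.getMessage());
        }
    }
}
```

In this model the replay fails on the last event unless the no-op transitions 
are present, in which case the container simply stays in EXITED_WITH_SUCCESS.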

The event field of LocalizationCleanupMatcher is not used and is not needed.  
Checking for an instance of ContainerLocalizationCleanupEvent means we are also 
verifying that the event type is CLEANUP_CONTAINER_RESOURCES.  Alternatively, 
LocalizationCleanupMatcher could check for an instance of 
ContainerLocalizationEvent and verify that the event type is 
CLEANUP_CONTAINER_RESOURCES.
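
The two alternatives can be sketched as follows, with plain predicates and 
stand-in event classes instead of the patch's Mockito matcher.  Everything here 
is hypothetical scaffolding (CleanupMatcherSketch, bySubclass, byEventType, and 
the reduced event type enum); it only assumes, as described above, that the 
cleanup event subclass fixes its own event type.

```java
import java.util.function.Predicate;

// Stand-ins for the localization event classes; not the real YARN types.
final class CleanupMatcherSketch {
    enum LocalizationEventType { LOCALIZE_CONTAINER_RESOURCES, CLEANUP_CONTAINER_RESOURCES }

    static class ContainerLocalizationEvent {
        private final LocalizationEventType type;

        ContainerLocalizationEvent(LocalizationEventType type) {
            this.type = type;
        }

        LocalizationEventType getType() {
            return type;
        }
    }

    // The subclass pins the event type, so an instanceof check implies the type.
    static final class ContainerLocalizationCleanupEvent extends ContainerLocalizationEvent {
        ContainerLocalizationCleanupEvent() {
            super(LocalizationEventType.CLEANUP_CONTAINER_RESOURCES);
        }
    }

    // Option 1: match on the subclass alone; no stored event field is needed.
    static Predicate<Object> bySubclass() {
        return o -> o instanceof ContainerLocalizationCleanupEvent;
    }

    // Option 2: match on the base class and verify the event type explicitly.
    static Predicate<Object> byEventType() {
        return o -> o instanceof ContainerLocalizationEvent
            && ((ContainerLocalizationEvent) o).getType()
                == LocalizationEventType.CLEANUP_CONTAINER_RESOURCES;
    }

    public static void main(String[] args) {
        Object cleanup = new ContainerLocalizationCleanupEvent();
        Object localize =
            new ContainerLocalizationEvent(LocalizationEventType.LOCALIZE_CONTAINER_RESOURCES);
        System.out.println(bySubclass().test(cleanup) + " " + bySubclass().test(localize));
        System.out.println(byEventType().test(cleanup) + " " + byEventType().test(localize));
    }
}
```

Either predicate is stateless, which is why the matcher does not need to hold 
on to an event instance.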


> Race condition in NM container launched after done
> --------------------------------------------------
>
>                 Key: YARN-8331
>                 URL: https://issues.apache.org/jira/browse/YARN-8331
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 3.1.0, 2.9.1, 3.0.2
>            Reporter: Yang Wang
>            Assignee: Pradeep Ambati
>            Priority: Major
>         Attachments: YARN-8331.001.patch
>
>
> When a container is launching in ContainerLaunch#launchContainer, its state 
> is SCHEDULED. If a kill event is sent to the container at that point, its 
> state goes SCHEDULED->KILLING->DONE.
>  ContainerLaunch then sends the CONTAINER_LAUNCHED event and starts the 
> container processes. These orphaned container processes are never cleaned up.
>  
> {code:java}
> 2018-05-21 13:11:56,114 INFO  [Thread-11] nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody OPERATION=Start Container Request       TARGET=ContainerManageImpl      RESULT=SUCCESS  APPID=application_0_0000        CONTAINERID=container_0_0000_01_000000
> 2018-05-21 13:11:56,114 INFO  [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_0_0000 transitioned from NEW to INITING
> 2018-05-21 13:11:56,114 INFO  [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:transition(446)) - Adding container_0_0000_01_000000 to application application_0_0000
> 2018-05-21 13:11:56,118 INFO  [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application application_0_0000 transitioned from INITING to RUNNING
> 2018-05-21 13:11:56,119 INFO  [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0_0000_01_000000 transitioned from NEW to SCHEDULED
> 2018-05-21 13:11:56,119 INFO  [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(220)) - Got event CONTAINER_INIT for appId application_0_0000
> 2018-05-21 13:11:56,119 INFO  [NM ContainerManager dispatcher] scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(504)) - Starting container [container_0_0000_01_000000]
> 2018-05-21 13:11:56,226 INFO  [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0_0000_01_000000 transitioned from SCHEDULED to KILLING
> 2018-05-21 13:11:56,227 INFO  [NM ContainerManager dispatcher] containermanager.TestContainerManager (BaseContainerManagerTest.java:delete(287)) - Psuedo delete: user - nobody, type - FILE
> 2018-05-21 13:11:56,227 INFO  [NM ContainerManager dispatcher] nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody    OPERATION=Container Finished - Killed   TARGET=ContainerImpl    RESULT=SUCCESS  APPID=application_0_0000        CONTAINERID=container_0_0000_01_000000
> 2018-05-21 13:11:56,238 INFO  [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container container_0_0000_01_000000 transitioned from KILLING to DONE
> 2018-05-21 13:11:56,238 INFO  [NM ContainerManager dispatcher] application.ApplicationImpl (ApplicationImpl.java:transition(489)) - Removing container_0_0000_01_000000 from application application_0_0000
> 2018-05-21 13:11:56,239 INFO  [NM ContainerManager dispatcher] monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:onStopMonitoringContainer(932)) - Stopping resource-monitoring for container_0_0000_01_000000
> 2018-05-21 13:11:56,239 INFO  [NM ContainerManager dispatcher] containermanager.AuxServices (AuxServices.java:handle(220)) - Got event CONTAINER_STOP for appId application_0_0000
> 2018-05-21 13:11:56,274 WARN  [NM ContainerManager dispatcher] container.ContainerImpl (ContainerImpl.java:handle(2106)) - Can't handle this event at current state: Current: [DONE], eventType: [CONTAINER_LAUNCHED], container: [container_0_0000_01_000000]
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event: CONTAINER_LAUNCHED at DONE
>       at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
>       at org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
>       at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2104)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:104)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1525)
>       at org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1518)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
>       at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
>       at java.lang.Thread.run(Thread.java:748)
> {code}


