[ https://issues.apache.org/jira/browse/YARN-8331?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16573432#comment-16573432 ]
Jason Lowe commented on YARN-8331:
----------------------------------
Thanks for the patch, [~pradeepambati]!
In ContainersLauncher, the exit code of a container that never launched should
not be SUCCESS. It should be handled the same way
ContainerLaunch#validateContainerState handles a container that is killed
before it is launched, and the error message should match as well since it's
the same type of scenario.
Now that ContainersLauncher sends the killed-on-request event even when it
doesn't think the container is being launched or running at all, do we need to
add event handling for CONTAINER_KILLED_ON_REQUEST to more of the states that
follow KILLING? I think it is now possible for ContainerImpl to receive more
than one of these events, which the state machine does not expect. For
example, consider the following scenario:
# Container is in the SCHEDULED state, waiting for the LAUNCHED event.
# Container receives a KILL event and sends a kill request to the launcher.
# In the meantime, the container proceeds to launch a very quick process and
LAUNCHED is sent to the container. The event is ignored because the container
is in the KILLING state.
# The process runs extremely fast, so the launcher sends a
CONTAINER_EXITED_WITH_SUCCESS event to the container and clears the container
from ContainersLauncher.
# Container receives the exit success event in the KILLING state and moves to
the EXITED_WITH_SUCCESS state.
# Launcher now finally receives the request to kill the container, but the
container is no longer being tracked. KILLED_ON_REQUEST is sent to the
container as part of this patch.
# Container receives the KILLED_ON_REQUEST event in the EXITED_WITH_SUCCESS
state, which is an invalid transition. Similarly, if the container had already
progressed to the DONE state, that would also be an invalid state transition.
I think we need to add no-op transitions for KILLED_ON_REQUEST in the states
that can occur after KILLING, to cover these race conditions.
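To make the suggestion concrete, here is a minimal, self-contained sketch of the idea. It is a toy table-driven state machine, not the real ContainerImpl or StateMachineFactory: the class name ContainerStateMachineSketch, the add and handle helpers, and the reduced set of states and events are all assumptions for illustration. It only shows that registering a self-transition for CONTAINER_KILLED_ON_REQUEST in the states after KILLING lets a late kill confirmation be absorbed instead of being rejected.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

/** Toy model only: a hypothetical stand-in, not the NodeManager's ContainerImpl. */
class ContainerStateMachineSketch {
  // Reduced, illustrative subsets of the container states and events.
  enum State { SCHEDULED, KILLING, EXITED_WITH_SUCCESS,
               CONTAINER_CLEANEDUP_AFTER_KILL, DONE }
  enum Event { KILL_CONTAINER, CONTAINER_LAUNCHED, CONTAINER_EXITED_WITH_SUCCESS,
               CONTAINER_KILLED_ON_REQUEST, CONTAINER_RESOURCES_CLEANEDUP }

  private static final Map<State, Map<Event, State>> TABLE =
      new EnumMap<State, Map<Event, State>>(State.class);

  private static void add(State from, Event on, State to) {
    TABLE.computeIfAbsent(from, s -> new EnumMap<Event, State>(Event.class))
        .put(on, to);
  }

  static {
    add(State.SCHEDULED, Event.KILL_CONTAINER, State.KILLING);
    // A late LAUNCHED is ignored while the container is being killed.
    add(State.KILLING, Event.CONTAINER_LAUNCHED, State.KILLING);
    add(State.KILLING, Event.CONTAINER_EXITED_WITH_SUCCESS,
        State.EXITED_WITH_SUCCESS);
    add(State.KILLING, Event.CONTAINER_KILLED_ON_REQUEST,
        State.CONTAINER_CLEANEDUP_AFTER_KILL);
    add(State.EXITED_WITH_SUCCESS, Event.CONTAINER_RESOURCES_CLEANEDUP,
        State.DONE);
    add(State.CONTAINER_CLEANEDUP_AFTER_KILL, Event.CONTAINER_RESOURCES_CLEANEDUP,
        State.DONE);

    // The proposed no-ops: a late or duplicate kill confirmation leaves the
    // state unchanged in every state that can follow KILLING.
    for (State s : EnumSet.of(State.EXITED_WITH_SUCCESS,
        State.CONTAINER_CLEANEDUP_AFTER_KILL, State.DONE)) {
      add(s, Event.CONTAINER_KILLED_ON_REQUEST, s);
    }
  }

  private State state = State.SCHEDULED;

  /** Applies one event; throws if the table has no entry for it. */
  State handle(Event event) {
    Map<Event, State> row = TABLE.get(state);
    State next = (row == null) ? null : row.get(event);
    if (next == null) {
      throw new IllegalStateException(
          "Invalid event: " + event + " at " + state);
    }
    state = next;
    return state;
  }

  public static void main(String[] args) {
    // Replays the race described in the scenario above.
    ContainerStateMachineSketch c = new ContainerStateMachineSketch();
    c.handle(Event.KILL_CONTAINER);
    c.handle(Event.CONTAINER_LAUNCHED);
    c.handle(Event.CONTAINER_EXITED_WITH_SUCCESS);
    System.out.println(c.handle(Event.CONTAINER_KILLED_ON_REQUEST));
  }
}
```

In the actual patch the equivalent change would presumably be extra transition registrations in ContainerImpl's state machine; which states need them is for the patch author to confirm against the real transition table.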
The event field of LocalizationCleanupMatcher is not used and is not needed.
Checking for an instance of ContainerLocalizationCleanupEvent means we are also
verifying the event type is CLEANUP_CONTAINER_RESOURCES. Alternatively,
LocalizationCleanupMatcher could check for instanceof
ContainerLocalizationEvent and verify the event type is
CLEANUP_CONTAINER_RESOURCES.
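The two matcher options can be sketched as follows. This is a self-contained illustration that uses java.util.function.Predicate in place of the test's mock-framework matcher, and the event classes here are simplified stand-ins that only borrow the names from the comment; the class LocalizationMatcherSketch, the OTHER enum value, and the BY_CLASS and BY_TYPE fields are hypothetical. Neither variant needs a stored event field.

```java
import java.util.function.Predicate;

/** Stand-in types only; the real event classes live in the NodeManager code. */
class LocalizationMatcherSketch {
  // OTHER is a placeholder for every event type that is not relevant here.
  enum LocalizationEventType { OTHER, CLEANUP_CONTAINER_RESOURCES }

  static class ContainerLocalizationEvent {
    private final LocalizationEventType type;

    ContainerLocalizationEvent(LocalizationEventType type) {
      this.type = type;
    }

    LocalizationEventType getType() {
      return type;
    }
  }

  /** The subclass fixes the event type, so its class alone identifies it. */
  static class ContainerLocalizationCleanupEvent
      extends ContainerLocalizationEvent {
    ContainerLocalizationCleanupEvent() {
      super(LocalizationEventType.CLEANUP_CONTAINER_RESOURCES);
    }
  }

  // Option 1: match on the concrete class; the event type is implied.
  static final Predicate<Object> BY_CLASS =
      e -> e instanceof ContainerLocalizationCleanupEvent;

  // Option 2: match on the base class and check the event type explicitly.
  static final Predicate<Object> BY_TYPE =
      e -> e instanceof ContainerLocalizationEvent
          && ((ContainerLocalizationEvent) e).getType()
              == LocalizationEventType.CLEANUP_CONTAINER_RESOURCES;

  public static void main(String[] args) {
    Object cleanup = new ContainerLocalizationCleanupEvent();
    Object other = new ContainerLocalizationEvent(LocalizationEventType.OTHER);
    System.out.println(BY_CLASS.test(cleanup) + " " + BY_TYPE.test(cleanup));
    System.out.println(BY_CLASS.test(other) + " " + BY_TYPE.test(other));
  }
}
```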
> Race condition in NM container launched after done
> --------------------------------------------------
>
> Key: YARN-8331
> URL: https://issues.apache.org/jira/browse/YARN-8331
> Project: Hadoop YARN
> Issue Type: Bug
> Affects Versions: 3.1.0, 2.9.1, 3.0.2
> Reporter: Yang Wang
> Assignee: Pradeep Ambati
> Priority: Major
> Attachments: YARN-8331.001.patch
>
>
> When a container is launching in ContainerLaunch#launchContainer, its state
> is SCHEDULED. If a kill event is sent to the container at this point, the
> state goes SCHEDULED->KILLING->DONE. ContainerLaunch then sends the
> CONTAINER_LAUNCHED event and starts the container processes. These container
> processes are no longer tracked and will never be cleaned up.
>
> {code:java}
> 2018-05-21 13:11:56,114 INFO [Thread-11] nodemanager.NMAuditLogger
> (NMAuditLogger.java:logSuccess(94)) - USER=nobody OPERATION=Start Container
> Request TARGET=ContainerManageImpl RESULT=SUCCESS
> APPID=application_0_0000 CONTAINERID=container_0_0000_01_000000
> 2018-05-21 13:11:56,114 INFO [NM ContainerManager dispatcher]
> application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application
> application_0_0000 transitioned from NEW to INITING
> 2018-05-21 13:11:56,114 INFO [NM ContainerManager dispatcher]
> application.ApplicationImpl (ApplicationImpl.java:transition(446)) - Adding
> container_0_0000_01_000000 to application application_0_0000
> 2018-05-21 13:11:56,118 INFO [NM ContainerManager dispatcher]
> application.ApplicationImpl (ApplicationImpl.java:handle(632)) - Application
> application_0_0000 transitioned from INITING to RUNNING
> 2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher]
> container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container
> container_0_0000_01_000000 transitioned from NEW to SCHEDULED
> 2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher]
> containermanager.AuxServices (AuxServices.java:handle(220)) - Got event
> CONTAINER_INIT for appId application_0_0000
> 2018-05-21 13:11:56,119 INFO [NM ContainerManager dispatcher]
> scheduler.ContainerScheduler (ContainerScheduler.java:startContainer(504)) -
> Starting container [container_0_0000_01_000000]
> 2018-05-21 13:11:56,226 INFO [NM ContainerManager dispatcher]
> container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container
> container_0_0000_01_000000 transitioned from SCHEDULED to KILLING
> 2018-05-21 13:11:56,227 INFO [NM ContainerManager dispatcher]
> containermanager.TestContainerManager
> (BaseContainerManagerTest.java:delete(287)) - Psuedo delete: user - nobody,
> type - FILE
> 2018-05-21 13:11:56,227 INFO [NM ContainerManager dispatcher]
> nodemanager.NMAuditLogger (NMAuditLogger.java:logSuccess(94)) - USER=nobody
> OPERATION=Container Finished - Killed TARGET=ContainerImpl
> RESULT=SUCCESS APPID=application_0_0000
> CONTAINERID=container_0_0000_01_000000
> 2018-05-21 13:11:56,238 INFO [NM ContainerManager dispatcher]
> container.ContainerImpl (ContainerImpl.java:handle(2111)) - Container
> container_0_0000_01_000000 transitioned from KILLING to DONE
> 2018-05-21 13:11:56,238 INFO [NM ContainerManager dispatcher]
> application.ApplicationImpl (ApplicationImpl.java:transition(489)) - Removing
> container_0_0000_01_000000 from application application_0_0000
> 2018-05-21 13:11:56,239 INFO [NM ContainerManager dispatcher]
> monitor.ContainersMonitorImpl
> (ContainersMonitorImpl.java:onStopMonitoringContainer(932)) - Stopping
> resource-monitoring for container_0_0000_01_000000
> 2018-05-21 13:11:56,239 INFO [NM ContainerManager dispatcher]
> containermanager.AuxServices (AuxServices.java:handle(220)) - Got event
> CONTAINER_STOP for appId application_0_0000
> 2018-05-21 13:11:56,274 WARN [NM ContainerManager dispatcher]
> container.ContainerImpl (ContainerImpl.java:handle(2106)) - Can't handle this
> event at current state: Current: [DONE], eventType: [CONTAINER_LAUNCHED],
> container: [container_0_0000_01_000000]
> org.apache.hadoop.yarn.state.InvalidStateTransitionException: Invalid event:
> CONTAINER_LAUNCHED at DONE
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory.access$500(StateMachineFactory.java:46)
> at
> org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:487)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:2104)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl.handle(ContainerImpl.java:104)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1525)
> at
> org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher.handle(ContainerManagerImpl.java:1518)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:197)
> at
> org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:126)
> at java.lang.Thread.run(Thread.java:748)
> {code}