[
https://issues.apache.org/jira/browse/YARN-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16570774#comment-16570774
]
Chandni Singh commented on YARN-8160:
-------------------------------------
Attached are the logs of ctr005 that fails to re-initialize. When it is
re-initialized, the container is stopped and cleaned up. This causes the
container to exit, but here it exits with code {{255}} instead of
{{FORCE_KILLED}} or {{TERMINATED}}.
Because the container exits with a failure code ({{255}}), the state of
the container in the NM changes from {{REINITIALIZING_AWAITING_KILL}} to
{{EXITED_WITH_FAILURE}}.
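To make the exit-code distinction concrete, here is a minimal sketch in Python of how an exit code could be classified. It is a simplified model, not the NodeManager implementation: the function name {{classify_exit}} is hypothetical, and the numeric values 137 and 143 for {{FORCE_KILLED}} and {{TERMINATED}} are what I understand {{ContainerExecutor.ExitCode}} to define (128 + SIGKILL and 128 + SIGTERM).

```python
# Simplified model of exit-code classification; NOT the actual NM code.
# Assumption: FORCE_KILLED = 137 and TERMINATED = 143, as I understand
# ContainerExecutor.ExitCode to define them.
FORCE_KILLED = 137  # 128 + SIGKILL (9)
TERMINATED = 143    # 128 + SIGTERM (15)


def classify_exit(exit_code: int) -> str:
    """Return the kind of container event an exit code would map to."""
    if exit_code == 0:
        return "CONTAINER_EXITED_WITH_SUCCESS"
    if exit_code in (FORCE_KILLED, TERMINATED):
        # The kill was requested (for example by a reinit), so it is
        # not treated as a failure.
        return "CONTAINER_KILLED_ON_REQUEST"
    # Any other non-zero code is reported as a failure.
    return "CONTAINER_EXITED_WITH_FAILURE"


if __name__ == "__main__":
    for code in (143, 137, 255):
        print(code, classify_exit(code))
```

Under this model, {{255}} falls through to the failure branch, which matches the move to {{EXITED_WITH_FAILURE}} described above.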
Below are the relevant log statements:
1. Reinit of the container is triggered
{code}
ctr005.log:2018-08-02 22:30:41,100 DEBUG container.ContainerImpl
(ContainerImpl.java:handle(2080)) - Processing
container_e02_1533231998644_0009_01_000003 of type REINITIALIZE_CONTAINER
ctr005.log:2018-08-02 22:30:41,101 INFO container.ContainerImpl
(ContainerImpl.java:handle(2093)) - Container
container_e02_1533231998644_0009_01_000003 transitioned from RUNNING to
REINITIALIZING_AWAITING_KIL
{code}
2. Reinit triggers cleanup of the container
{code}
ctr005.log:2018-08-02 22:30:41,102 INFO launcher.ContainerLaunch
(ContainerLaunch.java:cleanupContainer(734)) - Cleaning up container
container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,102 DEBUG recovery.NMLeveldbStateStoreService
(NMLeveldbStateStoreService.java:storeContainerKilled(555)) -
storeContainerKilled: containerId=container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch
(ContainerLaunch.java:cleanupContainer(752)) - Marking container
container_e02_1533231998644_0009_01_000003 as inactive
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch
(ContainerLaunch.java:cleanupContainer(759)) - Getting pid for container
container_e02_1533231998644_0009_01_000003 to kill from pid file
/tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_e02_1533231998644_0009_01_000003.pid
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch
(ContainerLaunch.java:getContainerPid(1084)) - Accessing pid for container
container_e02_1533231998644_0009_01_000003 from pid file
/tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_e02_1533231998644_0009_01_000003.pid
ctr005.log:2018-08-02 22:30:41,102 DEBUG util.ProcessIdFileReader
(ProcessIdFileReader.java:getProcessId(53)) - Accessing pid from pid file
/tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_e02_1533231998644_0009_01_000003.pid
ctr005.log:2018-08-02 22:30:41,102 DEBUG util.ProcessIdFileReader
(ProcessIdFileReader.java:getProcessId(103)) - Got pid 364708 from path
/tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_e02_1533231998644_0009_01_000003.pid
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch
(ContainerLaunch.java:getContainerPid(1096)) - Got pid 364708 for container
container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,102 DEBUG launcher.ContainerLaunch
(ContainerLaunch.java:signalProcess(919)) - Sending signal to pid 364708 as
user root for container container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,102 DEBUG docker.DockerCommandExecutor
(DockerCommandExecutor.java:executeDockerCommand(89)) - Running docker command:
inspect docker-command=inspect format=\{{.State.Status}}
name=container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,103 DEBUG privileged.PrivilegedOperationExecutor
(PrivilegedOperationExecutor.java:getPrivilegedOperationExecutionCommand(119))
- Privileged Execution Command Array:
[/hadoop_dist/hadoop-yarn/bin/container-executor, --inspect-docker-container,
--format=\{{.State.Status}}, container_e02_1533231998644_0009_01_000003]
ctr005.log:2018-08-02 22:30:41,129 DEBUG privileged.PrivilegedOperationExecutor
(PrivilegedOperationExecutor.java:executePrivilegedOperation(155)) -
[/hadoop_dist/hadoop-yarn/bin/container-executor, --inspect-docker-container,
--format=\{{.State.Status}}, container_e02_1533231998644_0009_01_000003]
ctr005.log:2018-08-02 22:30:41,130 DEBUG docker.DockerCommandExecutor
(DockerCommandExecutor.java:getContainerStatus(154)) - Container Status:
running ContainerId: container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:41,131 DEBUG docker.DockerCommandExecutor
(DockerCommandExecutor.java:executeDockerCommand(89)) - Running docker command:
stop docker-command=stop name=container_e02_1533231998644_0009_01_000003
{code}
3. After 10 seconds, the stop command sent to the executor completes and the
container is removed
{code}
ctr005.log:2018-08-02 22:30:51,251 DEBUG privileged.PrivilegedOperationExecutor
(PrivilegedOperationExecutor.java:executePrivilegedOperation(155)) -
[/hadoop_dist/hadoop-yarn/bin/container-executor, --run-docker,
/tmp/hadoop/yarn/local/nmPrivate/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/docker.container_e02_1533231998644_0009_01_0000038521705952835205058.cmd]
ctr005.log:2018-08-02 22:30:51,251 DEBUG privileged.PrivilegedOperationExecutor
(PrivilegedOperationExecutor.java:executePrivilegedOperation(157)) -
container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:51,251 DEBUG launcher.ContainerLaunch
(ContainerLaunch.java:signalProcess(927)) - Sent signal SIGTERM to pid 364708
as user root for container container_e02_1533231998644_0009_01_000003,
result=success
ctr005.log:2018-08-02 22:30:51,298 DEBUG docker.DockerCommandExecutor
(DockerCommandExecutor.java:executeDockerCommand(89)) - Running docker command:
rm docker-command=rm name=container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:51,298 DEBUG privileged.PrivilegedOperationExecutor
(PrivilegedOperationExecutor.java:getPrivilegedOperationExecutionCommand(119))
- Privileged Execution Command Array:
[/hadoop_dist/hadoop-yarn/bin/container-executor, --remove-docker-container,
container_e02_1533231998644_0009_01_000003]
ctr005.log:2018-08-02 22:30:51,977 DEBUG nodemanager.LinuxContainerExecutor
(LinuxContainerExecutor.java:postComplete(963)) -
container_e02_1533231998644_0009_01_000003 post complete
ctr005.log:2018-08-02 22:30:51,977 DEBUG resources.CGroupsHandlerImpl
(CGroupsHandlerImpl.java:deleteCGroup(535)) - deleteCGroup:
/sys/fs/cgroup/cpu/hadoop-yarn-tmp-ctr-e138-1518143905142-423707-01-000002.localhost/container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:51,997 DEBUG launcher.ContainerLaunch
(ContainerLaunch.java:cleanupContainerFiles(1876)) - cleanup container
/tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003
files
ctr005.log:2018-08-02 22:30:51,998 INFO nodemanager.LinuxContainerExecutor
(LinuxContainerExecutor.java:deleteAsUser(815)) - Deleting absolute path :
/tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/launch_container.sh
ctr005.log:2018-08-02 22:30:51,998 DEBUG privileged.PrivilegedOperationExecutor
(PrivilegedOperationExecutor.java:getPrivilegedOperationExecutionCommand(119))
- Privileged Execution Command Array:
[/hadoop_dist/hadoop-yarn/bin/container-executor, nobody, root, 3,
/tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/launch_container.sh]
ctr005.log:2018-08-02 22:30:52,006 DEBUG privileged.PrivilegedOperationExecutor
(PrivilegedOperationExecutor.java:executePrivilegedOperation(155)) -
[/hadoop_dist/hadoop-yarn/bin/container-executor, nobody, root, 3,
/tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/launch_container.sh]
ctr005.log:2018-08-02 22:30:52,006 INFO nodemanager.LinuxContainerExecutor
(LinuxContainerExecutor.java:deleteAsUser(815)) - Deleting absolute path :
/tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_tokens
ctr005.log:2018-08-02 22:30:52,006 DEBUG privileged.PrivilegedOperationExecutor
(PrivilegedOperationExecutor.java:getPrivilegedOperationExecutionCommand(119))
- Privileged Execution Command Array:
[/hadoop_dist/hadoop-yarn/bin/container-executor, nobody, root, 3,
/tmp/hadoop/yarn/local/usercache/root/appcache/application_1533231998644_0009/container_e02_1533231998644_0009_01_000003/container_tokens]
{code}
4. Meanwhile, the container exits with exit code 255
{code}
ctr005.log:2018-08-02 22:30:52,040 WARN nodemanager.LinuxContainerExecutor
(LinuxContainerExecutor.java:handleExitCode(585)) - Exit code from container
container_e02_1533231998644_0009_01_000003 is : 255
ctr005.log:2018-08-02 22:30:52,040 WARN nodemanager.LinuxContainerExecutor
(LinuxContainerExecutor.java:handleExitCode(591)) - Exception from
container-launch with container ID: container_e02_1533231998644_0009_01_000003
and exit code: 255
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor
(ContainerExecutor.java:logOutput(541)) - Container id:
container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor
(ContainerExecutor.java:logOutput(541)) - Shell error output: Error: No such
object: container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor
(ContainerExecutor.java:logOutput(541)) - Could not inspect docker to get pid
/usr/bin/docker inspect --format \{{.State.Pid}}
container_e02_1533231998644_0009_01_000003.
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor
(ContainerExecutor.java:logOutput(541)) - Error: No such object:
container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor
(ContainerExecutor.java:logOutput(541)) - Could not inspect docker to get pid
/usr/bin/docker inspect --format \{{.State.Pid}}
container_e02_1533231998644_0009_01_000003.
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor
(ContainerExecutor.java:logOutput(541)) - Error: No such object:
container_e02_1533231998644_0009_01_000003
ctr005.log:2018-08-02 22:30:52,041 INFO nodemanager.ContainerExecutor
(ContainerExecutor.java:logOutput(541)) - Could not inspect docker to get
exitcode: /usr/bin/docker inspect --format \{{.State.ExitCode}}
container_e02_1533231998644_0009_01_000003.
{code}
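The timing of steps 2 to 4 can be checked with a small Python sketch. The timestamps are copied from the log lines above; the event labels and the helper name {{parse_ts}} are mine, and reading the sequence as "inspect ran after the container was already removed" is one interpretation of the logs, not something they state directly.

```python
from datetime import datetime

# Timestamps copied from the ctr005 log excerpts; labels are descriptive only.
EVENTS = [
    ("2018-08-02 22:30:41,131", "docker stop issued"),
    ("2018-08-02 22:30:51,251", "stop command completes"),
    ("2018-08-02 22:30:51,298", "docker rm issued"),
    ("2018-08-02 22:30:52,040", "exit code 255 reported, inspect fails"),
]


def parse_ts(ts: str) -> datetime:
    """Parse a log4j-style timestamp with comma-separated milliseconds."""
    return datetime.strptime(ts, "%Y-%m-%d %H:%M:%S,%f")


times = [parse_ts(ts) for ts, _ in EVENTS]

# How long the stop took, and how long after the rm the failure appeared.
stop_duration = (times[1] - times[0]).total_seconds()
rm_to_failure = (times[3] - times[2]).total_seconds()

if __name__ == "__main__":
    print(f"stop took {stop_duration:.3f}s")
    print(f"failure reported {rm_to_failure:.3f}s after rm was issued")
```

The stop takes about 10 seconds, which is consistent with the default grace period of {{docker stop}}, although the logs do not show which timeout was in effect. The {{No such object}} errors appear less than a second after the {{rm}} is issued, which would fit the inspect calls running against a container that had already been removed.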
> Yarn Service Upgrade: Support upgrade of service that use docker containers
> ----------------------------------------------------------------------------
>
> Key: YARN-8160
> URL: https://issues.apache.org/jira/browse/YARN-8160
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Chandni Singh
> Assignee: Chandni Singh
> Priority: Major
> Labels: Docker
>
> Ability to upgrade dockerized yarn native services.
> Ref: YARN-5637
> *Background*
> Container upgrade is supported by the NM via {{reInitializeContainer}} api.
> {{reInitializeContainer}} does *NOT* change the ContainerId of the upgraded
> container.
> NM performs the following steps during {{reInitializeContainer}}:
> - kills the existing process
> - cleans up the container
> - launches another container with the new {{ContainerLaunchContext}}
> NOTE: {{ContainerLaunchContext}} holds all the information needed to
> upgrade the container.
> With {{reInitializeContainer}}, the following does *NOT* change
> - container ID. This is not created by NM. It is provided to it and here RM
> is not creating another container allocation.
> - {{localizedResources}} this stays the same if the upgrade does *NOT*
> require additional resources IIUC.
>
> The following changes with {{reInitializeContainer}}
> - the working directory of the upgraded container changes. It is *NOT* a
> relaunch.
> *Changes required in the case of docker container*
> - {{reInitializeContainer}} seems to not be working with Docker containers.
> Investigate and fix this.
> - [Future change] Add an additional api to NM to pull the images and modify
> {{reInitializeContainer}} to trigger docker container launch without pulling
> the image first, which could be based on a flag.
> -- When the service upgrade is initialized, we can provide the user with
> an option to just pull the images on the NMs.
> -- When a component instance is upgraded, it calls
> {{reInitializeContainer}} with the flag pull-image set to false, since the NM
> will have already pulled the images.