[ 
https://issues.apache.org/jira/browse/YARN-8160?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572593#comment-16572593
 ] 

Chandni Singh commented on YARN-8160:
-------------------------------------

{quote}
 Exit code 255 is coming from docker inspect 
container_e02_1533231998644_0009_01_000003. There looks like a race condition 
where ContainerLaunch thread has issued the termination on docker container 
pid. LinuxContainerExecutor still has a independent child process that is 
checking the liveness of the docker container.
{quote}
[~eyang], the container exit code comes from the below stmt in 
{{ContainerLaunch.call()}}
{code}
ret = launchContainer(new ContainerStartContext.Builder()
          .setContainer(container)
          .setLocalizedResources(localResources)
          .setNmPrivateContainerScriptPath(nmPrivateContainerScriptPath)
          .setNmPrivateTokensPath(nmPrivateTokensPath)
          .setUser(user)
          .setAppId(appIdStr)
          .setContainerWorkDir(containerWorkDir)
          .setLocalDirs(localDirs)
          .setLogDirs(logDirs)
          .setFilecacheDirs(filecacheDirs)
          .setUserLocalDirs(userLocalDirs)
          .setContainerLocalDirs(containerLocalDirs)
          .setContainerLogDirs(containerLogDirs)
          .setUserFilecacheDirs(userFilecacheDirs)
          .setApplicationLocalDirs(applicationLocalDirs).build());
{code}

The docker inspect of the container that has been stopped and cleaned would 
just tell the container is not alive. How does it affect the container's exit 
code? I cannot find this in the code. Could you please point me to it?

I still think, below are the only 2 solutions for this:
1. In node manager, if a container is in REINITIALIZING_AWAITING_KILL and gets 
a CONTAINER_EXITED_WITH_FAILURE event, then it should handle it in the similar 
way as it currently handle the CONTAINER_KILLED_ON_REQUEST.

2. cleanup of container files is not performed until the container exits


> Yarn Service Upgrade: Support upgrade of service that use docker containers 
> ----------------------------------------------------------------------------
>
>                 Key: YARN-8160
>                 URL: https://issues.apache.org/jira/browse/YARN-8160
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Chandni Singh
>            Assignee: Chandni Singh
>            Priority: Major
>              Labels: Docker
>         Attachments: container_e02_1533231998644_0009_01_000003.nm.log
>
>
> Ability to upgrade dockerized  yarn native services.
> Ref: YARN-5637
> *Background*
> Container upgrade is supported by the NM via {{reInitializeContainer}} api. 
> {{reInitializeContainer}} does *NOT* change the ContainerId of the upgraded 
> container.
> NM performs the following steps during {{reInitializeContainer}}:
> - kills the existing process
> - cleans up the container
> - launches another container with the new {{ContainerLaunchContext}}
> NOTE: {{ContainerLaunchContext}} holds all the information that needs to 
> upgrade the container.
> With {{reInitializeContainer}}, the following does *NOT* change
> - container ID. This is not created by NM. It is provided to it and here RM 
> is not creating another container allocation.
> - {{localizedResources}} this stays the same if the upgrade does *NOT* 
> require additional resources IIUC.
>  
> The following changes with {{reInitializeContainer}}
> - the working directory of the upgraded container changes. It is *NOT* a 
> relaunch. 
> *Changes required in the case of docker container*
> - {{reInitializeContainer}} seems to not be working with Docker containers. 
> Investigate and fix this.
> - [Future change] Add an additional api to NM to pull the images and modify 
> {{reInitializeContainer}} to trigger docker container launch without pulling 
> the image first which could be based on a flag.
>     -- When the service upgrade is initialized, we can provide the user with 
> an option to just pull the images  on the NMs.
>     -- When a component instance is upgrade, it calls the 
> {{reInitializeContainer}} with the flag pull-image set to false, since the NM 
> will have already pulled the images.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to