[ 
https://issues.apache.org/jira/browse/YARN-5366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16254598#comment-16254598
 ] 

Eric Yang commented on YARN-5366:
---------------------------------

{{--rm}} will remove the container when docker is restarted.  If a system admin 
has to upgrade docker and this accidentally deletes an end user's application, 
the consequences would be severe.  There are gaps between the YARN mode of 
operation and the docker mode of operation.  Let's see if we can support the 
additional states in a YARN application.  This can help guide us in mapping 
YARN operations correctly to docker commands.

# Application submitted - Metadata about the existence of the application is 
persisted.
# Application in queue - Application is waiting for available resources.
# Application launched - Containers are initialized and started.
# Application stop - Containers are stopped.
# Application flex - Container start or container stop is invoked.
# Application destroy - Containers are removed.
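
To make the states above concrete, here is a minimal sketch of them as a small state machine.  The transition table is my assumption and is not taken from any YARN code.  Flex is treated as an operation that moves between launched and stopped rather than as a state of its own, and the names {{AppState}}, {{ALLOWED}} and {{transition}} are hypothetical.

```python
from enum import Enum


class AppState(Enum):
    SUBMITTED = "submitted"   # metadata persisted
    QUEUED = "queued"         # waiting for available resources
    LAUNCHED = "launched"     # containers initialized and started
    STOPPED = "stopped"       # containers stopped, but not removed
    DESTROYED = "destroyed"   # containers removed


# Assumed transitions. Flex is modelled as LAUNCHED <-> STOPPED.
ALLOWED = {
    AppState.SUBMITTED: {AppState.QUEUED, AppState.DESTROYED},
    AppState.QUEUED: {AppState.LAUNCHED, AppState.DESTROYED},
    AppState.LAUNCHED: {AppState.STOPPED, AppState.DESTROYED},
    AppState.STOPPED: {AppState.LAUNCHED, AppState.DESTROYED},
    AppState.DESTROYED: set(),
}


def transition(current: AppState, target: AppState) -> AppState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target
```

The point of the sketch is that stopped and destroyed are separate states: a stopped application can be launched again, while a destroyed one cannot.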

The key difference between docker and YARN is that YARN applications don't 
have long-term accumulated state, whereas a docker container is likely to be 
reused until it is decommissioned.  For now, we persist the yarnfile in HDFS to 
represent the state and configuration of the application, using slider code.  
Application flex and destroy are new operations that were introduced to mimic 
docker's stateful container interactions.  Can we use the new flex and destroy 
operations to trigger docker commands that perform cleanup?  Currently the 
answer is no, because the YARN container ID is hardwired to the docker 
container name.  We are forcing the docker container to behave like a YARN 
container, whose lifetime is short: it disappears as soon as the job is 
completed, failed, or killed.

If we change the docker container name from the YARN container ID to 
application name + YARN container ID, we can reuse the docker container 
without cleaning it up.  This enables us to suspend an application and resume 
it later.  The application destroy command can invoke {{docker rm -f}} to 
release the occupied resources.
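
As a rough illustration of the naming change, the sketch below joins the application name and the YARN container ID and checks the result against the character set docker accepts for container names.  The underscore separator, the helper name {{docker_container_name}} and the sample values are assumptions made for the example.

```python
import re

# Docker accepts container names that match this pattern.
_VALID_NAME = re.compile(r"[a-zA-Z0-9][a-zA-Z0-9_.-]+")


def docker_container_name(app_name: str, yarn_container_id: str) -> str:
    """Build '<app_name>_<yarn_container_id>' and validate it for docker."""
    name = f"{app_name}_{yarn_container_id}"  # separator is an assumption
    if not _VALID_NAME.fullmatch(name):
        raise ValueError(f"not a valid docker container name: {name!r}")
    return name


# Hypothetical application name and container ID.
print(docker_container_name("sleeper", "container_1510000000000_0001_01_000002"))
```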

If we agree on mapping the gaps, we can try the following:

Container initialization:
{{docker pull}}

Application start/flex -> container start:
{{docker run}}, or {{docker rename}} + {{docker start}} + attach.  Run docker 
in the foreground only to monitor the liveness of the child process.

Application stop -> container stop:
{{docker stop}}

Application destroy -> container cleanup:
{{docker rm -f}}
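
The mapping above could be sketched as a function that returns the docker command lines for each operation without executing them.  This is only an illustration: the operation names and the function {{docker_commands}} are hypothetical, and "start + attach" is read here as {{docker start --attach}}, which is my assumption.

```python
from typing import List, Optional


def docker_commands(operation: str, name: str, image: str = "",
                    previous_name: Optional[str] = None) -> List[List[str]]:
    """Return the docker argv lists for one lifecycle operation."""
    if operation == "init":
        return [["docker", "pull", image]]
    if operation == "start":
        if previous_name is None:
            # First launch: run in the foreground (no -d) and without --rm.
            return [["docker", "run", "--name", name, image]]
        # Resume: rename the stopped container, then start it attached.
        return [
            ["docker", "rename", previous_name, name],
            ["docker", "start", "--attach", name],
        ]
    if operation == "stop":
        return [["docker", "stop", name]]
    if operation == "destroy":
        return [["docker", "rm", "-f", name]]
    raise ValueError(f"unknown operation: {operation}")
```

Only destroy removes the container, so a docker restart or an application stop leaves it in place for a later resume.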

One downside of making YARN behave more like docker is that the docker 
container temp space may run out, because too many suspended applications are 
holding on to it.

> Improve handling of the Docker container life cycle
> ---------------------------------------------------
>
>                 Key: YARN-5366
>                 URL: https://issues.apache.org/jira/browse/YARN-5366
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: yarn
>            Reporter: Shane Kumpf
>            Assignee: Shane Kumpf
>              Labels: oct16-medium
>         Attachments: YARN-5366.001.patch, YARN-5366.002.patch, 
> YARN-5366.003.patch, YARN-5366.004.patch, YARN-5366.005.patch, 
> YARN-5366.006.patch
>
>
> There are several paths that need to be improved with regard to the Docker 
> container lifecycle when running Docker containers on YARN.
> 1) Provide the ability to keep a container on the NodeManager for a set 
> period of time for debugging purposes.
> 2) Support sending signals to the process in the container to allow for 
> triggering stack traces, heap dumps, etc.
> 3) Support for Docker's live restore, which means moving away from the use of 
> {{docker wait}}. (YARN-5818)
> 4) Improve the resiliency of liveliness checks (kill -0) by adding retries.
> 5) Improve the resiliency of container removal by adding retries.
> 6) Only attempt to stop, kill, and remove containers if the current container 
> state allows for it.
> 7) Better handling of short lived containers when the container is stopped 
> before the PID can be retrieved. (YARN-6305)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
