[ 
https://issues.apache.org/jira/browse/YARN-4876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15330496#comment-15330496
 ] 

Arun Suresh commented on YARN-4876:
-----------------------------------

Aggregating and posting some design points on the patch based on offline 
discussions with [~marco.rabozzi] :

h4. ContainerImpl state machine
In the current patch, containers that are initialized using the new 
initializeContainers APIs keep waiting for startContainers requests within the 
LOCALIZED state after resource localization. When the START_CONTAINER event is 
generated upon request from the application master, the container transitions 
to a new LAUNCHING state and waits for a CONTAINER_LAUNCHED event (fired 
asynchronously by ContainerLaunch when the container process is being started). 
Upon receiving the CONTAINER_LAUNCHED event, the container state is updated to 
RUNNING. For containers that do not allow multi-start (i.e. those that are 
initialized and started using the standard startContainers API), the 
START_CONTAINER event is automatically sent after localization.

The role of the new “LAUNCHING” state is to make a clear distinction between 
the following two situations:
# The container has been localized and is waiting for a start request 
(LOCALIZED state)
# The container has received a start request and it is being started (LAUNCHING 
state)
In this fashion, we can allow a start (or a restart) of an idle container only 
if the container is in the LOCALIZED state and if it allows multi-start. 

From a first analysis, it seems that the new LAUNCHING state and the already 
present RELAUNCHING state could be merged into a single LAUNCHING state to 
reduce the state machine complexity.

The destroyContainers API is equivalent to stopContainers if the specified 
containers do not allow multi-start. On the other hand, in case of a container 
that allows multi-start, the stopContainers API kills the container process and 
reverts the container state machine to “LOCALIZED”. However, in order to 
properly catch the termination of a container process for which a stop request 
has been issued, an additional “STOPPING” state has been inserted. If the 
container is in RUNNING state and it allows multi-start, the application master 
can issue a stopContainers request upon which the container state is updated to 
STOPPING and an asynchronous request to kill the container process is sent. 
Within the STOPPING state, as in the KILLING state, the container 
termination events (CONTAINER_EXITED_WITH_SUCCESS, CONTAINER_KILLED_ON_REQUEST, 
CONTAINER_EXITED_WITH_FAILURE) are treated as a successful container stop, 
upon which the container state reverts to LOCALIZED.
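
The transitions described in this section can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual ContainerImpl code: the class name, the STOP_CONTAINER event name, and the DONE state (standing in for the existing kill/cleanup path) are assumptions made for illustration, and events that do not apply to the current state are silently ignored.

```java
import java.util.EnumSet;

/** Hypothetical, simplified model of the multi-start transitions; not the real ContainerImpl. */
final class MultiStartContainerSketch {
    enum State { LOCALIZED, LAUNCHING, RUNNING, STOPPING, DONE }

    enum Event {
        START_CONTAINER, CONTAINER_LAUNCHED,
        STOP_CONTAINER, // assumed name for the event raised by a stopContainers request
        CONTAINER_EXITED_WITH_SUCCESS, CONTAINER_KILLED_ON_REQUEST, CONTAINER_EXITED_WITH_FAILURE
    }

    private static final EnumSet<Event> TERMINATION_EVENTS = EnumSet.of(
            Event.CONTAINER_EXITED_WITH_SUCCESS,
            Event.CONTAINER_KILLED_ON_REQUEST,
            Event.CONTAINER_EXITED_WITH_FAILURE);

    private final boolean allowsMultiStart;
    private State state = State.LOCALIZED;

    MultiStartContainerSketch(boolean allowsMultiStart) {
        this.allowsMultiStart = allowsMultiStart;
    }

    State getState() {
        return state;
    }

    /** Applies one event and returns the resulting state. */
    State handle(Event event) {
        switch (state) {
            case LOCALIZED:
                // A start (or restart) is only accepted while idle in LOCALIZED.
                if (event == Event.START_CONTAINER) {
                    state = State.LAUNCHING;
                }
                break;
            case LAUNCHING:
                if (event == Event.CONTAINER_LAUNCHED) {
                    state = State.RUNNING;
                }
                break;
            case RUNNING:
                if (event == Event.STOP_CONTAINER) {
                    // Multi-start containers wait in STOPPING for the process to exit;
                    // others follow the existing kill path, collapsed here into DONE.
                    state = allowsMultiStart ? State.STOPPING : State.DONE;
                } else if (TERMINATION_EVENTS.contains(event)) {
                    state = State.DONE;
                }
                break;
            case STOPPING:
                // Any termination event counts as a successful stop.
                if (TERMINATION_EVENTS.contains(event)) {
                    state = State.LOCALIZED;
                }
                break;
            default:
                break;
        }
        return state;
    }
}
```

In this sketch a multi-start container that is stopped while RUNNING ends up back in LOCALIZED, from where another START_CONTAINER is accepted.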

h4. Working directory cleanup
When a container is in the LOCALIZED state and multi-start is enabled, the 
application master can issue the following 3 new types of requests:
# StartContainers (ContainerLaunchContext == NULL)
# InitializeContainers
# StartContainers (ContainerLaunchContext != NULL)

In case 1) the container is simply started using the ContainerLaunchContext 
issued in the previous InitializeContainers request (the state machine 
transitions for this case are the ones described in the previous section). 
Cases 2) and 3) both perform reinitialization and relocalization of container 
resources; the only difference is that in 3) the container is also started 
after relocalization. Currently, when the container is reinitialized, the 
container working directory is deleted to ensure a clean state for subsequent 
container starts. We could relax this behavior and allow the application 
master to specify a deletion policy for container reinitialization. Depending 
on the requirements, we might want to address this aspect here or in a 
follow-up JIRA.
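
As a rough illustration of the three cases, the sketch below maps a request received in the LOCALIZED state to the action taken. The class, enum, and method names are hypothetical and do not come from the patch; the launch context is represented by a plain Object so the example stays self-contained.

```java
/** Hypothetical helper; models how a request in the LOCALIZED state is interpreted. */
final class LocalizedRequestSketch {
    enum Action { START_WITH_PREVIOUS_CONTEXT, REINITIALIZE_ONLY, REINITIALIZE_AND_START }

    /**
     * @param isStartRequest true for StartContainers, false for InitializeContainers
     * @param launchContext  stand-in for the ContainerLaunchContext carried by the request
     */
    static Action resolve(boolean isStartRequest, Object launchContext) {
        if (!isStartRequest) {
            return Action.REINITIALIZE_ONLY;                 // case 2
        }
        return launchContext == null
                ? Action.START_WITH_PREVIOUS_CONTEXT         // case 1
                : Action.REINITIALIZE_AND_START;             // case 3
    }

    /** Current behavior as described above: any reinitialization wipes the working directory. */
    static boolean deletesWorkingDirectory(Action action) {
        return action != Action.START_WITH_PREVIOUS_CONTEXT;
    }
}
```

A deletion policy supplied by the application master would presumably replace the fixed rule in deletesWorkingDirectory.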

h4. Log handling
Currently, there is no special handling of logs for a restarted container. The 
application master can decide either to append the new logs to the old ones or 
overwrite the old logs. This can be simply achieved by changing the launch 
command (e.g. in Linux use “>>” to append and “>” to overwrite).
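
For example, an application master could build the launch command along these lines. This is a minimal sketch: the helper class and method are hypothetical, and only the shell redirection operators themselves come from the comment above.

```java
/** Hypothetical helper showing how the launch command selects append vs. overwrite. */
final class LogRedirectSketch {
    /** Returns the command with stdout redirected to logFile, appending if requested. */
    static String withStdoutRedirect(String command, String logFile, boolean append) {
        String operator = append ? ">>" : ">";
        return command + " " + operator + " " + logFile;
    }
}
```

With this helper, withStdoutRedirect("./run.sh", "<LOG_DIR>/stdout", true) keeps the output of earlier runs, while passing false starts a fresh log on every restart.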

h4. Token expiration
Both the InitializeContainers and the StartContainers APIs require a container 
token to authorize the request. For long running containers, the token might 
expire and the application master won’t be able to request a restart or a 
reinitialization of a container. This limitation currently also applies to the 
IncreaseContainerResource API. We might need to address container token renewal 
in a separate JIRA.

h4. Recovery for containers that allow multi-start
The current patch does not fully support recovery of containers that allow 
multi-start. After a restart of the NodeManager, if the container is not 
running, the NodeManager cannot distinguish between a stopped container 
waiting for a start request and a container that completed its execution 
successfully. Additional information in the state store might be needed to 
handle this case.

h4. Auxiliary Service Data
In the current YARN implementation, a CONTAINER_INIT and an APPLICATION_INIT 
event are sent to the auxiliary services every time a new container is 
initialized. With the new initializeContainers API, it is possible to 
reinitialize a container multiple times even without actually starting it. The 
current implementation of the patch sends a CONTAINER_INIT and an 
APPLICATION_INIT event for every reinitialization of a container (potentially 
sending new data to the auxiliary services). We should verify whether this 
behavior is correct or needs to be modified.

h4. Container failures handling
In the current patch implementation, if a container fails during a 
reinitialization, the container is destroyed. On the other hand, if the 
container fails within the STOPPING state, this is considered a successful 
stop. Should we allow the application master to specify a policy for how 
failures are handled while stopping and reinitializing?

h4. Destroy container monitor
The proposed patch allows the application master to specify a destroyDelay: an 
idle container that is not started within this timeout is destroyed 
automatically. The destroy logic is not yet implemented in the current patch. 
We might need to implement a “destroy containers monitor” service that checks 
for containers to destroy at a configurable time interval. 
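
One possible shape for such a monitor is sketched below. It is only an illustration under stated assumptions: the class and method names are hypothetical, container IDs are plain strings, and the caller supplies the clock so the logic can be exercised deterministically. It is not part of the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical bookkeeping for idle containers that should be destroyed after destroyDelay. */
final class DestroyMonitorSketch {
    // Container ID -> absolute time (ms) at which the idle container becomes eligible for destroy.
    private final Map<String, Long> deadlines = new ConcurrentHashMap<>();

    /** Called when a container enters the idle (LOCALIZED) state. */
    void containerBecameIdle(String containerId, long nowMillis, long destroyDelayMillis) {
        deadlines.put(containerId, nowMillis + destroyDelayMillis);
    }

    /** Called when a start request arrives, so the container is no longer idle. */
    void containerStarted(String containerId) {
        deadlines.remove(containerId);
    }

    /** Returns and forgets the containers whose deadline has passed. */
    List<String> collectExpired(long nowMillis) {
        List<String> expired = new ArrayList<>();
        for (Map.Entry<String, Long> entry : deadlines.entrySet()) {
            // remove(key, value) only succeeds if the deadline was not updated concurrently.
            if (entry.getValue() <= nowMillis
                    && deadlines.remove(entry.getKey(), entry.getValue())) {
                expired.add(entry.getKey());
            }
        }
        return expired;
    }
}
```

A periodic task, for instance one scheduled with ScheduledExecutorService.scheduleWithFixedDelay at the configured interval, could call collectExpired(System.currentTimeMillis()) and issue a destroy for each returned container.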

h4. Uploaded resource
During container relocalization, do we need specific logic for resources that 
are uploaded to the shared cache? Currently, before localizing the new 
resources, the old container local resources are released. Do we also have to 
clear the resourcesUploadPolicies map of ContainerImpl during relocalization?


> [Phase 1] Decoupled Init / Destroy of Containers from Start / Stop
> ------------------------------------------------------------------
>
>                 Key: YARN-4876
>                 URL: https://issues.apache.org/jira/browse/YARN-4876
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Arun Suresh
>            Assignee: Marco Rabozzi
>         Attachments: YARN-4876-design-doc.pdf, YARN-4876.002.patch, 
> YARN-4876.01.patch
>
>
> Introduce *initialize* and *destroy* container API into the 
> *ContainerManagementProtocol* and decouple the actual start of a container 
> from the initialization. This will allow AMs to re-start a container without 
> having to lose the allocation.
> Additionally, if the localization of the container is associated with the 
> initialize (and the cleanup with the destroy), this can also be used by 
> applications to upgrade a Container by *re-initializing* with a new 
> *ContainerLaunchContext*


