[ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16463097#comment-16463097
 ] 

Suma Shivaprasad edited comment on YARN-8080 at 5/4/18 5:04 AM:
----------------------------------------------------------------

Thanks [~gsaha] for the reviews and offline discussions on the patch. While 
testing the flex scenarios suggested by [~gsaha] with the patch, I ran into 
the following issues.


What does flexing up/down a component with "restart_policy": NEVER or 
"restart_policy": ON_FAILURE mean?

Consider the following scenario where a component has 4 instances configured 
and restart_policy="NEVER". Assume that 2 of these containers have exited 
successfully after execution and 2 are still running.

1. Flex up
If the user now flexes the number of containers to 3, should we even support 
flexing up in this case? For example, the service could be a TensorFlow DAG 
(YARN-8135), where flexing up may or may not make sense unless the TensorFlow 
client needs more resources and is able to make use of the newly allocated 
containers (like the dynamic allocation use case in Spark). [~leftnoteasy] 
could comment on this. We could add a flag to the YARN service spec to allow 
or disallow flexing, so that users can choose to disallow it for specific 
apps.
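
As a rough illustration of the flag idea above, the following Python sketch 
gates flex requests on a per-component setting. The names ComponentSpec, 
allow_flex and validate_flex are hypothetical and exist only for this example; 
they are not part of the YARN service spec or of the patch.

```python
from dataclasses import dataclass


@dataclass
class ComponentSpec:
    """Hypothetical, simplified view of a component in a service spec."""
    name: str
    number_of_containers: int
    restart_policy: str = "ALWAYS"
    allow_flex: bool = True  # assumed flag, not an existing spec field


def validate_flex(spec: ComponentSpec, requested: int) -> int:
    """Return the requested count if flexing is permitted, else raise."""
    if requested < 0:
        raise ValueError("number of containers cannot be negative")
    if not spec.allow_flex:
        raise ValueError(f"flexing is disabled for component {spec.name!r}")
    return requested


# A job-like component that opts out of flexing.
worker = ComponentSpec("worker", 4, restart_policy="NEVER", allow_flex=False)
try:
    validate_flex(worker, 3)
except ValueError as err:
    print(err)
```

Whether such a flag should live on the component or on the whole service is 
an open design choice; the sketch puts it on the component only to keep the 
example small.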

2. Flex down
Flex down for such services also needs to consider the current number of 
running containers (instead of the configured number of containers, which is 
the current behaviour) and scale down accordingly. For example, if the 
component's instance count is set to 1 during flex, bring the number of 
running containers down to 1.
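
To make the difference concrete for the scenario above (4 configured, 2 
running, flex to 1), here is a small Python sketch. The function name 
containers_to_stop is made up for this example, and the "configured" variant 
only reflects my reading of the current behaviour, not the actual service 
code.

```python
def containers_to_stop(current: int, target: int) -> int:
    """Containers to stop so that at most `target` remain out of `current`."""
    if current < 0 or target < 0:
        raise ValueError("counts cannot be negative")
    return max(0, current - target)


configured, running, target = 4, 2, 1

# Delta based on the running containers, as proposed above.
print(containers_to_stop(running, target))     # 1

# Delta based on the configured count asks for more stops than
# there are running containers in this scenario.
print(containers_to_stop(configured, target))  # 3
```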

[~billie.rinaldi] [~leftnoteasy] [~gsaha] [~eyang] Thoughts?










> YARN native service should support component restart policy
> -----------------------------------------------------------
>
>                 Key: YARN-8080
>                 URL: https://issues.apache.org/jira/browse/YARN-8080
>             Project: Hadoop YARN
>          Issue Type: Task
>            Reporter: Wangda Tan
>            Assignee: Suma Shivaprasad
>            Priority: Critical
>         Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch, 
> YARN-8080.007.patch
>
>
> Existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support broader use cases, we need to allow users to specify the restart 
> policy of a component. Propose to have the following policies:
> 1) Always: containers are always restarted by the framework regardless of 
> container exit status. This is the existing/default behavior.
> 2) Never: do not restart containers in any case after the container 
> finishes. This supports job-like workloads (for example, a Tensorflow 
> training job). If a task exits with code == 0, we should not restart the 
> task. This can be used by services which are not restartable/recoverable.
> 3) On-failure: similar to above, but only restart tasks with exit code != 0. 
>
> Behaviors after a component *instance* finalizes (Succeeded or Failed when 
> restart_policy != ALWAYS): 
> 1) For single component, single instance: complete the service.
> 2) For single component, multiple instances: other running instances from the 
> same component won't be affected by the finalized component instance. The 
> service will be terminated once all instances are finalized. 
> 3) For multiple components: the service will be terminated once all 
> components are finalized.
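
The policies and termination rules quoted above can be sketched in a few 
lines of Python. This is only an illustration of the described semantics, not 
the patch itself; the function names, the state strings and the dictionary 
layout are assumptions made for the example.

```python
FINAL_STATES = {"SUCCEEDED", "FAILED"}


def should_restart(policy: str, exit_code: int) -> bool:
    """Decide whether a finished container is restarted under `policy`."""
    if policy == "ALWAYS":
        return True
    if policy == "NEVER":
        return False
    if policy == "ON_FAILURE":
        return exit_code != 0
    raise ValueError(f"unknown restart policy: {policy!r}")


def service_finished(instance_states: dict) -> bool:
    """True once every instance of every component is in a final state.

    `instance_states` maps a component name to a list of instance states.
    """
    return all(
        state in FINAL_STATES
        for states in instance_states.values()
        for state in states
    )


print(should_restart("ON_FAILURE", 0))  # False
print(should_restart("ON_FAILURE", 1))  # True

# Two instances done, two still running: the service keeps going.
print(service_finished({"worker": ["SUCCEEDED", "SUCCEEDED",
                                   "RUNNING", "RUNNING"]}))  # False
```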



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

