[ 
https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464158#comment-16464158
 ] 

Eric Yang commented on YARN-8080:
---------------------------------

[~suma.shivaprasad] Thank you for the patch.

Flex is a black-box operation: it has no knowledge of why an application
needs more or fewer containers.  Therefore, it relies on the user/program to
make the decision.  Here are possible uses for each case:

Retry policy = NEVER and Flex Up
A data scientist might be training on a dataset and find that the output
produced by the first two completed containers is insufficient, so he would
like to run more training iterations on the same dataset.  The input
parameters could stay the same, with more of the same iterations performed in
parallel.  A flex operation comes in handy here: flex up to reach the desired
state of 4 containers (2 currently running and 2 additional containers).  This
can produce more data models for him in the same run.

Retry policy = NEVER and Flex down
A system administrator asks the data scientist to save system resources for his
bitcoin mining operation.  Flexing down could mean freeing system resources now
and performing the ML training iterations in a later run.

Retry policy = ON_FAILURE and Flex Up
Consider a case where the container workload is stateful, such as SparkSQL
translating a query into multiple partitions.  The SparkSQL driver can decide
whether it wants to attempt multiple retries on failure with a smaller dataset
to ensure query completion.  It may decide to change some hint file on HDFS to
reduce the workload computed per container, and increase the number of
containers to complete the query computation.

Retry policy = ON_FAILURE and Flex down
In some cases, merging data from many partitions at the same time might hit an
unbalanced dataset that prevents the merge from happening.  The SparkSQL driver
might decide to use an alternate technique that merges using fewer containers.
In this case, the YARN Service AM reduces the container count and lets the
Spark executor program communicate directly with the Spark driver program to
compute with the alternate strategy.

There are possible use cases for each scenario, and we provide the knobs
to enable each of them.  Applications need some additional programming
to take advantage of this advanced feature.  I also agree that some stateful
programs might not work with certain combinations of retry policy and flex
operation, and we provide an option to disable flex for such programs.
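
To make the "black box" point concrete, here is a minimal sketch in Python of
what a flex request amounts to: the caller supplies a desired container count,
and the framework only computes the difference from what is running.  The
names plan_flex, FlexPlan and flex_enabled are hypothetical and chosen for
illustration; this is not the YARN Service API or the code in the patch.

```python
from dataclasses import dataclass


@dataclass
class FlexPlan:
    """Hypothetical result of a flex request: containers to add or remove."""
    to_launch: int
    to_stop: int


def plan_flex(running: int, desired: int, flex_enabled: bool = True) -> FlexPlan:
    """Compute how to move from `running` containers to `desired` containers.

    The function knows nothing about why the count changes; that decision
    belongs to the user or program issuing the flex request.
    """
    if not flex_enabled:
        # Models the option to disable flex for programs that cannot handle it.
        raise ValueError("flex is disabled for this component")
    if running < 0 or desired < 0:
        raise ValueError("container counts must be non-negative")
    delta = desired - running
    return FlexPlan(to_launch=max(delta, 0), to_stop=max(-delta, 0))


# Flex up from the example above: 2 running, desired state of 4.
flex_up = plan_flex(running=2, desired=4)
# Flex down to save resources: 4 running, desired state of 2.
flex_down = plan_flex(running=4, desired=2)
```

In this sketch the restart policy does not appear at all, which is the point:
whether a flexed container is retried after it exits is a separate decision.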

> YARN native service should support component restart policy
> -----------------------------------------------------------
>
>                 Key: YARN-8080
>                 URL: https://issues.apache.org/jira/browse/YARN-8080
>             Project: Hadoop YARN
>          Issue Type: Task
>            Reporter: Wangda Tan
>            Assignee: Suma Shivaprasad
>            Priority: Critical
>         Attachments: YARN-8080.001.patch, YARN-8080.002.patch, 
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch, 
> YARN-8080.007.patch
>
>
> Existing native service assumes the service is long running and never 
> finishes. Containers will be restarted even if exit code == 0. 
> To support broader use cases, we need to allow the restart policy of a
> component to be specified by users. Propose to have the following policies:
> 1) Always: containers are always restarted by the framework regardless of
> container exit status. This is the existing/default behavior.
> 2) Never: Do not restart containers in any case after a container finishes.
> This supports job-like workloads (for example a Tensorflow training job). If
> a task exits with code == 0, we should not restart the task. This can be used
> by services which are not restartable/recoverable.
> 3) On-failure: Similar to above, only restart tasks with exit code != 0.
>
> Behaviors after a component *instance* is finalized (Succeeded or Failed when
> restart_policy != ALWAYS):
> 1) For single component, single instance: complete the service.
> 2) For single component, multiple instances: other running instances from the
> same component won't be affected by the finalized component instance. The
> service will be terminated once all instances are finalized.
> 3) For multiple components: the service will be terminated once all
> components are finalized.
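
The policies and termination rules in the quoted description above can be
sketched as follows.  This is an illustrative Python model only: the names
RestartPolicy, should_restart and service_should_terminate are hypothetical,
and the actual implementation lives in the attached patches.

```python
from enum import Enum
from typing import Dict, List


class RestartPolicy(Enum):
    ALWAYS = "ALWAYS"
    NEVER = "NEVER"
    ON_FAILURE = "ON_FAILURE"


def should_restart(policy: RestartPolicy, exit_code: int) -> bool:
    """Decide whether a finished container should be restarted."""
    if policy is RestartPolicy.ALWAYS:
        return True             # restart regardless of exit status
    if policy is RestartPolicy.NEVER:
        return False            # job-like workload: never restart
    return exit_code != 0       # ON_FAILURE: restart only failed containers


def service_should_terminate(components: Dict[str, List[bool]]) -> bool:
    """Return True once every instance of every component is finalized.

    `components` maps a component name to one flag per instance, where True
    means the instance has finalized (Succeeded or Failed).  This covers the
    single-instance, multi-instance and multi-component cases alike.
    """
    return all(all(flags) for flags in components.values())
```

For example, should_restart(RestartPolicy.ON_FAILURE, 0) is False, and a
service with one finalized and one still-running instance is not terminated.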



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

