[ https://issues.apache.org/jira/browse/YARN-8080?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16464158#comment-16464158 ]
Eric Yang commented on YARN-8080:
---------------------------------

[~suma.shivaprasad] Thank you for the patch.

Flex is a black-box operation: it is not aware of why an application requires more or fewer containers. It therefore relies on the user or program to make the decision. Here are possible uses for each case:

Retry policy = NEVER and Flex up
A data scientist training on a dataset might find that the output produced by the first two completed containers is insufficient, and would like more iterations of training on the same dataset. The input parameters can stay the same while more of the same iterations run in parallel. A flex operation comes in handy here: flexing up to a desired state of 4 containers (2 currently running and 2 additional containers) produces more data models in the same run.

Retry policy = NEVER and Flex down
A system administrator asks the data scientist to save system resources for his bitcoin mining operation. Flexing down saves system resources now, and the ML training iterations can be performed in a later run.

Retry policy = ON_FAILURE and Flex up
This applies where the container workload is stateful, such as SparkSQL translating a query into multiple partitions. The SparkSQL driver can decide whether to attempt multiple retries on failure with a smaller dataset to ensure the query completes. It may decide to increase the number of containers and change a hint file on HDFS, reducing the workload computed per container so that the query computation can complete.

Retry policy = ON_FAILURE and Flex down
In some cases, merging data from many partitions at the same time runs into an unbalanced dataset that prevents the merge from happening. The SparkSQL driver might decide to use an alternate technique that merges using fewer containers.
In this case, the YARN Service AM reduces the container count and lets the Spark executor program communicate directly with the Spark driver program to compute using the alternate strategy.

There are possible use cases for each scenario, and we provide the knobs to enable each of them. Applications need some additional programming to take advantage of this advanced feature. I also agree that some stateful programs might not work with certain combinations of retry policy and flex operation, and we provide an option to disable flex for such programs.

> YARN native service should support component restart policy
> -----------------------------------------------------------
>
>                 Key: YARN-8080
>                 URL: https://issues.apache.org/jira/browse/YARN-8080
>             Project: Hadoop YARN
>          Issue Type: Task
>            Reporter: Wangda Tan
>            Assignee: Suma Shivaprasad
>            Priority: Critical
>         Attachments: YARN-8080.001.patch, YARN-8080.002.patch,
> YARN-8080.003.patch, YARN-8080.005.patch, YARN-8080.006.patch,
> YARN-8080.007.patch
>
>
> The existing native service assumes the service is long running and never
> finishes. Containers will be restarted even if exit code == 0.
> To support broader use cases, we need to allow users to specify the restart
> policy of a component. We propose the following policies:
> 1) Always: containers are always restarted by the framework regardless of
> container exit status. This is the existing/default behavior.
> 2) Never: do not restart containers in any case after a container finishes.
> This supports job-like workloads (for example a Tensorflow training job). If a
> task exits with code == 0, we should not restart the task. This can be used by
> services which are not restartable/recoverable.
> 3) On-failure: similar to the above, but only restart tasks with exit code != 0.
> Behaviors after a component *instance* finalizes (Succeeded or Failed when
> restart_policy != ALWAYS):
> 1) For single component, single instance: complete service.
> 2) For single component, multiple instances: other running instances from the
> same component won't be affected by the finalized component instance. The
> service will be terminated once all instances have finalized.
> 3) For multiple components: the service will be terminated once all components
> have finalized.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
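
The three restart policies and the finalization rules in the issue description can be sketched as a small decision function. This is only an illustration of the rules as written above, not the code in the patch: the names RestartPolicy, should_restart and service_finished are hypothetical, and the sketch assumes that every instance of every component must finalize before the service terminates.

```python
from enum import Enum
from typing import Dict, List


class RestartPolicy(Enum):
    """Hypothetical enum mirroring the three policies in the description."""
    ALWAYS = "ALWAYS"
    ON_FAILURE = "ON_FAILURE"
    NEVER = "NEVER"


def should_restart(policy: RestartPolicy, exit_code: int) -> bool:
    """Return True if a container that exited should be restarted."""
    if policy is RestartPolicy.ALWAYS:
        # Existing/default behavior: restart regardless of exit status.
        return True
    if policy is RestartPolicy.ON_FAILURE:
        # Restart only when the container failed (non-zero exit code).
        return exit_code != 0
    # NEVER: job-like workloads are not restarted in any case.
    return False


def service_finished(components: Dict[str, List[bool]]) -> bool:
    """Return True once every instance of every component has finalized.

    `components` maps a component name to one flag per instance, where
    True means the instance has finalized (Succeeded or Failed).
    Assumption: a finalized instance does not affect its running siblings,
    so the service ends only when all flags are True.
    """
    return all(all(flags) for flags in components.values())
```

For example, should_restart(RestartPolicy.ON_FAILURE, 0) is False, so a training container that exits cleanly stays finished, while the same call with a non-zero exit code asks for a restart.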