[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652547#comment-16652547
 ] 

Eric Yang commented on YARN-8489:
---------------------------------

[~leftnoteasy] A notebook can communicate with the PS or the workers via gRPC 
in the same way.  The example used gRPC to access a worker rather than assuming 
that the notebook is the PS.  The PS helps build the tasks that the workers are 
going to execute more efficiently.  The data scientist specifies the cluster 
spec in the notebook, and the parameter server partitions the model and tasks 
to increase worker effectiveness.   

We have digressed from the original goal of this JIRA.  My point is that 
dependency expressions and a refined YARN service state machine can achieve 
what you are proposing to do with an additional switch.  An additional switch 
may have unforeseen consequences for existing operations.  For example, what 
happens if the dominant component is offline during an upgrade?  Should the 
service terminate and clean up?  What about flexing the dominant component 
down to fewer nodes?  In what order are the dominant component and component 
dependencies evaluated?  How should restart policy be handled when a dominant 
component is in place?  It would be helpful to draw a state diagram explaining 
the proposal to see if this idea is worth pursuing. 
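
As a starting point for that discussion, here is a minimal sketch of the rule 
as I read it from the issue description.  All names in it (State, 
service_final_state, the dominant argument) are hypothetical and are not part 
of the YARN service API.  The fallback branch is also an assumption: it treats 
a service with no dominant component as finished only when every component 
has finished, and as failed if any component failed.  The sketch deliberately 
does not model upgrade, flex, dependencies or restart policy, which are the 
open questions above. 

```python
from enum import Enum
from typing import Dict, Optional


class State(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


FINAL_STATES = {State.SUCCEEDED, State.FAILED}


def service_final_state(components: Dict[str, State],
                        dominant: Optional[str] = None) -> Optional[State]:
    """Return the service's final state, or None while it should keep running.

    Hypothetical helper for discussion only; not YARN service code.
    """
    if dominant is not None:
        # Proposed rule: the dominant component alone decides the outcome,
        # regardless of what the other components are doing.
        state = components[dominant]
        return state if state in FINAL_STATES else None

    # Assumed fallback: wait until every component has reached a final state.
    if all(s in FINAL_STATES for s in components.values()):
        if State.FAILED in components.values():
            return State.FAILED
        return State.SUCCEEDED
    return None


# Use case 1 from the description: master finishes while the rest still run.
job = {
    "master": State.SUCCEEDED,
    "worker": State.RUNNING,
    "ps": State.RUNNING,
    "tensorboard": State.RUNNING,
}
with_dominant = service_final_state(job, dominant="master")
without_dominant = service_final_state(job)
```

With master marked as dominant, the service resolves to SUCCEEDED even though 
ps, workers and tensorboard are still running; without it, the function 
returns None and the service keeps running.  A state diagram would need to say 
what this function returns in each of the upgrade, flex and restart cases. 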

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
>
> The existing YARN service supports termination policies for different restart 
> policies. For example, ALWAYS means the service will not be terminated, and 
> NEVER means that if all components terminate, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out a 
> better name. Put simply, it means a component whose final state determines 
> the job's final state regardless of other components.
> Use cases: 
> 1) A TensorFlow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, whether succeeded or failed, we should terminate 
> ps/tensorboard/workers and then mark the job as succeeded/failed. 
> 2) Not sure if this is a real-world use case: a service has multiple 
> components, some of which are not restartable. For such services, if a 
> component fails, we should mark the whole service as failed. 



