[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652453#comment-16652453
 ] 

Wangda Tan commented on YARN-8489:
----------------------------------

[~eyang], 

This is a bit different from Spark executors. 

For Spark, from an external view, it is a fully managed service which can run 
tasks inside the Spark executors. Livy is only responsible for sending code to 
the Spark service and waiting for the result.
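To illustrate the point above: a Livy-style client only builds small REST payloads and hands the code off to the Spark service. This is a minimal sketch against Livy's documented REST endpoints (`POST /sessions`, `POST /sessions/{id}/statements`); the host/port and the submitted snippet are hypothetical.

```python
# Sketch: the payloads a Livy client sends to a running Spark service.
# Livy documents POST /sessions and POST /sessions/{id}/statements;
# the endpoint URL and example code string below are made up.
import json

LIVY_URL = "http://livy-host:8998"  # hypothetical endpoint


def create_session_payload(kind="pyspark"):
    # Body for POST /sessions: asks Livy to start an interactive Spark session.
    return {"kind": kind}


def create_statement_payload(code):
    # Body for POST /sessions/{id}/statements: the code Livy forwards to Spark.
    return {"code": code}


session_req = create_session_payload()
stmt_req = create_statement_payload("spark.range(100).count()")

# The actual submission would be something like:
#   requests.post(f"{LIVY_URL}/sessions", data=json.dumps(session_req),
#                 headers={"Content-Type": "application/json"})
print(json.dumps(session_req), json.dumps(stmt_req))
```

The client never manages Spark's internal processes; it only waits for the statement's result, which is what makes Spark a fully managed service from the outside.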

For TF, the PS can be deployed outside of the workers as you showed, but 
computation is still executed inside the worker. In your example, it runs 
inside the notebook. 

Separate PS deployment is not a widely used feature. AFAIK, only Google 
deploys that way internally, partly because they have super-large models 
that require a distributed PS.

The separate PS deployment approach is not easy to manage; it requires users 
to modify their source code, etc. And for most use cases, people avoid the 
distributed model given it is very hard to manage, serve, etc.
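The source-code changes mentioned above look roughly like this: with TensorFlow's standard distributed-training convention, every process must export a `TF_CONFIG` describing the whole cluster plus its own role. A minimal sketch (the hostnames and ports are made up):

```python
# Minimal sketch of the per-process boilerplate needed when the PS is
# deployed separately from the workers, using TensorFlow's standard
# TF_CONFIG convention. Hostnames/ports below are hypothetical.
import json
import os

cluster = {
    "ps": ["ps0.example.com:2222"],            # parameter server(s)
    "worker": ["worker0.example.com:2222",     # training workers
               "worker1.example.com:2222"],
}


def make_tf_config(task_type, task_index):
    # Each process exports a TF_CONFIG with the full cluster spec and its
    # own role -- this is the code change users must make per deployment.
    return json.dumps({"cluster": cluster,
                       "task": {"type": task_type, "index": task_index}})


# e.g. on the first worker:
os.environ["TF_CONFIG"] = make_tf_config("worker", 0)
```

Keeping this spec in sync across separately deployed PS and worker processes is part of what makes the separate-PS approach hard to manage.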

 

After talking to many companies, for Submarine, in the short to mid term, I 
prefer to only support PS within each job.

To your concern:

 
{quote}Isn't this the easiest way to iterate in notebook without going through 
ps/worker setup per iteration? The only thing that user needs to write is 
worker.py which is use case driven. Am I missing something?
{quote}
The easiest way is not to handle the PS at all from the notebook; users can 
choose Keras, etc. to build their model inside the notebook. Handling 
separate PS logic inside the notebook is just overhead for users.

> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
>
> Existing YARN services support termination policies tied to the different 
> restart policies. For example, ALWAYS means the service will not be 
> terminated, and NEVER means that once all components have terminated, the 
> service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out a 
> better name. But simply put, it means a dominant component whose final state 
> determines the job's final state regardless of the other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, no matter whether it succeeded or failed, we should 
> terminate ps/tensorboard/workers and mark the job succeeded/failed. 
> 2) Not sure if it is a real-world use case: a service which has multiple 
> components, some of which are not restartable. For such services, if such a 
> component fails, we should mark the whole service as failed. 
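For reference, the per-component restart policies the issue builds on are expressed in the YARN service spec roughly like this (a sketch only; the job/component names and container counts are made up, and a "dominant" marker would be a new field on top of this, not an existing one):

```json
{
  "name": "tf-job",
  "components": [
    { "name": "master", "number_of_containers": 1, "restart_policy": "NEVER" },
    { "name": "worker", "number_of_containers": 4, "restart_policy": "NEVER" },
    { "name": "ps",     "number_of_containers": 2, "restart_policy": "ALWAYS" }
  ]
}
```

Under the proposal, marking "master" as the dominant component would mean its final state decides the job's final state, at which point ps/worker containers are torn down regardless of their own policies.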



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
