[
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652453#comment-16652453
]
Wangda Tan commented on YARN-8489:
----------------------------------
[~eyang],
This is a bit different from Spark executors.
For Spark, from an external view, it is a fully managed service that runs
tasks inside the Spark executors. Livy is only responsible for sending code to
the Spark service and waiting for the result.
For TF, the PS can be deployed outside of the workers, as you showed, but the
computation is still executed inside the worker. In your example, it is inside
the notebook.
Separate PS deployment is not a widely used feature. AFAIK, only Google
deploys that way internally, partly because they have very large models that
require a distributed PS.
The separate PS deployment approach is not easy to manage: it requires users to
modify their source code, etc. For most use cases, people avoid the
distributed model because it is very hard to manage, serve, etc.
After talking to many companies, for Submarine, in the short to mid term, I
prefer to only support PS within each job.
To your concern:
{quote}Isn't this the easiest way to iterate in notebook without going through
ps/worker setup per iteration? The only thing that user needs to write is
worker.py which is use case driven. Am I missing something?
{quote}
The easiest way is not to handle PS from the notebook at all; users can choose
Keras, etc. to build their model inside the notebook. Handling separate PS
logic inside the notebook is just overhead for users.
> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
> Issue Type: Task
> Components: yarn-native-services
> Reporter: Wangda Tan
> Priority: Major
>
> The existing YARN service supports termination policies for different restart
> policies. For example, ALWAYS means the service will not be terminated, and
> NEVER means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out a
> better name. Put simply, it means a dominant component whose final state
> determines the job's final state regardless of other components.
> Use cases:
> 1) A TensorFlow job has master/worker/services/tensorboard. Once the master
> reaches a final state, whether succeeded or failed, we should terminate
> ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if this is a real-world use case: a service that has multiple
> components, some of which are not restartable. For such services, if a
> component fails, we should mark the whole service as failed.
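
To make the "dominant" idea from the description concrete, here is a minimal sketch in Python of use case 1. It is not the YARN service API: the names `State`, `job_final_state`, and `components_to_terminate` are hypothetical, and the sketch assumes a single dominant component per job.

```python
from enum import Enum
from typing import Dict, List, Optional


class State(Enum):
    """Hypothetical component states, for illustration only."""
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


FINAL_STATES = {State.SUCCEEDED, State.FAILED}


def job_final_state(components: Dict[str, State], dominant: str) -> Optional[State]:
    """Return the job's final state, or None while the dominant component runs.

    Only the dominant component is consulted; the states of the other
    components do not affect the result.
    """
    state = components[dominant]
    return state if state in FINAL_STATES else None


def components_to_terminate(components: Dict[str, State], dominant: str) -> List[str]:
    """List the non-dominant components that are still running and should be
    stopped once the dominant component has reached a final state."""
    if components[dominant] not in FINAL_STATES:
        return []
    return sorted(
        name
        for name, state in components.items()
        if name != dominant and state not in FINAL_STATES
    )


if __name__ == "__main__":
    # Use case 1: the master finishes while the other components keep running.
    job = {
        "master": State.SUCCEEDED,
        "worker": State.RUNNING,
        "ps": State.RUNNING,
        "tensorboard": State.RUNNING,
    }
    print(job_final_state(job, "master"))
    print(components_to_terminate(job, "master"))
```

In this sketch the job is marked succeeded or failed as soon as the master finishes, and ps, tensorboard, and worker are the components left to stop.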