[
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652420#comment-16652420
]
Eric Yang edited comment on YARN-8489 at 10/16/18 8:49 PM:
-----------------------------------------------------------
[~leftnoteasy] {quote}We will not support notebook and distributed TF job
running in the service. I don't hear open source community like jupyter has
support of this (connecting to a running distributed TF job and use it as
executor). And I didn't see TF claims to support this or plan to support.{quote}
Jupyter notebook is part of the official TensorFlow Docker image, and the
architecture is [explained|https://www.tensorflow.org/extend/architecture] in
the official [distributed TensorFlow|https://www.tensorflow.org/deploy/distributed]
documentation.
Here is an example of how to run distributed TensorFlow with Jupyter notebook
as a YARN service:
{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal" : {
    "principal_name" : "hbase/[email protected]",
    "keytab" : "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components" :
  [
    {
      "name": "jupyter",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "ps",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "launch_command": "python ps.py",
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "worker",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "launch_command": "python worker.py",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
        }
      },
      "restart_policy": "NEVER"
    }
  ]
}
{code}
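The three components in the spec above differ only in their name, launch command, and the override flag, so the spec could also be generated rather than written by hand. Below is a minimal Python sketch of that idea. The helper names `make_component` and `make_service_spec` are hypothetical, the field values are copied from the example, and the `kerberos_principal` section is left out on purpose (it would have to be added for a secure cluster).

```python
import json


def make_component(name, launch_command=None, override_disable=False):
    """Build one component entry mirroring the example spec above."""
    component = {
        "name": name,
        "number_of_containers": 1,
        "run_privileged_container": True,
        "artifact": {"id": "tensorflow/tensorflow:1.10.1", "type": "DOCKER"},
        "resource": {"cpus": 1, "memory": "256"},
        "configuration": {
            "env": {
                "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":
                    "true" if override_disable else "false"
            }
        },
        "restart_policy": "NEVER",
    }
    # The jupyter component has no launch_command in the example spec.
    if launch_command is not None:
        component["launch_command"] = launch_command
    return component


def make_service_spec():
    """Assemble the service with the same three components as the example."""
    return {
        "name": "tensorflow-service",
        "version": "1.0",
        "components": [
            make_component("jupyter", override_disable=True),
            make_component("ps", "python ps.py"),
            make_component("worker", "python worker.py"),
        ],
    }


if __name__ == "__main__":
    print(json.dumps(make_service_spec(), indent=2))
```

Scaling the workers would then be a matter of changing `number_of_containers` in one place instead of editing each JSON stanza.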
ps.py
{code}
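import tensorflow as tf

# cluster (a tf.train.ClusterSpec) and FLAGS are assumed to be defined earlier in ps.py.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
server.join()  # Block and serve requests from the other tasks.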
{code}
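The ps.py snippet uses `cluster` without showing how it is built. One way it might be set up is sketched below. The host names, port 2222, and the helper name `build_cluster_dict` are assumptions for illustration, not values taken from the service. The resulting dictionary has the shape that `tf.train.ClusterSpec` accepts in TensorFlow 1.x.

```python
def build_cluster_dict(ps_hosts, worker_hosts, port=2222):
    """Map each job name to a list of "host:port" task addresses."""
    return {
        "ps": ["%s:%d" % (host, port) for host in ps_hosts],
        "worker": ["%s:%d" % (host, port) for host in worker_hosts],
    }


# Hypothetical host names; real ones depend on how the containers are resolved.
cluster_dict = build_cluster_dict(["ps0.example.com"], ["worker0.example.com"])
print(cluster_dict)

# With TensorFlow 1.x installed, ps.py could then do:
#   cluster = tf.train.ClusterSpec(cluster_dict)
#   server = tf.train.Server(cluster, job_name="ps", task_index=0)
```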
In the Jupyter notebook, the user can write code on the fly:
{code}
# train_op is assumed to be defined earlier in the notebook.
with tf.Session("grpc://worker7.example.com:2222") as sess:
    for _ in range(10000):
        sess.run(train_op)
{code}
Isn't this the easiest way to iterate in a notebook without going through
ps/worker setup on every iteration? The only thing the user needs to write is
worker.py, which is use-case driven. Am I missing something?
> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
> Issue Type: Task
> Components: yarn-native-services
> Reporter: Wangda Tan
> Priority: Major
>
> The existing YARN service supports a termination policy for different restart
> policies. For example, ALWAYS means the service will not be terminated, and NEVER
> means that once all components have terminated, the service will be terminated.
> The name "dominant" might not be the most appropriate; we can figure out a better
> name. Put simply, it means a component whose final state
> determines the job's final state regardless of the other components.
> Use cases:
> 1) A TensorFlow job has master/worker/services/tensorboard. Once the master reaches a
> final state, whether it succeeded or failed, we should terminate
> ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if it is a real-world use case: a service that has multiple
> components, some of which are not restartable. For such services, if a
> component fails, we should mark the whole service as failed.
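To make the proposal in the description concrete, the termination rule could be sketched roughly as follows. This is only an illustration, not YARN code: the state names and the function `service_final_state` are hypothetical, and the "any failure fails the service" rule for the non-dominant case is an assumption.

```python
SUCCEEDED = "SUCCEEDED"
FAILED = "FAILED"
RUNNING = "RUNNING"
FINAL_STATES = {SUCCEEDED, FAILED}


def service_final_state(component_states, dominant=None):
    """Return the service's final state, or None while it should keep running.

    component_states maps a component name to its current state.
    """
    if dominant is not None:
        state = component_states[dominant]
        # The dominant component alone decides, regardless of the others.
        return state if state in FINAL_STATES else None
    # Without a dominant component, wait until every component has finished
    # (the restart_policy NEVER behaviour from the description).
    if all(s in FINAL_STATES for s in component_states.values()):
        # Assumption: a single failed component marks the service as failed.
        return FAILED if FAILED in component_states.values() else SUCCEEDED
    return None


states = {"master": SUCCEEDED, "ps": RUNNING, "worker": RUNNING}
print(service_final_state(states, dominant="master"))
print(service_final_state(states))
```

With `master` as the dominant component the service finishes as soon as the master does, while without one it keeps waiting on the still-running ps and worker.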