[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16652420#comment-16652420
 ] 

Eric Yang edited comment on YARN-8489 at 10/16/18 8:55 PM:
-----------------------------------------------------------

[~leftnoteasy] {quote}We will not support notebook and distributed TF job 
running in the service. I don't hear open source community like jupyter has 
support of this (connecting to a running distributed TF job and use it as 
executor). And I didn't see TF claims to support this or plan to support.{quote}

Jupyter notebook is part of the official TensorFlow Docker image, and the 
architecture is [explained|https://www.tensorflow.org/extend/architecture] in 
the official [distributed TensorFlow|https://www.tensorflow.org/deploy/distributed] 
documentation. 

Here is an example of how to run distributed TensorFlow with a Jupyter notebook 
as a YARN service:

{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal" : {
    "principal_name" : "hbase/_h...@example.com",
    "keytab" : "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components" :
  [
    {
      "name": "jupyter",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "ps",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "launch_command": "python ps.py",
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "worker",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "launch_command": "python worker.py",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"false"
        }
      },
      "restart_policy": "NEVER"
    }
  ]
}
{code}
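
Assuming the spec above is saved as a file, e.g. {{tensorflow-service.json}} (the 
file name is just a placeholder), it can be submitted with the YARN service CLI, 
e.g. {{yarn app -launch tensorflow-service tensorflow-service.json}}. With 
{{YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE}} set to true, the jupyter 
component runs the image's default entrypoint (the notebook server), while the ps 
and worker components set it to false so their {{launch_command}} is used instead.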

ps.py
{code}
import tensorflow as tf

# cluster (a tf.train.ClusterSpec) and FLAGS (job_name, task_index) are
# assumed to be defined earlier in ps.py.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
server.join()
{code}
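
worker.py can be just as minimal when the Jupyter notebook acts as the TensorFlow 
client, as in the snippet below. A hypothetical sketch (the ps/worker host names, 
port, and task index are assumptions for illustration):
{code}
import tensorflow as tf

# Hypothetical worker.py: the worker only hosts a tf.train.Server and waits;
# the training loop is driven from the Jupyter notebook client.
cluster = tf.train.ClusterSpec({
    "ps": ["ps-0.example.com:2222"],         # assumed ps address
    "worker": ["worker-0.example.com:2222"]  # assumed worker address
})
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()
{code}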

In the Jupyter notebook, the user can write code on the fly:
{code}
# train_op is assumed to be defined in an earlier notebook cell;
# the session target is the worker container's gRPC endpoint.
with tf.Session("grpc://worker-0.example.com:2222") as sess:
  for _ in range(10000):
    sess.run(train_op)
{code}

Isn't this the easiest way to iterate in a notebook without going through the 
ps/worker setup on every iteration?  The only thing the user needs to write is 
worker.py, which is use-case driven.  Am I missing something?


> Need to support "dominant" component concept inside YARN service
> ----------------------------------------------------------------
>
>                 Key: YARN-8489
>                 URL: https://issues.apache.org/jira/browse/YARN-8489
>             Project: Hadoop YARN
>          Issue Type: Task
>          Components: yarn-native-services
>            Reporter: Wangda Tan
>            Priority: Major
>
> Existing YARN services support termination policies tied to the components' 
> restart policies. For example, ALWAYS means the service will not be terminated, 
> and NEVER means the service is terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out a better 
> name. Simply put, it means a dominant component whose final state determines the 
> job's final state, regardless of the other components.
> Use cases: 
> 1) A TensorFlow job has master/worker/services/tensorboard components. Once the 
> master reaches a final state, whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job succeeded/failed accordingly. 
> 2) Not sure if this is a real-world use case: a service has multiple components, 
> some of which are not restartable. For such a service, if one of those 
> components fails, we should mark the whole service as failed. 


