[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716694#comment-16716694 ]

Zac Zhou edited comment on YARN-8489 at 12/11/18 9:47 AM:
--
[~suma.shivaprasad] Any updates? Or would you mind if I take it over, as this JIRA blocks terminating Submarine jobs gracefully?

> Need to support "dominant" component concept inside YARN service
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
> Issue Type: Task
> Components: yarn-native-services
> Reporter: Wangda Tan
> Assignee: Suma Shivaprasad
> Priority: Major
>
> The existing YARN service supports termination policies for different restart
> policies. For example, ALWAYS means the service will not be terminated, and NEVER
> means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate; we can figure out a better
> name. Put simply, it means a component whose final state
> determines the job's final state regardless of other components.
> Use cases:
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master reaches a
> final state, no matter whether it succeeded or failed, we should terminate
> ps/tensorboard/workers and then mark the job as succeeded/failed.
> 2) Not sure if it is a real-world use case: a service has multiple
> components, some of which are not restartable. For such services, if a
> component fails, we should mark the whole service as failed.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652420#comment-16652420 ]

Eric Yang edited comment on YARN-8489 at 10/16/18 8:55 PM:
---
[~leftnoteasy]
{quote}We will not support notebook and distributed TF job running in the service. I don't hear open source community like jupyter has support of this (connecting to a running distributed TF job and use it as executor). And I didn't see TF claims to support this or plan to support.{quote}

Jupyter notebook is part of the official Tensorflow Docker image, and the architecture is [explained|https://www.tensorflow.org/extend/architecture] in the official [distributed Tensorflow|https://www.tensorflow.org/deploy/distributed] document.

Here is an example of how to run distributed Tensorflow with Jupyter notebook on YARN service:
{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal": {
    "principal_name": "hbase/_h...@example.com",
    "keytab": "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components": [
    {
      "name": "jupyter",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "ps",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "launch_command": "python ps.py",
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "false"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "worker",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "launch_command": "python worker.py",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "false"
        }
      },
      "restart_policy": "NEVER"
    }
  ]
}
{code}

ps.py
{code}
import tensorflow as tf

# cluster and FLAGS are assumed to be defined earlier in ps.py
server = tf.train.Server(cluster, job_name=FLAGS.job_name, task_index=FLAGS.task_index)
server.join()
{code}

In the Jupyter notebook, the user can write code on the fly:
{code}
with tf.Session("grpc://worker-0.example.com:") as sess:
    for _ in range(1):
        sess.run(train_op)
{code}

Isn't this the easiest way to iterate in a notebook without going through ps/worker setup per iteration? The only thing the user needs to write is worker.py, which is use-case driven. Am I missing something?
[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652271#comment-16652271 ]

Wangda Tan edited comment on YARN-8489 at 10/16/18 7:42 PM:

[~eyang], basically there are four modes in Submarine for training jobs.

1) A single-node notebook runs single-node TF training: the user has a single-node notebook and can do whatever they want in it. The TF job runs inside the notebook and is not visible to Submarine.
2) A single-node notebook launches distributed TF training: this doesn't exist today, but it could be supported in the future, for example by adding a Submarine interpreter to Zeppelin. However, the notebook service and the TF jobs do not belong to the same service, so this statement is not true:
{quote}It would be bad user experience, if jupyter notebook and all work suddenly disappear when one ps server failed.{quote}
3) Distributed TF job without a notebook.
4) Single-node TF job without a notebook.

We will not support a notebook and a distributed TF job running in the same service. I haven't heard that open source communities such as Jupyter support this (connecting to a running distributed TF job and using it as an executor), and I haven't seen TF claim to support it or plan to support it. Even if the TF/notebook communities support this case in the next year or so, the notebook and the executors should belong to two separate services, just like the relationship between Jupyter and Spark.
[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service
[ https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650876#comment-16650876 ]

Eric Yang edited comment on YARN-8489 at 10/15/18 10:16 PM:

We might be able to refine our existing definitions to enable this without defining an additional restart policy or state. Suppose a service defines two components, A and B, where B depends on A and component A has restart_policy=NEVER. If component A fails, the AM will toggle component A's state to FLEXING, and component B continues to run. The service is most likely no longer working once it reaches this state. We may want to shut down the service to match the expected behavior in this JIRA.
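
For readers following the thread, here is a minimal Python sketch of the two termination rules under discussion: the "dominant" component rule from the issue description, and the dependency-based refinement suggested in the comment above. It is not YARN code. The `dominant` and `depends_on` fields, the `State` enum, and the `service_final_state` function are hypothetical names chosen for illustration, and the choice of FAILED versus SUCCEEDED when all NEVER components have finished is an assumption the thread does not settle.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class State(Enum):
    RUNNING = "RUNNING"
    SUCCEEDED = "SUCCEEDED"
    FAILED = "FAILED"


@dataclass
class Component:
    name: str
    restart_policy: str = "NEVER"  # "ALWAYS" or "NEVER", as in the service spec
    dominant: bool = False         # hypothetical flag, not an existing spec field
    depends_on: List[str] = field(default_factory=list)  # hypothetical field
    state: State = State.RUNNING


def service_final_state(components: List[Component]) -> Optional[State]:
    """Return the service's final state, or None if it should keep running."""
    by_name = {c.name: c for c in components}

    # Rule 1 (issue description): a dominant component's final state decides
    # the service's final state, regardless of the other components.
    for c in components:
        if c.dominant and c.state is not State.RUNNING:
            return c.state

    # Rule 2 (refinement from the comment): if a component with
    # restart_policy=NEVER failed and another component depends on it,
    # shut the whole service down as failed.
    for c in components:
        for dep_name in c.depends_on:
            dep = by_name[dep_name]
            if dep.restart_policy == "NEVER" and dep.state is State.FAILED:
                return State.FAILED

    # Existing behaviour per the issue: with NEVER, the service terminates
    # once all components have terminated. Assumption: any failure marks
    # the service as failed.
    if components and all(
        c.restart_policy == "NEVER" and c.state is not State.RUNNING
        for c in components
    ):
        if any(c.state is State.FAILED for c in components):
            return State.FAILED
        return State.SUCCEEDED
    return None


# Usage: a TensorFlow-style job where the master is the dominant component.
job = [Component("master", dominant=True), Component("ps"), Component("worker")]
print(service_final_state(job))   # master still running
job[0].state = State.SUCCEEDED
print(service_final_state(job))   # master finished, ps/worker still running
```

The sketch checks the dominant rule first so that a finished master ends the job even while ps and worker are still running, which matches use case 1 in the issue description. Rule 2 needs no new restart policy, only dependency information, which is the point of the suggested refinement.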