[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-12-11 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716694#comment-16716694
 ] 

Zac Zhou edited comment on YARN-8489 at 12/11/18 9:47 AM:
--

[~suma.shivaprasad], any updates? Or would you mind if I take this over, since 
this JIRA blocks terminating submarine jobs gracefully.


was (Author: yuan_zac):
@[~suma.shivaprasad], any updates? Or would you mind if I take this over, since 
this JIRA blocks terminating submarine jobs gracefully.

> Need to support "dominant" component concept inside YARN service
> 
>
> Key: YARN-8489
> URL: https://issues.apache.org/jira/browse/YARN-8489
> Project: Hadoop YARN
>  Issue Type: Task
>  Components: yarn-native-services
>Reporter: Wangda Tan
>Assignee: Suma Shivaprasad
>Priority: Major
>
> Existing YARN services support termination behavior driven by component restart 
> policies. For example, ALWAYS means the service will never be terminated, and 
> NEVER means the service will be terminated once all components have terminated.
> The name "dominant" might not be the most appropriate one; we can figure out 
> better names. But simply put, it means a dominant component whose final state 
> determines the job's final state, regardless of other components.
> Use cases: 
> 1) A Tensorflow job has master/worker/services/tensorboard. Once the master 
> reaches a final state, whether it succeeded or failed, we should terminate 
> ps/tensorboard/workers and mark the job succeeded/failed accordingly. 
> 2) Not sure if it is a real-world use case: a service that has multiple 
> components, some of which are not restartable. For such a service, if such a 
> component fails, we should mark the whole service as failed. 






[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652420#comment-16652420
 ] 

Eric Yang edited comment on YARN-8489 at 10/16/18 8:55 PM:
---

[~leftnoteasy] {quote}We will not support a notebook and a distributed TF job 
running in the service. I haven't heard of an open source community project like 
jupyter supporting this (connecting to a running distributed TF job and using it 
as an executor), and I didn't see TF claim to support this or plan to support it.{quote}

Jupyter notebook is part of the official TensorFlow Docker image, and the 
architecture is [explained|https://www.tensorflow.org/extend/architecture] in the 
official [distributed TensorFlow|https://www.tensorflow.org/deploy/distributed] 
documentation. 

Here is an example of how to run distributed TensorFlow with a Jupyter notebook 
as a YARN service:

{code}
{
  "name": "tensorflow-service",
  "version": "1.0",
  "kerberos_principal": {
    "principal_name": "hbase/_h...@example.com",
    "keytab": "file:///etc/security/keytabs/hbase.service.keytab"
  },
  "components": [
    {
      "name": "jupyter",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "true"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "ps",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "launch_command": "python ps.py",
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "false"
        }
      },
      "restart_policy": "NEVER"
    },
    {
      "name": "worker",
      "number_of_containers": 1,
      "run_privileged_container": true,
      "artifact": {
        "id": "tensorflow/tensorflow:1.10.1",
        "type": "DOCKER"
      },
      "launch_command": "python worker.py",
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE": "false"
        }
      },
      "restart_policy": "NEVER"
    }
  ]
}
{code}

ps.py
{code}
import tensorflow as tf

# cluster (a tf.train.ClusterSpec) and FLAGS are assumed to be defined earlier in ps.py.
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,
                         task_index=FLAGS.task_index)
server.join()
{code}

In the Jupyter notebook, the user can write code on the fly:
{code}
with tf.Session("grpc://worker-0.example.com:") as sess:
  for _ in range(1):
    sess.run(train_op)
{code}
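
For reference, train_op has to be defined earlier in the notebook against the 
same cluster. A minimal, purely illustrative sketch follows; the cluster host 
names, the toy model, and the use of replica_device_setter are assumptions for 
illustration and are not part of the original comment:

{code}
import tensorflow as tf

# Hypothetical cluster layout; in practice the hosts come from the YARN service
# DNS names of the ps and worker components defined in the spec above.
cluster = tf.train.ClusterSpec({
    "ps": ["ps-0.tensorflow-service.example.com:2222"],
    "worker": ["worker-0.tensorflow-service.example.com:2222"]
})

# Pin variables to the ps job and ops to the worker job, then build a toy train_op.
with tf.device(tf.train.replica_device_setter(
        worker_device="/job:worker/task:0", cluster=cluster)):
    w = tf.Variable(0.0, name="w")
    loss = tf.square(w - 5.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(loss)
{code}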

Isn't this the easiest way to iterate in a notebook without going through the 
ps/worker setup for every iteration?  The only thing the user needs to write is 
worker.py, which is use-case driven.  Am I missing something?
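
For completeness, a minimal worker.py for this layout could simply start a gRPC 
server and wait, with the notebook driving the actual training. This is a hedged 
sketch; the host names, port, and task index are assumptions, and a real 
worker.py would be shaped by the use case:

{code}
import tensorflow as tf

# Same assumed cluster layout as in the notebook sketch above.
cluster = tf.train.ClusterSpec({
    "ps": ["ps-0.tensorflow-service.example.com:2222"],
    "worker": ["worker-0.tensorflow-service.example.com:2222"]
})

# The worker only hosts a gRPC server; the notebook session drives training on it.
server = tf.train.Server(cluster, job_name="worker", task_index=0)
server.join()
{code}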



[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-16 Thread Wangda Tan (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16652271#comment-16652271
 ] 

Wangda Tan edited comment on YARN-8489 at 10/16/18 7:42 PM:


[~eyang],

Basically, there are four modes in submarine for training jobs.

1) A single-node notebook runs single-node TF training:

The user has a single-node notebook in which they can do whatever they want. The 
TF job runs inside the notebook and is not visible to submarine.

2) A single-node notebook launches distributed TF training:

Even though this doesn't exist today, it could be supported in the future, for 
example by adding a submarine interpreter to Zeppelin. However, the notebook 
service and the TF jobs would not belong to the same service, so this statement 
is not true:
{quote} It would be a bad user experience if the jupyter notebook and all work 
suddenly disappear when one ps server fails.
{quote}
3) Distributed TF job w/o notebook.

4) Single-node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the same 
service. I haven't heard of an open source community project like jupyter 
supporting this (connecting to a running distributed TF job and using it as an 
executor), and I didn't see TF claim to support this or plan to support it.

And even if the TF/notebook community supports this case next year or so, the 
notebook and the executors should belong to two separate services, just like the 
relationship between Jupyter and Spark.


was (Author: leftnoteasy):
[~eyang],

Basically, there are four modes in submarine for training jobs.

1) A single-node notebook runs single-node TF training:

The user has a single-node notebook in which they can do whatever they want. The 
TF job runs inside the notebook and is not visible to submarine.

2) A single-node notebook launches distributed TF training:

Even though this doesn't exist today, it could be supported in the future, for 
example by adding a submarine interpreter to Zeppelin. However, the notebook 
service and the TF jobs would not belong to the same service, so this statement 
is not true:
{quote} It would be a bad user experience if the jupyter notebook and all work 
suddenly disappear when one ps server fails.
{quote}
3) Distributed TF job w/o notebook.

4) Single-node TF job w/o notebook.

We will not support a notebook and a distributed TF job running in the service. 
I haven't heard of an open source community project like jupyter supporting this 
(connecting to a running distributed TF job and using it as an executor), and I 
didn't see TF claim to support this or plan to support it.

And even if the TF/notebook community supports this case, the notebook and the 
executors should belong to two separate services, just like the relationship 
between Jupyter and Spark.







[jira] [Comment Edited] (YARN-8489) Need to support "dominant" component concept inside YARN service

2018-10-15 Thread Eric Yang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16650876#comment-16650876
 ] 

Eric Yang edited comment on YARN-8489 at 10/15/18 10:16 PM:


We might be able to refine our existing definitions to enable this without 
defining an additional restart policy or state.  Suppose a service has two 
components, A and B, where B depends on A and component A has 
restart_policy=NEVER.  If component A fails, the AM will toggle component A's 
state to FLEXING, and component B will continue to run.  The service is most 
likely no longer working once it reaches this state, so we may want to shut down 
the service to match the expected behavior in this JIRA.


was (Author: eyang):
We might be able to refine our existing definitions to enable this without 
defining additional restart policy or state.  If a service has two components 
defined, component A and B.  B depends on A.  Component A restart_policy=NEVER. 
 If component A failed, AM will toggle component A state to FLEXING, and 
component B continues to run.  Service is most likely not working anymore when 
it reach this state.  We may want to shutdown the service to match the expected 
behavior in this JIRA.
