Wangda Tan commented on YARN-8135:


Thanks for the responses, 
{quote}what does w/o modification mean?{quote}
It means a vanilla TF program can run on the framework without being modified.
{quote}As far as Kubeflow is deployed in the same cluster as Hadoop, Kubeflow 
should be able to access HDFS, through libhdfs or webhdfs interface?{quote}
Since TensorFlow supports reading from HDFS, ideally every platform can support 
this :). What I meant is that reading HDFS from TF needs a lot of configuration, 
and it takes some specific optimizations / considerations to make HDFS access 
from a Docker container easier. Our ongoing prototype covers part of this problem.
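To give a rough idea of the configuration involved, here is a minimal sketch of the environment a container typically needs before TensorFlow can open hdfs:// paths through libhdfs. The function name and all paths are hypothetical, the libjvm.so location assumes a JDK 8 layout on amd64, and this is not taken from our prototype.

```python
def hdfs_env_for_tensorflow(java_home, hadoop_home, hadoop_classpath, base_env=None):
    """Return a copy of base_env extended with the variables libhdfs usually needs.

    All arguments are Linux paths inside the container (hypothetical values).
    """
    env = dict(base_env or {})
    env["JAVA_HOME"] = java_home
    env["HADOOP_HDFS_HOME"] = hadoop_home

    # Assumption: JDK 8 on amd64 keeps libjvm.so here; other JDKs differ.
    jvm_dir = java_home + "/jre/lib/amd64/server"
    # Hadoop's native libraries (libhdfs.so) usually live under lib/native.
    native_dir = hadoop_home + "/lib/native"

    parts = [jvm_dir, native_dir]
    if env.get("LD_LIBRARY_PATH"):
        parts.append(env["LD_LIBRARY_PATH"])  # keep whatever was already set
    env["LD_LIBRARY_PATH"] = ":".join(parts)

    # Expected to hold the expanded Hadoop jars and config directory.
    env["CLASSPATH"] = hadoop_classpath
    return env
```

In practice the classpath value would normally be the output of `hadoop classpath --glob`, and the resulting dict would be passed as the environment of the training process. Kerberos tickets and Hadoop config files mounted into the image are further concerns that this sketch leaves out.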
{quote}ToS kind of supports GPU scheduling (not isolation) base on memory: if 
you ask for 1 GPU and a machine has 4 GPU, it asks for total memory * the 
portion of GPU you asked.{quote}
This is not easy for users and cannot guarantee proper isolation, so I didn't 
put a (√) for ToS.
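To make the arithmetic in the quote concrete, here is a minimal sketch of a memory-proportional request. The function name, the MB unit, and the rounding are assumptions for illustration; this is not ToS code.

```python
def memory_for_gpu_share(total_memory_mb, gpus_on_machine, gpus_requested):
    """Memory to request so that it stands in for a share of the machine's GPUs."""
    if gpus_on_machine <= 0:
        raise ValueError("machine must have at least one GPU")
    if not 0 < gpus_requested <= gpus_on_machine:
        raise ValueError("requested GPUs must be between 1 and gpus_on_machine")
    # total memory * the portion of GPUs asked for (rounded down to whole MB)
    return total_memory_mb * gpus_requested // gpus_on_machine


# 1 GPU on a 4-GPU machine with 64 GB of memory
print(memory_for_gpu_share(65536, 4, 1))  # 16384
```

The scheduler only sees a memory request here, so nothing prevents another container with a small memory footprint from landing on the same GPU, which is why this counts as scheduling rather than isolation.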


> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -------------------------------------------------------------------------------------------------------------
>                 Key: YARN-8135
>                 URL: https://issues.apache.org/jira/browse/YARN-8135
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Major
>         Attachments: image-2018-04-09-14-35-16-778.png, 
> image-2018-04-09-14-44-41-101.png
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* TensorFlow jobs 
> on YARN.
>  - Allow jobs to easily access data/models in HDFS and other storage systems.
>  - Launch services to serve TensorFlow/MXNet models.
>  - Support running distributed TensorFlow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPU and other resources.
>  - Support launching TensorBoard if the user requests it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006)
> *Why this name?*
>  - Because a submarine is the only vehicle that lets humans explore deep 
> places. B-)
> Comparison with other projects:
> !image-2018-04-09-14-44-41-101.png!
> *Notes:*
> *GPU isolation in the XLearning project is achieved by a patched YARN, which is 
> different from the community’s GPU isolation solution.
> **XLearning needs a few modifications to read ClusterSpec from env.
> *References:*
>  - TensorFlowOnSpark (Yahoo): [https://github.com/yahoo/TensorFlowOnSpark]
>  - TensorFlowOnYARN (Intel): 
> [https://github.com/Intel-bigdata/TensorFlowOnYARN]
>  - Spark Deep Learning (Databricks): 
> [https://github.com/databricks/spark-deep-learning]
>  - XLearning (Qihoo360): [https://github.com/Qihoo360/XLearning]
>  - Kubeflow (Google): [https://github.com/kubeflow/kubeflow]
