[ 
https://issues.apache.org/jira/browse/YARN-8135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16431333#comment-16431333
 ] 

Wangda Tan commented on YARN-8135:
----------------------------------

I'm currently working on a design doc and a prototype, and will share more
details in the next several days.

> Hadoop {Submarine} Project: Simple and scalable deployment of deep learning 
> training / serving jobs on Hadoop
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8135
>                 URL: https://issues.apache.org/jira/browse/YARN-8135
>             Project: Hadoop YARN
>          Issue Type: New Feature
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Major
>         Attachments: image-2018-04-09-14-35-16-778.png
>
>
> Description:
> *Goals:*
>  - Allow infra engineers / data scientists to run *unmodified* Tensorflow jobs 
> on YARN.
>  - Allow jobs to easily access data/models in HDFS and other storage systems.
>  - Support launching services to serve Tensorflow/MXNet models.
>  - Support running distributed Tensorflow jobs with simple configs.
>  - Support running user-specified Docker images.
>  - Support specifying GPUs and other resources.
>  - Support launching TensorBoard if the user requests it.
>  - Support customized DNS names for roles (like tensorboard.$user.$domain:6006).
> *Why this name?*
>  - Because a submarine is the only vehicle that can take humans to deep places. B-)
> Comparison with other projects:
> !image-2018-04-09-14-35-16-778.png!
> *Notes:*
> * GPU isolation in the XLearning project is achieved with a patched YARN, which 
> differs from the community's GPU isolation solution.
> ** XLearning requires a few modifications to jobs so they read the ClusterSpec from the environment.
> *References:*
> - TensorFlowOnSpark (Yahoo): https://github.com/yahoo/TensorFlowOnSpark
> - TensorFlowOnYARN (Intel): https://github.com/Intel-bigdata/TensorFlowOnYARN
> - Spark Deep Learning (Databricks): 
> https://github.com/databricks/spark-deep-learning
> - XLearning (Qihoo360): https://github.com/Qihoo360/XLearning
> - Kubeflow (Google): https://github.com/kubeflow/kubeflow
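
For context on the XLearning note above (jobs reading the ClusterSpec from the environment), here is a minimal sketch of that pattern. It assumes a TF_CONFIG-style JSON variable, the convention TensorFlow's distributed APIs read, holding a "cluster" map of role name to "host:port" list plus a "task" entry. Whether XLearning or Submarine export exactly this variable is an assumption, and the helper name read_cluster_spec and the hostnames are hypothetical.

```python
import json
import os
from typing import Dict, List, Tuple


def read_cluster_spec(env_var: str = "TF_CONFIG") -> Tuple[Dict[str, List[str]], str, int]:
    """Hypothetical helper: parse a TF_CONFIG-style JSON string from the environment.

    Returns (cluster, task_type, task_index), where `cluster` maps a role
    name (e.g. "worker", "ps") to its list of "host:port" addresses.
    """
    raw = os.environ.get(env_var)
    if raw is None:
        raise KeyError(f"environment variable {env_var} is not set")
    config = json.loads(raw)
    cluster = config["cluster"]
    task = config["task"]
    return cluster, task["type"], int(task["index"])


if __name__ == "__main__":
    # Simulate what a launcher (assumed) would export before starting the job.
    os.environ["TF_CONFIG"] = json.dumps({
        "cluster": {
            "worker": ["worker-0.example:2222", "worker-1.example:2222"],
            "ps": ["ps-0.example:2222"],
        },
        "task": {"type": "worker", "index": 1},
    })
    cluster, task_type, task_index = read_cluster_spec()
    # Prints: worker 1 worker-1.example:2222
    print(task_type, task_index, cluster[task_type][task_index])
    # In a real TensorFlow job, `cluster` could be passed to tf.train.ClusterSpec(cluster).
```

Keeping the address list in the environment means the launcher, not the training script, decides where each role runs. The goal of running *unmodified* jobs amounts to having the launcher export whatever the framework already reads, so the script needs no such parsing code.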



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
