[jira] [Commented] (YARN-8561) [Submarine] Add initial implementation: training job submission and job history retrieve.

Wangda Tan (JIRA) Tue, 07 Aug 2018 11:16:10 -0700


    [ 
https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16572087#comment-16572087
 ]


Wangda Tan commented on YARN-8561:
----------------------------------

Thanks [~sunilg],
1. Addressed.

2. I think we can rely on yarn app -kill for now; we can add more cleanup 
steps, etc. in the future.

3. The reason I wrote a different one is that UnitsConversionUtil is not 
straightforward for users: why does G mean 1000 while Gi means 1024? It is 
going to be very hard to update UnitsConversionUtil because of compatibility 
issues. Also, we don't need so many units; m/M/g/G will be enough.

4. IIRC, the capacity scheduler matcher checks whether an absolute resource is 
being used or not; it is not for parsing. I think the two configs are slightly 
different in syntax. (Actually, I don't remember what the exact differences 
are, but to be more flexible, I suggest keeping it as is.)

5. The reasons I kept the name JobState are: 
- It's under the submarine package.
- It's not likely that we will use mapreduce.JobState and submarine.JobState 
(and other classes like JobStatus, etc.) in the same class.

6. I think we can push this to a future patch. One possible solution is to 
include a YAML file that describes the job configs, so users can reuse it 
instead of passing 10+ params to the CLI.

7. Done.

8. I'm not quite sure about this suggestion. It seems to me that we should add 
the getServiceResourceFromYarnResource method to service.Resource instead. I 
don't want to touch any service classes in this patch. Should we do it in a 
separate JIRA?

9. To me it is fine, since we will print the generated scripts and users can 
use {{hadoop fs -cat}} to view the files easily. Thoughts?

10. Done. We now throw an exception when an issue occurs.

11. This depends on YARN-8488. Once YARN-8488 is committed, we need to update 
this (in a separate JIRA).

12. Done.

13. Do you mean increasing it while the job is running? For TF, this is not 
allowed.

The previous Jenkins report is gone; I will address the Jenkins-reported 
issues in the next patch.
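
To illustrate the point in item 3, here is a minimal sketch of a parser that 
accepts only the m/M/g/G suffixes. It is not the code in the patch: the class 
name SimpleMemoryParser, the method name parseMemoryToMb, the choice that 
1g = 1024m, and the treatment of a bare number as megabytes are all 
assumptions made for illustration.

```java
/**
 * Hypothetical helper (not part of the patch): parses memory strings that use
 * only the m/M/g/G suffixes into megabytes.
 */
final class SimpleMemoryParser {

    private SimpleMemoryParser() {
    }

    /**
     * Parses values such as "512m", "2G" or "1024" into megabytes.
     * Assumptions: g/G means 1024 MB, and a value without a suffix is in MB.
     */
    static long parseMemoryToMb(String value) {
        if (value == null || value.trim().isEmpty()) {
            throw new IllegalArgumentException("Memory value must not be empty");
        }
        String trimmed = value.trim();
        char last = Character.toLowerCase(trimmed.charAt(trimmed.length() - 1));

        long multiplier;
        String number;
        if (Character.isDigit(last)) {
            // No suffix: assume the value is already in MB.
            multiplier = 1L;
            number = trimmed;
        } else if (last == 'm') {
            multiplier = 1L;
            number = trimmed.substring(0, trimmed.length() - 1);
        } else if (last == 'g') {
            multiplier = 1024L;
            number = trimmed.substring(0, trimmed.length() - 1);
        } else {
            throw new IllegalArgumentException(
                "Unsupported unit in '" + value + "', expected m/M/g/G");
        }

        long amount;
        try {
            amount = Long.parseLong(number.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                "Invalid memory value '" + value + "'", e);
        }
        if (amount < 0) {
            throw new IllegalArgumentException(
                "Memory value must not be negative: '" + value + "'");
        }
        // multiplyExact throws ArithmeticException on overflow
        // instead of silently wrapping around.
        return Math.multiplyExact(amount, multiplier);
    }
}
```

Keeping the set of accepted suffixes this small means a user never has to 
distinguish between decimal and binary units on the command line.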

> [Submarine] Add initial implementation: training job submission and job 
> history retrieve.
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-8561
>                 URL: https://issues.apache.org/jira/browse/YARN-8561
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Major
>         Attachments: YARN-8561.001.patch, YARN-8561.002.patch
>
>
> Added the following parts:
> 1) New subcomponent of YARN, under the applications/ project. 
> 2) Tensorflow training job submission, including training (single node and 
> distributed). 
> - Support Docker containers. 
> - Support GPU isolation. 
> - Support YARN registry DNS.
> 3) Retrieve job history.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
