[
https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575464#comment-16575464
]
Wangda Tan commented on YARN-8561:
----------------------------------
Thanks [~sunilg]
For your additional comments:
1. I think we can register a CLI plugin for this, the same way RunJobCli is
registered to "job" "run"; the same goes for the help message. I would prefer
to do this in a separate patch.
2. I'm not sure we have a better solution for this. A better idea may be to
use YAML to define the types and auto-generate the ...Parameter class and
parser. If we don't move ...Parameter to the same definition, we need to do
the parsing and setting in any case. Python would make such things easier.
I would prefer to think more about this. Given that CLI parsing is a one-time
effort and we only need to change 2-3 classes to change any parameters, I
think we can live with it.
3. Now it is only for remote FS.
4. Basically, if the user specifies --checkpoint_dir on the command line, they
don't need to pass the same parameter to the launch_command of worker/ps
again. The user can say:
'--worker_launch_command "--ckpt-dir=%checkpoint_dir%"'
5. I'm not quite sure about the differences between them, could you explain?
6. Done.
7. Can we do it in a separate patch? I don't want to touch YARN code within
this patch.
8. It may not be the highest priority, given that resource profiles are not
easy to use. We can revisit this in the future.
9. That is right. We can have separate JIRAs for this, and we may need to
design the CLI options carefully so that YARN-specific parameters and
DL-related parameters are not mixed together.
10. That's right.
11. That's right, I think we can have a follow-up JIRA for that.
13. I think JobStatusBuilder#fromServiceState is for that, is that what you
meant?
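
To illustrate point 4, here is a rough sketch of how a %checkpoint_dir%
placeholder in a launch command could be expanded from the value given on the
command line. It is written in Python only for brevity and is not the code in
the patch: the function name expand_placeholders, the cli_values mapping, the
example HDFS path and the train.py command are all made up for illustration.

```python
import re


def expand_placeholders(launch_command, cli_values):
    """Replace %name% tokens in launch_command with values from cli_values.

    Hypothetical helper: tokens whose name is not in cli_values are left
    untouched so that an unknown placeholder is easy to spot later.
    """
    def substitute(match):
        name = match.group(1)
        # Fall back to the original "%name%" text when no value is known.
        return cli_values.get(name, match.group(0))

    return re.sub(r"%(\w+)%", substitute, launch_command)


# Assumed example values; the path is not taken from the patch.
cli_values = {"checkpoint_dir": "hdfs://default/tmp/ckpt"}
worker_cmd = expand_placeholders(
    "python train.py --ckpt-dir=%checkpoint_dir%", cli_values)
print(worker_cmd)
```

With this approach the user states the checkpoint directory once, and every
launch command that mentions the placeholder picks up the same value.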
> [Submarine] Add initial implementation: training job submission and job
> history retrieve.
> -----------------------------------------------------------------------------------------
>
> Key: YARN-8561
> URL: https://issues.apache.org/jira/browse/YARN-8561
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Priority: Major
> Attachments: YARN-8561.001.patch, YARN-8561.002.patch,
> YARN-8561.003.patch, YARN-8561.004.patch, YARN-8561.005.patch
>
>
> Added following parts:
> 1) New subcomponent of YARN, under applications/ project.
> 2) Tensorflow training job submission, including training (single node and
> distributed).
> - Supported Docker container.
> - Support GPU isolation.
> - Support YARN registry DNS.
> 3) Retrieve job history.