[
https://issues.apache.org/jira/browse/YARN-8561?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16571719#comment-16571719
]
Sunil Govindan edited comment on YARN-8561 at 8/7/18 3:00 PM:
--------------------------------------------------------------
Thanks [~leftnoteasy] for the effort. I have tried to look through the approach
and code.
A few comments, a mix of major and minor :)
1. I think we can use the same CLI model as the client, where the CLI extends
Configured and implements Tool. This helps with tests. It also avoids the
abstract run method, since run comes from Tool.
2. We could also stop a job from the CLI, correct? In that case, do we need to
do anything more than a simple yarn app -kill appId ?
3. I think we can use UnitsConversionUtil for unit conversion in
CliUtils#parseResourcesString.
4. In CapSchedConfig, for absolute resources, we used pattern-matching code.
{code}
public static final String PATTERN_FOR_ABSOLUTE_RESOURCE =
    "^\\[[\\w\\.,\\-_=\\ /]+\\]$";
private static final Pattern RESOURCE_PATTERN =
Pattern.compile(PATTERN_FOR_ABSOLUTE_RESOURCE);
{code}
Could we use the same in the CLI as well?
5. Maybe rename JobState to SubmarineJobState.
6. The command-line options look very clean and thorough. As we go forward,
more CLI options will be added and this will become more complex. Could we load
a profile into Submarine and have the profile supply about 80% of such config
items? Given a profile, the user might only need to fill in 1 or 2 variable
arguments.
7. DevelopperGuide.md ==> DeveloperGuide.md
8. In getServiceResourceFromYarnResource, I think we should get the resource
list from ResourceUtils. It might also be better to use a common client/server
util method to create the resource, something like
Resource.newInstance(yarnResource) or Resources.createResource(yarnResource).
9. In verbose or debug mode, YarnServiceJobSubmitter could perhaps dump all
contents of {{FileWriter fw}}.
10. It might be better to add a shutdown signal or interrupt signal to break
out of JobMonitor#waitTrainingFinal if the job is faulty.
11. In fromServiceState, the service state STOPPED is treated as
JobState.SUCCEEDED. Is that intended?
12. There is some commented-out code in JobStatusBuilder.
13. How could we increase the number of workers on a running job?
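
For item 4, here is a minimal sketch of what reusing the same pattern on the
CLI side could look like. The class name AbsoluteResourceCheck and the helper
isAbsoluteResource are hypothetical and not part of the patch or of
CapSchedConfig. Only the pattern string comes from the snippet quoted in item
4, and the space inside the character class is an assumption made when joining
the wrapped line.

```java
import java.util.regex.Pattern;

public class AbsoluteResourceCheck {
  // Pattern string as quoted in item 4 (wrapped line joined on a space,
  // which is an assumption).
  public static final String PATTERN_FOR_ABSOLUTE_RESOURCE =
      "^\\[[\\w\\.,\\-_=\\ /]+\\]$";
  private static final Pattern RESOURCE_PATTERN =
      Pattern.compile(PATTERN_FOR_ABSOLUTE_RESOURCE);

  // Hypothetical helper: true if the value looks like a bracketed
  // absolute-resource string such as [memory=4G,vcores=2].
  public static boolean isAbsoluteResource(String value) {
    return value != null && RESOURCE_PATTERN.matcher(value.trim()).matches();
  }

  public static void main(String[] args) {
    System.out.println(isAbsoluteResource("[memory=4G,vcores=2]")); // true
    System.out.println(isAbsoluteResource("memory=4G,vcores=2"));   // false
  }
}
```

This only validates the overall shape of the string. Parsing the individual
key=value pairs and converting units would still be a separate step.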
> [Submarine] Add initial implementation: training job submission and job
> history retrieve.
> -----------------------------------------------------------------------------------------
>
> Key: YARN-8561
> URL: https://issues.apache.org/jira/browse/YARN-8561
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Wangda Tan
> Assignee: Wangda Tan
> Priority: Major
> Attachments: YARN-8561.001.patch
>
>
> Added following parts:
> 1) New subcomponent of YARN, under applications/ project.
> 2) Tensorflow training job submission, including training (single node and
> distributed).
> - Supported Docker container.
> - Support GPU isolation.
> - Support YARN registry DNS.
> 3) Retrieve job history.