[jira] [Comment Edited] (YARN-8725) Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times

2018-09-20 Thread Zac Zhou (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622159#comment-16622159
 ] 

Zac Zhou edited comment on YARN-8725 at 9/20/18 2:51 PM:
-

[~leftnoteasy], [~tangzhankun]. Thanks a lot for your efforts in moving this 
sub-task forward.

I just found that some files in the local file system are not cleaned up 
either. 

I have opened a new ticket, YARN-8806, and submitted a patch. It would be nice 
if you could look into it as well.



> Submarine job staging directory has a lot of useless 
> PRIMARY_WORKER-launch-script-***.sh  scripts when submitting a job multiple 
> times
> --
>
> Key: YARN-8725
> URL: https://issues.apache.org/jira/browse/YARN-8725
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zac Zhou
> Assignee: Zhankun Tang
> Priority: Major
> Attachments: YARN-8725-trunk.001.patch
>
>
> Submarine jobs upload core-site.xml, hdfs-site.xml, job.info and 
> PRIMARY_WORKER-launch-script.sh to staging dir.
> The core-site.xml, hdfs-site.xml and job.info would be overwritten if a job 
> is submitted multiple times.
> But PRIMARY_WORKER-launch-script.sh would not be overwritten, as it has 
> random numbers in its name.
> The files in the staging dir are as follows:
> {code:java}
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:11 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script6954941665090337726.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:02 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script7037369696166769734.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:06 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8047707294763488040.sh
> -rw-r- 2 hadoop hdfs 15225 2018-08-17 18:46 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8122565781159446375.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-16 20:48 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8598604480700049845.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 14:53 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script971703616848859353.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:16 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script990214235580089093.sh
> -rw-r- 2 hadoop hdfs 8815 2018-08-27 15:54 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/core-site.xml
> -rw-r- 2 hadoop hdfs 11583 2018-08-27 15:54 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/hdfs-site.xml
> -rw-rw-rw- 2 hadoop hdfs 846 2018-08-22 10:56 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/job.info
> {code}
>  
> We should stop the staging dir from growing or have a way to clean it up
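
The file names in the listing look like the output of a temp-file API, which would explain why resubmissions never overwrite the script. The following is a minimal sketch of that effect, assuming the script is created with File.createTempFile (an assumption; the ticket does not show that code). The class and method names are hypothetical.

```java
import java.io.File;
import java.io.IOException;

// Hypothetical demo class; not part of Submarine.
class LaunchScriptNames {
    // Assumption: the launch script is created as a temp file, so the JDK
    // inserts a random number between the prefix and the ".sh" suffix.
    static File randomNamed() throws IOException {
        File f = File.createTempFile("PRIMARY_WORKER-launch-script", ".sh");
        f.deleteOnExit();
        return f;
    }

    // A fixed name: uploading it again to the same staging dir would
    // replace the previous copy instead of adding a new file.
    static String fixedName() {
        return "PRIMARY_WORKER-launch-script.sh";
    }

    public static void main(String[] args) throws IOException {
        System.out.println(randomNamed().getName()); // differs on every call
        System.out.println(randomNamed().getName());
        System.out.println(fixedName());             // always the same
    }
}
```

If the upload keeps the local file name, each submission adds one more file to the staging dir; a stable remote name or an explicit cleanup step would avoid the growth.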



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (YARN-8725) Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times

2018-09-19 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620834#comment-16620834
 ] 

Zhankun Tang edited comment on YARN-8725 at 9/19/18 4:52 PM:
-

[~leftnoteasy]
{quote}cleanup whole staging dir seems overkill because models, etc. by default 
is placed under the directory as well.
{quote}
Thanks for pointing out the model directory issue. This patch does seem to be 
overkill. Let's hold off on it.

I double-checked the relationship between "--checkpoint_path", 
"--saved_model_path" and "--input_path".

For "--checkpoint_path", there are two cases:

1. If it is set (in my case, "--checkpoint_path 
hdfs://default/user/yarn/cifar-10-jobdir"), it is safe to delete the 
staging dir, since the checkpoint dir that contains the model data is not in 
the staging dir.

2. If it is not set, "%checkpoint_path%" is replaced with 
"submarine/jobs/tf-job-001/staging/checkpoint_path" by the code below:

 
{code:java}
public static String replacePatternsInLaunchCommand(String specifiedCli,
    RunJobParameters jobRunParameters,
    RemoteDirectoryManager directoryManager) throws IOException {
  // Fall back to the job's default checkpoint dir when --checkpoint_path is not set.
  String jobDir = jobRunParameters.getCheckpointPath();
  if (null == jobDir) {
    jobDir = directoryManager.getJobCheckpointDir(jobRunParameters.getName(),
        true).toString();
  }

  String input = jobRunParameters.getInputPath();
  // The saved-model path defaults to the checkpoint dir.
  String savedModelDir = jobRunParameters.getSavedModelPath();
  if (null == savedModelDir) {
    savedModelDir = jobDir;
  }

  Map<String, String> replacePattern = new HashMap<>();
  if (jobDir != null) {
    replacePattern.put("%" + CliConstants.CHECKPOINT_PATH + "%", jobDir);
  }
  ...
  if (savedModelDir != null) {
    replacePattern.put("%" + CliConstants.SAVED_MODEL_PATH + "%",
        savedModelDir);
  }
{code}
 

 
{code:java}
2018-09-19 23:19:34,729 INFO yarnservice.YarnServiceJobSubmitter: Worker 
command =[cd /cifar10_estimator && python cifar10_main.py 
--data-dir=hdfs://default/user/yarn/cifar-10-data 
--job-dir=submarine/jobs/tf-job-001/staging/checkpoint_path --num-gpus=0 
--train-steps=2]{code}
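
To make the default-value behaviour easier to reproduce outside the cluster, here is a minimal, self-contained sketch of the substitution. It is not the Submarine implementation: the class, the method names and the defaultCheckpointDir helper are hypothetical stand-ins, and the default path format is copied from the log line above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration; not the Submarine code.
class LaunchCommandPatterns {
    // Hypothetical stand-in for the default checkpoint dir derived from the
    // job name. Note that the result is a relative path with no scheme.
    static String defaultCheckpointDir(String jobName) {
        return "submarine/jobs/" + jobName + "/staging/checkpoint_path";
    }

    static String replacePatterns(String cmd, String jobName,
                                  String inputPath, String checkpointPath) {
        // Use the explicit checkpoint path if given, else the default.
        String jobDir = (checkpointPath != null)
                ? checkpointPath : defaultCheckpointDir(jobName);

        Map<String, String> patterns = new LinkedHashMap<>();
        if (inputPath != null) {
            patterns.put("%input_path%", inputPath);
        }
        patterns.put("%checkpoint_path%", jobDir);

        // String.replace substitutes literally (no regex involved).
        String result = cmd;
        for (Map.Entry<String, String> e : patterns.entrySet()) {
            result = result.replace(e.getKey(), e.getValue());
        }
        return result;
    }
}
```

Calling replacePatterns with a null checkpoint path and the job name "tf-job-001" yields the same relative "--job-dir" value that appears in the worker command above.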
 

But in my testing the job failed because an invalid path was passed to 
"--job-dir". It should be a URI starting with "hdfs://".
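
One way to catch this before launching workers is to check whether the path carries a URI scheme. The sketch below uses only java.net.URI; the class and method names are hypothetical, and where such a check would live in Submarine is not decided here.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper; not part of Submarine.
class PathSchemeCheck {
    // Returns true when the path parses as a URI with an explicit scheme
    // such as "hdfs". Relative paths and unparsable strings return false.
    static boolean hasScheme(String path) {
        try {
            return new URI(path).getScheme() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(hasScheme("submarine/jobs/tf-job-001/staging/checkpoint_path"));
        System.out.println(hasScheme("hdfs://default/user/yarn/cifar-10-jobdir"));
    }
}
```

With such a check, the default checkpoint path could be rejected or qualified with the file system URI before it is substituted into the worker command.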

I have submitted a Jira to track this (YARN-8799). The script I used for this 
test is below:
{code:java}
yarn jar $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
 -verbose \
 -wait_job_finish \
 -keep_staging_dir \
 --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
 --name tf-job-001 \
 --docker_image tangzhankun/tensorflow \
 --input_path hdfs://default/user/yarn/cifar-10-data \
 --worker_resources memory=4G,vcores=2 \
 --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 --train-steps=2"
{code}
 

*I thought that if a user sets the "--checkpoint_path" option explicitly, it 
is unlikely that they would choose a path under the internal staging dir, such 
as "hdfs://default/user/yarn/submarine/jobs/tf-job-001/staging/checkpoint_path" 
(because the user should not need to know such details). On the other hand, 
leaving this option unset while still using the "%checkpoint_path%" pattern in 
the worker command seems strange to me.*

*So I assumed that checkpoint_path would be outside of the staging dir. I 
missed the script test above, and also the fact that the current code 
intentionally puts it under the staging dir by default. :)*

 

For "--input_path", it is a required option. There is nothing more to discuss, 
except that we should check for invalid values (YARN-8798).

For "--saved_model_path", it might have the same default-value issue (this 
needs more testing). But it is mainly for serving, so I won't discuss it here.

 
{quote}And logics in your patch cleans up dirs after job submitted. It is 
possible that workers get launched after dir got deleted.
{quote}
Could you please elaborate on this?
{quote}I'm not sure if we can do many meaningful things here in the client 
code. It might be better to do this in the server side, I don't have a clear 
idea about how to handle the service part. Maybe it should be a plugin of 
ApiServer, or it is a completely new service like a system service.
{quote}
Maybe let's talk about this offline.



[jira] [Comment Edited] (YARN-8725) Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times

2018-09-17 Thread Zhankun Tang (JIRA)


[ 
https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617600#comment-16617600
 ] 

Zhankun Tang edited comment on YARN-8725 at 9/17/18 2:36 PM:
-

Added a patch which does the following:
 # Adds a new option "--keep_staging_dir". It is false by default, so the 
staging directory is cleaned up after the job finishes.
 # Adds a unit test case using "MockRemoteDirectoryManager".
 # Changes existing unit tests (staging dir creation), because 
"cleanupStagingDir" needs a real directory in the local fs to work.

Please help review. [~wangda] [~sunilg] [~yuan_zac]
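
For readers without the patch at hand, the intended behaviour can be sketched roughly as follows. This is only an illustration on the local file system using java.nio; the actual patch works against the remote staging dir, and the class name and method signature here are hypothetical rather than taken from the patch.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical illustration of the "--keep_staging_dir" behaviour.
class StagingDirCleanup {
    // Deletes the staging dir recursively unless the user asked to keep it.
    // Returns true only if the directory was actually removed.
    static boolean cleanupStagingDir(Path stagingDir, boolean keepStagingDir)
            throws IOException {
        if (keepStagingDir || !Files.exists(stagingDir)) {
            return false;
        }
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(stagingDir)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
        return true;
    }
}
```

Like "cleanupStagingDir" in the unit test, this sketch needs a real directory to operate on, which is why a test for it has to create one first.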





