[jira] [Comment Edited] (YARN-8725) Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times
[ https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16622159#comment-16622159 ]

Zac Zhou edited comment on YARN-8725 at 9/20/18 2:51 PM:
---------------------------------------------------------

[~leftnoteasy], [~tangzhankun]. Thanks a lot for your efforts in moving this sub-task forward.

I just found that files on the local file system were not cleaned up either.

I have opened a new ticket, YARN-8806, and submitted a patch. It would be nice if you could look into it as well ~

was (Author: yuan_zac):
[~leftnoteasy], [~tangzhankun]. Thanks a lot for your efforts to make this sub task in progress.

I just found there were files in local file system, which were not clean up as well.

I just opened a new ticket [YARN-8806|https://issues.apache.org/jira/browse/YARN-8806] and submit a patch. it would be nice if you could look into it as well ~

> Submarine job staging directory has a lot of useless
> PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple
> times
> --------------------------------------------------------------------------
>
>                 Key: YARN-8725
>                 URL: https://issues.apache.org/jira/browse/YARN-8725
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Zac Zhou
>            Assignee: Zhankun Tang
>            Priority: Major
>         Attachments: YARN-8725-trunk.001.patch
>
>
> Submarine jobs upload core-site.xml, hdfs-site.xml, job.info and
> PRIMARY_WORKER-launch-script.sh to staging dir.
> The core-site.xml, hdfs-site.xml and job.info would be overwritten if a job
> is submitted multiple times.
> But PRIMARY_WORKER-launch-script.sh would not be overwritten, as it has
> random numbers in its name.
> The files in the staging dir are as follows:
> {code:java}
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:11 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script6954941665090337726.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:02 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script7037369696166769734.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:06 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8047707294763488040.sh
> -rw-r- 2 hadoop hdfs 15225 2018-08-17 18:46 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8122565781159446375.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-16 20:48 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8598604480700049845.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 14:53 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script971703616848859353.sh
> -rw-r- 2 hadoop hdfs 580 2018-08-17 10:16 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script990214235580089093.sh
> -rw-r- 2 hadoop hdfs 8815 2018-08-27 15:54 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/core-site.xml
> -rw-r- 2 hadoop hdfs 11583 2018-08-27 15:54 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/hdfs-site.xml
> -rw-rw-rw- 2 hadoop hdfs 846 2018-08-22 10:56 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/job.info
> {code}
>
> We should stop the staging dir from growing or have a way to clean it up

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: yarn-issues-h...@hadoop.apache.org
[jira] [Comment Edited] (YARN-8725) Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times
[ https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16620834#comment-16620834 ]

Zhankun Tang edited comment on YARN-8725 at 9/19/18 4:52 PM:
-------------------------------------------------------------

[~leftnoteasy]
{quote}cleanup whole staging dir seems overkill because models, etc. by default is placed under the directory as well.
{quote}
Thanks for pointing out the model directory. This patch does seem like overkill. Let's hold off on it.

I double-checked the relationship between "--checkpoint_path", "--saved_model_path" and "--input_path".

For "--checkpoint_path", there are two cases:

1. If it is set (in my case, "--checkpoint_path hdfs://default/user/yarn/cifar-10-jobdir"), it is safe to delete the staging dir, since the checkpoint dir that contains the model data is not in the staging dir.

2. If it is not set, "%checkpoint_path%" is replaced with "submarine/jobs/tf-job-001/staging/checkpoint_path" because of the code below:
{code:java}
public static String replacePatternsInLaunchCommand(String specifiedCli,
    RunJobParameters jobRunParameters,
    RemoteDirectoryManager directoryManager) throws IOException {
  // Fall back to the job's checkpoint dir when no checkpoint path is given.
  String jobDir = jobRunParameters.getCheckpointPath();
  if (null == jobDir) {
    jobDir = directoryManager.getJobCheckpointDir(jobRunParameters.getName(),
        true).toString();
  }
  String input = jobRunParameters.getInputPath();
  // The saved model dir defaults to the checkpoint dir.
  String savedModelDir = jobRunParameters.getSavedModelPath();
  if (null == savedModelDir) {
    savedModelDir = jobDir;
  }

  Map<String, String> replacePattern = new HashMap<>();
  if (jobDir != null) {
    replacePattern.put("%" + CliConstants.CHECKPOINT_PATH + "%", jobDir);
  }
  ...
  if (savedModelDir != null) {
    replacePattern.put("%" + CliConstants.SAVED_MODEL_PATH + "%", savedModelDir);
  }{code}
{code:java}
2018-09-19 23:19:34,729 INFO yarnservice.YarnServiceJobSubmitter: Worker command =[cd /cifar10_estimator && python cifar10_main.py --data-dir=hdfs://default/user/yarn/cifar-10-data --job-dir=submarine/jobs/tf-job-001/staging/checkpoint_path --num-gpus=0 --train-steps=2]{code}
But in my testing the job failed because an invalid path was passed to "--job-dir": it should be a URI starting with "hdfs://". I have filed YARN-8799 to track this. The script I used for this test is below.
{code:java}
yarn jar $HADOOP_BASE_DIR/home/share/hadoop/yarn/hadoop-yarn-submarine-3.2.0-SNAPSHOT.jar job run \
 -verbose \
 -wait_job_finish \
 -keep_staging_dir \
 --env DOCKER_JAVA_HOME=/usr/lib/jvm/java-8-oracle \
 --env DOCKER_HADOOP_HDFS_HOME=/hadoop-3.2.0-SNAPSHOT \
 --name tf-job-001 \
 --docker_image tangzhankun/tensorflow \
 --input_path hdfs://default/user/yarn/cifar-10-data \
 --worker_resources memory=4G,vcores=2 \
 --worker_launch_cmd "cd /cifar10_estimator && python cifar10_main.py --data-dir=%input_path% --job-dir=%checkpoint_path% --num-gpus=0 --train-steps=2"{code}
*I thought that if a user sets the "--checkpoint_path" option explicitly, it is unlikely that they would choose a path under the internal staging dir, such as "hdfs://default/user/yarn/submarine/jobs/tf-job-001/staging/checkpoint_path" (the user should not need to know such details). On the other hand, leaving the option unset while still using the "%checkpoint_path%" pattern in the worker command seems strange to me.*

*So I assumed that checkpoint_path would be placed outside the staging dir. I missed the script test above and the fact that the current code puts it under the staging dir by default. :)*

For "--input_path", it is a required option. There is nothing more to discuss, except that we should check for invalid values (YARN-8798).
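
The fallback behaviour discussed above can be reproduced in isolation. The following is only a minimal sketch, not the Submarine implementation: the class `LaunchCommandPatterns`, its methods, and the "%saved_model_path%" placeholder string are hypothetical names chosen for illustration, and the default checkpoint directory is passed in by the caller rather than looked up through RemoteDirectoryManager. It shows how an unset checkpoint path ends up as a relative path in the worker command, and how a simple scheme check would flag it.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration; not part of Submarine. */
final class LaunchCommandPatterns {
  private LaunchCommandPatterns() {}

  /**
   * Replaces %input_path%, %checkpoint_path% and %saved_model_path% in the
   * command. An unset checkpoint path falls back to defaultCheckpointDir,
   * and an unset saved model path falls back to the checkpoint path.
   */
  static String resolve(String command, String inputPath, String checkpointPath,
      String savedModelPath, String defaultCheckpointDir) {
    String jobDir = checkpointPath != null ? checkpointPath : defaultCheckpointDir;
    String savedModelDir = savedModelPath != null ? savedModelPath : jobDir;

    Map<String, String> patterns = new HashMap<>();
    if (jobDir != null) {
      patterns.put("%checkpoint_path%", jobDir);
    }
    if (inputPath != null) {
      patterns.put("%input_path%", inputPath);
    }
    if (savedModelDir != null) {
      patterns.put("%saved_model_path%", savedModelDir); // assumed placeholder name
    }

    String resolved = command;
    for (Map.Entry<String, String> pattern : patterns.entrySet()) {
      resolved = resolved.replace(pattern.getKey(), pattern.getValue());
    }
    return resolved;
  }

  /** Returns true only when the path parses as a URI with the hdfs scheme. */
  static boolean isHdfsUri(String path) {
    try {
      return "hdfs".equalsIgnoreCase(URI.create(path).getScheme());
    } catch (IllegalArgumentException e) {
      return false; // not a syntactically valid URI
    }
  }
}
```

With the values from the log above, the resolved "--job-dir" is the relative path "submarine/jobs/tf-job-001/staging/checkpoint_path", for which `isHdfsUri` returns false, whereas it returns true for "hdfs://default/user/yarn/cifar-10-jobdir".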
For "--saved_model_path", it might have the same default-value issue (needs more tests). But it is mainly for serving, so I won't discuss it here.
{quote}And logics in your patch cleans up dirs after job submitted. It is possible that workers get launched after dir got deleted.
{quote}
Could you please elaborate on this?
{quote}I'm not sure if we can do many meaningful things here in the client code. It might be better to do this in the server side, I don't have a clear idea about how to handle the service part. Maybe it should be a plugin of ApiServer, or it is a completely new service like a system service.
{quote}
Maybe let's talk about this offline.

was (Author: tangzhankun):
[~leftnoteasy]
{quote}cleanup whole staging dir seems overkill because models, etc. by default is placed under the directory as well.
{quote}
Thanks for pointing out this model directory stuff. This patch seems overkill. Let's hold on it.

I double-checked the relations between "--checkpoint_path", "--saved_model_path" and "--input_path".

For the "--checkpoint_path", the situations are as below:

1. if set, in my case I set this "--checkpoint_path hdfs://default/user/yarn/cifar-10-jobdir". So it'll be safe to delete the staging dir since the checkout_dir that contains model data is not in staging dir.

2. if not
[jira] [Comment Edited] (YARN-8725) Submarine job staging directory has a lot of useless PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple times
[ https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16617600#comment-16617600 ]

Zhankun Tang edited comment on YARN-8725 at 9/17/18 2:36 PM:
-------------------------------------------------------------

Added a patch which does the following:
 # Adds a new option "--keep_staging_dir". It is false by default, so the staging directory is cleaned up after the job finishes.
 # Adds a unit test case using "MockRemoteDirectoryManager".
 # Changes the existing unit tests (staging dir creation), because "cleanupStagingDir" needs a real directory in the local fs to work.

Please help review. [~wangda] [~sunilg] [~yuan_zac]

was (Author: tangzhankun):
Added a patch which does following:
 # add a new option "--keep_staging_dir". It's false by default so that we'll clean up the staging directory after job finish
 # added unit test case through "MockRemoteDirectoryManager".
 # Changes(staging dir creation) to existing unit test due to the need for a real directory in local fs for "cleanupStagingDir" to work

Please help review. [~wangda] [~sunilg] [~yuan_zac]
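
Since this thread raises the concern that checkpoints and models may live under the staging directory, one narrower option is to delete only the stale launch scripts and leave everything else in place. The following is a minimal sketch under stated assumptions, not the attached patch: `StaleLaunchScriptCleaner` and `deleteLaunchScripts` are hypothetical names, and `java.nio.file` on a local directory stands in for the HDFS staging directory, so a real implementation would go through the Hadoop FileSystem API instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Submarine or the attached patch. */
final class StaleLaunchScriptCleaner {
  // Glob matching the randomly suffixed launch scripts from the issue listing.
  static final String LAUNCH_SCRIPT_GLOB = "PRIMARY_WORKER-launch-script*.sh";

  private StaleLaunchScriptCleaner() {}

  /**
   * Deletes the launch scripts directly under stagingDir and returns the
   * names of the deleted files. Config files, job.info and subdirectories
   * such as a checkpoint directory are left untouched.
   */
  static List<String> deleteLaunchScripts(Path stagingDir) throws IOException {
    // Collect matches first, then delete, so the listing is closed before
    // the directory is modified.
    List<Path> scripts = new ArrayList<>();
    try (DirectoryStream<Path> stream =
        Files.newDirectoryStream(stagingDir, LAUNCH_SCRIPT_GLOB)) {
      for (Path candidate : stream) {
        if (Files.isRegularFile(candidate)) {
          scripts.add(candidate);
        }
      }
    }

    List<String> deleted = new ArrayList<>();
    for (Path script : scripts) {
      Files.delete(script);
      deleted.add(script.getFileName().toString());
    }
    return deleted;
  }
}
```

A cleanup like this would still have to run only once no worker needs the script, which is the ordering concern quoted earlier in the thread.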