[
https://issues.apache.org/jira/browse/YARN-8725?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16621339#comment-16621339
]
Wangda Tan commented on YARN-8725:
----------------------------------
Thanks [~tangzhankun],
{quote}But the job failed due to invalid path passed to "--job-dir" per my
testing. It should be a URI start with "hdfs://".
{quote}
The issue is handled by YARN-8757.
{quote}... Because the user is better to not know so such details
{quote}
I think it is fine, since we don't expect the user to set it to that location
when checkpoint_path is specified manually.
{quote}Could you please elaborate on this?
{quote}
Sure. Launching worker/ps relies on localizing these scripts. In the default
case, the Submarine client exits as soon as the application is submitted to
YARN, which happens before the worker/ps launch. If we clean up the staging dir
right before the Submarine client exits, the ps/worker launch is very likely to
fail.
To handle cases such as app failure, it is better to have a server do the
cleanup than to add cleanup logic to the CLI.
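As a rough illustration of the server-side cleanup suggested above, the sketch
below sweeps a staging directory and deletes only the launch scripts that are
older than a grace period, so scripts that a pending worker/ps launch may still
need to localize are left alone. This is not code from the patch: the class name
StagingDirCleaner, the method cleanStaleScripts and the age-based rule are all
hypothetical, and java.nio.file on a local directory stands in for the HDFS
staging dir. A real server would more likely decide based on application state.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Submarine or YARN.
public class StagingDirCleaner {
    // Matches the randomly suffixed scripts from the issue description.
    static final String SCRIPT_GLOB = "PRIMARY_WORKER-launch-script*.sh";

    /**
     * Deletes launch scripts in stagingDir whose last-modified time is older
     * than (now - gracePeriod) and returns the deleted paths. Other files,
     * such as core-site.xml or job.info, are never touched.
     */
    static List<Path> cleanStaleScripts(Path stagingDir, Duration gracePeriod,
                                        Instant now) throws IOException {
        Instant cutoff = now.minus(gracePeriod);

        // Collect candidates first so deletion happens after the listing is closed.
        List<Path> stale = new ArrayList<>();
        try (DirectoryStream<Path> scripts =
                 Files.newDirectoryStream(stagingDir, SCRIPT_GLOB)) {
            for (Path script : scripts) {
                if (Files.getLastModifiedTime(script).toInstant().isBefore(cutoff)) {
                    stale.add(script);
                }
            }
        }

        for (Path script : stale) {
            Files.delete(script);
        }
        return stale;
    }
}
```

The grace period is the important knob here: it has to be longer than the time
between job submission and worker/ps localization, otherwise the sweep would
cause the same launch failure described above.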
> Submarine job staging directory has a lot of useless
> PRIMARY_WORKER-launch-script-***.sh scripts when submitting a job multiple
> times
> --------------------------------------------------------------------------------------------------------------------------------------
>
> Key: YARN-8725
> URL: https://issues.apache.org/jira/browse/YARN-8725
> Project: Hadoop YARN
> Issue Type: Sub-task
> Reporter: Zac Zhou
> Assignee: Zhankun Tang
> Priority: Major
> Attachments: YARN-8725-trunk.001.patch
>
>
> Submarine jobs upload core-site.xml, hdfs-site.xml, job.info and
> PRIMARY_WORKER-launch-script****.sh to staging dir.
> The core-site.xml, hdfs-site.xml and job.info would be overwritten if a job
> is submitted multiple times.
> But PRIMARY_WORKER-launch-script****.sh would not be overwritten, as it has
> random numbers in its name.
> The files in the staging dir are as follows:
> {code:java}
> -rw-r----- 2 hadoop hdfs 580 2018-08-17 10:11 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script6954941665090337726.sh
> -rw-r----- 2 hadoop hdfs 580 2018-08-17 10:02 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script7037369696166769734.sh
> -rw-r----- 2 hadoop hdfs 580 2018-08-17 10:06 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8047707294763488040.sh
> -rw-r----- 2 hadoop hdfs 15225 2018-08-17 18:46 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8122565781159446375.sh
> -rw-r----- 2 hadoop hdfs 580 2018-08-16 20:48 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script8598604480700049845.sh
> -rw-r----- 2 hadoop hdfs 580 2018-08-17 14:53 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script971703616848859353.sh
> -rw-r----- 2 hadoop hdfs 580 2018-08-17 10:16 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/PRIMARY_WORKER-launch-script990214235580089093.sh
> -rw-r----- 2 hadoop hdfs 8815 2018-08-27 15:54 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/core-site.xml
> -rw-r----- 2 hadoop hdfs 11583 2018-08-27 15:54 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/hdfs-site.xml
> -rw-rw-rw- 2 hadoop hdfs 846 2018-08-22 10:56 hdfs://submarine/user/hadoop/submarine/jobs/standlone-tf/staging/job.info
> {code}
>
> We should stop the staging dir from growing or have a way to clean it up