Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
github-actions[bot] closed pull request #43936: [SPARK-46034][CORE] SparkContext add file should also copy file to local root path
URL: https://github.com/apache/spark/pull/43936

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
github-actions[bot] commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1984826472

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
HyukjinKwon commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1829030542

@AngersZh can you fix the PR description, and explain how this issue happens specifically in Yarn cluster mode? I still can't fully follow where the issue is.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
junyi1313 commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1829023594

> The root cause is that the Spark driver downloads the file to its `driverTempPath`, but doesn't download it to the container's execution root path. So in yarn cluster mode, if we need to use the file on the driver side, it throws a FileNotFoundException. cc @HyukjinKwon @cloud-fan I have updated the code.

I encountered the same issue. I think this should be fixed ASAP.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
tgravescs commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1828063684

I have not used addFiles on yarn in a long time, so I can't speak to whether it got broken. Generally speaking it's not recommended, and users should pass files on submission. Whatever happens here, we need to make sure it works in both cluster and client modes.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827485987

I don't know why we added a `driverTmpDir`; removing `driverTmpDir` can also resolve this issue.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827474881

> Could you please update the PR description? It looks outdated and inaccurate, which makes it hard for me to catch up.
>
> Could you also provide the output for `LIST FILE` both before and after this? I guess we shall use its output at the caller side; let's verify that they match.

The result of `LIST FILES` won't be affected, since we update `addedFiles` with the origin path:

```
spark-sql> add jar hdfs://path/search_hadoop_udf-1.0.0-SNAPSHOT.jar;
Time taken: 3.055 seconds
spark-sql> add file hdfs://path/feature_map.txt;
spark-sql> list jars;
hdfs://path/search_hadoop_udf-1.0.0-SNAPSHOT.jar
Time taken: 0.443 seconds, Fetched 1 row(s)
spark-sql> list files;
hdfs://path/feature_map.txt
Time taken: 0.428 seconds, Fetched 1 row(s)
spark-sql>
```
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
yaooqinn commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827051972

Could you please update the PR description? It looks outdated and inaccurate, which makes it hard for me to catch up.

Could you also provide the output for `LIST FILE`? I guess we shall use its output at the caller side; let's verify that they match.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827039250

Gentle ping @yaooqinn @cloud-fan @HyukjinKwon @tgravescs Could you take a look?
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1825190497

> Can we fix `SparkFiles.get` at driver side when Yarn cluster is used?

`SparkFiles.get()` doesn't have a problem. The problem is that users access the added file via a relative path, but `SparkContext.addFile` downloads it to the driver temp dir:

```scala
def getRootDirectory(): String = SparkEnv.get.driverTmpDir.getOrElse(".")
```

while execution uses the `.` path to get the file.
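(Editorial note: the mismatch above can be illustrated in plain Scala. The names below are made up for illustration, not Spark's actual classes; the point is only that the same relative file name resolves to different absolute paths depending on which root directory is in effect.)

```scala
import java.nio.file.Paths

// Hypothetical stand-in for SparkFiles.get: resolve an added file's name
// against a root directory. On the driver the root is the driver temp dir;
// on executors it is the current working directory ".".
def getFile(filename: String, rootDir: String): String =
  Paths.get(rootDir, filename).toString

// The same file name yields different paths on the driver vs. an executor:
val onDriver   = getFile("feature_map.txt", "/tmp/driverTmpDir")
val onExecutor = getFile("feature_map.txt", ".")
```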
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
HyukjinKwon commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1825009537

Can we fix `SparkFiles.get` at driver side when Yarn cluster is used?
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1824131573

Any more suggestions? cc @HyukjinKwon @cloud-fan
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401427353

## core/src/main/scala/org/apache/spark/SparkContext.scala:
```diff
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
     logInfo(s"Added file $path at $key with timestamp $timestamp")
     // Fetch the file locally so that closures which are run on the driver can still use the
     // SparkFiles API to access files.
-    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)
```

Review Comment: Executor log when `updateDependencies` runs:

```
23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
23/11/21 17:44:55 INFO Utils: Copying /mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache to /mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
```

On the executor side, we pass `useCache = true` when not in local mode; the executor then fetches the file into the cache and copies the cached file to the root dir under its filename.

For the SparkContext on the driver, the current code passes `useCache = false` and only fetches the file as a temp file:

```
23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 1700559593903
23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching hdfs://path/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
```

So the added file won't exist under the root dir with its filename. The code of `Utils.fetchFile()` is shown in this screenshot: https://github.com/apache/spark/assets/46485123/68f6e2f9-a6e2-493d-bd65-d7b2cc88fadd

It's clear that a local-mode executor should pass `useCache = false`, since in local mode it should use the file fetched by the SparkContext. But with the current code, the SparkContext won't add this file under its filename. So I think it should work like this:

1. `SparkContext.addFile` should also copy the file to the root dir under its filename; then the driver side can get the file by filename and run local tasks on the driver.
2. Non-local-mode executors will still update the dependencies and work well.
3. A local-mode executor is started in the driver process, so it can use the file downloaded by `SparkContext.addFile()`.
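(Editorial note: the "fetch to a temp file, then move it into the root dir under the source's original filename" behavior described in this comment can be sketched without Spark. This is a minimal illustration; `fetchToRoot` is a made-up helper, not Spark's `Utils.fetchFile`, and the actual download is stubbed out.)

```scala
import java.net.URI
import java.nio.file.{Files, Path, Paths, StandardCopyOption}

// Download (stubbed here) into a temp file inside rootDir, then move the
// temp file to rootDir under the source's original filename, which is what
// the useCache = true executor path effectively achieves.
def fetchToRoot(sourceUri: String, rootDir: Path): Path = {
  val fileName = Paths.get(new URI(sourceUri).getPath).getFileName.toString
  val tmp = Files.createTempFile(rootDir, "fetchFileTemp", ".tmp")
  // a real implementation would stream sourceUri's contents into tmp here
  val target = rootDir.resolve(fileName)
  Files.move(tmp, target, StandardCopyOption.REPLACE_EXISTING)
  target
}
```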
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822478004

```
Fetching hdfs://R2/projects/search_algo/hdfs/dev/typhoon.bo/uploader/ego_config/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/spark-32bf42ec-8de1-4f0e-aa60-def6d0f185c6/userFiles-61713d37-87ab-42d6-bdcd-065fa65ac8fc/fetchFileTemp8380676662376886133.tmp
Moving /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/spark-32bf42ec-8de1-4f0e-aa60-def6d0f185c6/userFiles-61713d37-87ab-42d6-bdcd-065fa65ac8fc/fetchFileTemp8380676662376886133.tmp to /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/spark-32bf42ec-8de1-4f0e-aa60-def6d0f185c6/userFiles-61713d37-87ab-42d6-bdcd-065fa65ac8fc/feature_map.txt
Fetching hdfs://R2/projects/search_algo/hdfs/dev/typhoon.bo/uploader/ego_config/feature_map.txt to /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/container_e2006_1698132018785_8495989_01_01/./fetchFileTemp7654378954291310228.tmp
Moving /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/container_e2006_1698132018785_8495989_01_01/./fetchFileTemp7654378954291310228.tmp to /mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/container_e2006_1698132018785_8495989_01_01/./feature_map.txt
```

The current code works well when tested in our environment.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822401850

The root cause is that the Spark driver downloads the file to its `driverTempPath`, but doesn't download it to the container's execution root path. So in yarn cluster mode, if we need to use the file on the driver side, it throws a FileNotFoundException. cc @HyukjinKwon @cloud-fan I have updated the code.
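(Editorial note: the root cause above can be reproduced without Spark at all. A plain-JVM sketch, with invented directory names, showing that a file placed only in one directory is invisible through the same name resolved against another.)

```scala
import java.nio.file.Files

// Simulate the driver temp dir and the YARN container's execution root.
val driverTmpDir  = Files.createTempDirectory("driverTmpDir")
val containerRoot = Files.createTempDirectory("containerRoot")

// addFile downloads only into the driver temp dir:
val written = Files.write(driverTmpDir.resolve("feature_map.txt"), "col1,col2".getBytes)

// Driver-side code that resolves the name against the container root finds
// nothing, which is the analogue of the FileNotFoundException above.
val visibleInContainer = Files.exists(containerRoot.resolve("feature_map.txt"))
```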
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822369346

> cc @mridulm or @tgravescs have you ever seen such things before?: `SparkContext.addFiles` adds a file with a temporary name, and you cannot get it with the original name from `SparkFiles.get(...)`

I have checked; that was a mistake on my part. Since it uses move-file logic, there is no such log. The true reason is described in https://github.com/apache/spark/pull/43936#issuecomment-1822339813
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
HyukjinKwon commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822356253

cc @mridulm have you ever seen such things before?
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822339813

I made a mistake. On the driver side, the file was downloaded and copied to the `driverTempPath`:

```
/mnt/ssd/0/yarn/nm-local-dir/usercache/typhoon.bo/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
```

but when executing the UDF, it uses the container's path:

```
/mnt/ssd/2/yarn/nm-local-dir/usercache/typhoon.bo/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_02/
```
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
HyukjinKwon commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401465887

## core/src/main/scala/org/apache/spark/SparkContext.scala:
```diff
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
     logInfo(s"Added file $path at $key with timestamp $timestamp")
     // Fetch the file locally so that closures which are run on the driver can still use the
     // SparkFiles API to access files.
-    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)
```

Review Comment: Can you point out which code assigns the temp name? `doFetchFile(url, targetDir, fileName, conf, hadoopConf)` should preserve the file name.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401448409

## core/src/main/scala/org/apache/spark/SparkContext.scala:
```diff
@@ -1836,7 +1836,7 @@ class SparkContext(config: SparkConf) extends Logging {
     val uriToUse = if (!isLocal && scheme == "file") uri else new URI(key)
     val uriToDownload = UriBuilder.fromUri(uriToUse).fragment(null).build()
     val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,
-      hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+      hadoopConfiguration, timestamp, useCache = true, shouldUntar = false)
```

Review Comment: Seems it's already unpacked; removing this change.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401447578

## core/src/main/scala/org/apache/spark/SparkContext.scala:
```diff
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
     logInfo(s"Added file $path at $key with timestamp $timestamp")
     // Fetch the file locally so that closures which are run on the driver can still use the
     // SparkFiles API to access files.
-    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)
```

Review Comment: Here is the log.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401447858

## core/src/main/scala/org/apache/spark/SparkContext.scala:
```diff
@@ -1836,7 +1836,7 @@ class SparkContext(config: SparkConf) extends Logging {
     val uriToUse = if (!isLocal && scheme == "file") uri else new URI(key)
     val uriToDownload = UriBuilder.fromUri(uriToUse).fragment(null).build()
     val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,
-      hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+      hadoopConfiguration, timestamp, useCache = true, shouldUntar = false)
```

Review Comment: Let me check on this.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401447419

## core/src/main/scala/org/apache/spark/SparkContext.scala:
```diff
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
     logInfo(s"Added file $path at $key with timestamp $timestamp")
     // Fetch the file locally so that closures which are run on the driver can still use the
     // SparkFiles API to access files.
-    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)
```

Review Comment:
> So are you saying that it doesn't fetch the file to the `root` directory?

It doesn't copy it under the same filename.
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
HyukjinKwon commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401445511

## core/src/main/scala/org/apache/spark/SparkContext.scala:
```diff
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
     logInfo(s"Added file $path at $key with timestamp $timestamp")
     // Fetch the file locally so that closures which are run on the driver can still use the
     // SparkFiles API to access files.
-    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = false)
+    Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, timestamp, useCache = true)
```

Review Comment: So are you saying that it doesn't fetch the file to the `root` directory?
Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]
HyukjinKwon commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401443129

## core/src/main/scala/org/apache/spark/SparkContext.scala:
```diff
@@ -1836,7 +1836,7 @@ class SparkContext(config: SparkConf) extends Logging {
     val uriToUse = if (!isLocal && scheme == "file") uri else new URI(key)
     val uriToDownload = UriBuilder.fromUri(uriToUse).fragment(null).build()
     val source = Utils.fetchFile(uriToDownload.toString, Utils.createTempDir(), conf,
-      hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+      hadoopConfiguration, timestamp, useCache = true, shouldUntar = false)
```

Review Comment: We don't need this archive file under the root directory; the archive file itself is removed. We only need the unpacked directory.