Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2024-03-08 Thread via GitHub


github-actions[bot] closed pull request #43936: [SPARK-46034][CORE] 
SparkContext add file should also copy file to local root path
URL: https://github.com/apache/spark/pull/43936


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2024-03-07 Thread via GitHub


github-actions[bot] commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1984826472

   We're closing this PR because it hasn't been updated in a while. This isn't 
a judgement on the merit of the PR in any way. It's just a way of keeping the 
PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to 
remove the Stale tag!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-27 Thread via GitHub


HyukjinKwon commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1829030542

   @AngersZh can you fix the PR description, and explain how this issue 
happens specifically in Yarn cluster mode? I still can't fully follow what 
where is the issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-27 Thread via GitHub


junyi1313 commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1829023594

   > The root cause is spark driver download file to it's `driverTempPath`, but 
didn't download to container's execution root path. So in yarn cluster mode, if 
we need to use the file in driver side, it throw FileNotFoundException. cc 
@HyukjinKwon @cloud-fan I have updated the code.
   
   I encounter the same issue too. I think this should be fixed asap.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-27 Thread via GitHub


tgravescs commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1828063684

   I have not used addFiles on yarn in a long time so can't speak to whether it 
got broken.   Generally speaking it's not recommended and user should pass 
files on submission.  Whatever happens here we need to make sure it works in 
both cluster and client modes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-27 Thread via GitHub


AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827485987

   I don't know why we add a `driverTmpDir`, remove `driverTmpDir` also can 
resolve this issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-27 Thread via GitHub


AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827474881

   > Could you please update the PR description? It looks overdue and 
inaccurate for me to catch up with.
   > 
   > Could you also provide the output for `LIST FILE` both before and after 
this? I guess we shall use its output at the caller-side, let's verify that 
they match.
   
   The result of `LIST FILES`, won't have impact since we update `addedFiles` 
with origin path
   ```
   spark-sql> add jar hdfs://path/search_hadoop_udf-1.0.0-SNAPSHOT.jar;
   Time taken: 3.055 seconds
   spark-sql> add file hdfs://path/feature_map.txt;
   spark-sql> list jars;
   hdfs://path/search_hadoop_udf-1.0.0-SNAPSHOT.jar
   Time taken: 0.443 seconds, Fetched 1 row(s)
   spark-sql> list files;
   hdfs://path/feature_map.txt
   Time taken: 0.428 seconds, Fetched 1 row(s)
   spark-sql>
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-26 Thread via GitHub


yaooqinn commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827051972

   Could you please update the PR description? It looks overdue and inaccurate 
for me to catch up with.
   
   Could you also provide the output for `LIST FILE`? I guess we shall use its 
output at the caller-side, let's verify that they match.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-26 Thread via GitHub


AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1827039250

   gentle ping @yaooqinn @cloud-fan @HyukjinKwon @tgravescs Could you take a 
look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-23 Thread via GitHub


AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1825190497

   > Can we fix `SparkFiles.get` at driver side when Yarn cluster is used?
   
   `SparkFiles.get()` don't have problem. 
   The problem is user use relative path to use the added file.
   But `SparkContext.addFile` add to driverTempDir
   ```
 def getRootDirectory(): String =
   SparkEnv.get.driverTmpDir.getOrElse(".")
   ```
   
   But executing use `.` path to get file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-23 Thread via GitHub


HyukjinKwon commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1825009537

   Can we fix `SparkFiles.get` at driver side when Yarn cluster is used?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-23 Thread via GitHub


AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1824131573

   Any more suggestion? cc @HyukjinKwon @cloud-fan 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-22 Thread via GitHub


AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401427353


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
   logInfo(s"Added file $path at $key with timestamp $timestamp")
   // Fetch the file locally so that closures which are run on the driver 
can still use the
   // SparkFiles API to access files.
-  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = false)
+  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = true)

Review Comment:
   Executor log when `updateDependencies`
   ```
   23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to 
/mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
   23/11/21 17:44:55 INFO Utils: Copying 
/mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache
 to 
/mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
   ```
   
   In executor side, pass `useCache = true` when is not local mode, then 
executor will fetch the file to cache then copy cache file to root dir with 
filename.
   
   For sparkcontext dirver, current code pass `useCache=false` only fetch file 
as  file temp
   ```
   23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file 
hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 
1700559593903
   23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching 
hdfs://path/feature_map.txt to 
/mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
   ```
   
   So the added file won't exist under root dir with it's filename.
   The code of `Utils.fetchFile()` as below
   https://github.com/apache/spark/assets/46485123/68f6e2f9-a6e2-493d-bd65-d7b2cc88fadd;>
   
   
   It's clear that executor is local should pass `useCache=false` since in 
local mode, it should use file fetched by sc.
   But current code, sc won't add this file with it's file name.
   
   So I think should be like
   
   1. SC add file should also copy file to root dir with the file name, then 
driver side also can get the file with file name then can run local task in 
driver
   2. For non-local mode executor will also update the dependencies and work 
well
   3. For local mode executor, it was started in driver process. It can use the 
file downloaded by `SC.addFile()`
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-22 Thread via GitHub


AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822478004

   ```
   Fetching 
hdfs://R2/projects/search_algo/hdfs/dev/typhoon.bo/uploader/ego_config/feature_map.txt
 to 
/mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/spark-32bf42ec-8de1-4f0e-aa60-def6d0f185c6/userFiles-61713d37-87ab-42d6-bdcd-065fa65ac8fc/fetchFileTemp8380676662376886133.tmp
   Moving 
/mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/spark-32bf42ec-8de1-4f0e-aa60-def6d0f185c6/userFiles-61713d37-87ab-42d6-bdcd-065fa65ac8fc/fetchFileTemp8380676662376886133.tmp
 to 
/mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/spark-32bf42ec-8de1-4f0e-aa60-def6d0f185c6/userFiles-61713d37-87ab-42d6-bdcd-065fa65ac8fc/feature_map.txt
   Fetching 
hdfs://R2/projects/search_algo/hdfs/dev/typhoon.bo/uploader/ego_config/feature_map.txt
 to 
/mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/container_e2006_1698132018785_8495989_01_01/./fetchFileTemp7654378954291310228.tmp
   Moving 
/mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/container_e2006_1698132018785_8495989_01_01/./fetchFileTemp7654378954291310228.tmp
 to 
/mnt/ssd/0/yarn/nm-local-dir/usercache/yi.zhu/appcache/application_1698132018785_8495989/container_e2006_1698132018785_8495989_01_01/./feature_map.txt
   ```
   
   Current code work well in when testing in our env.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-22 Thread via GitHub


AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822401850

   The root cause is spark driver download file to it's `driverTempPath`, but 
didn't download to container's execution root path. 
   So in yarn cluster mode, if we need to use the file in driver side, it throw 
FileNotFoundException. cc @HyukjinKwon @cloud-fan I have updated the code.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-22 Thread via GitHub


AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822369346

   > cc @mridulm or @tgravescs have you ever seen such things before?: 
`SparkContext.addFiles` adds a file with a temporary name, and you cannot get 
it with the original name from `SparkFiles.get(...)`
   
   I have checked, this is a mistake,  since it use move file logic, so there 
is no log.
   The true reason is as  
https://github.com/apache/spark/pull/43936#issuecomment-1822339813


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-22 Thread via GitHub


HyukjinKwon commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822356253

   cc @mridulm have you ever seen such things before?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-22 Thread via GitHub


AngersZh commented on PR #43936:
URL: https://github.com/apache/spark/pull/43936#issuecomment-1822339813

   Should be I make a mistake.
   
   In driver side, file was download and copy to path driverTempPath
   ```
   
/mnt/ssd/0/yarn/nm-local-dir/usercache/typhoon.bo/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
   ```
   
   but when execute the udf, it use container's path
   ```
   
/mnt/ssd/2/yarn/nm-local-dir/usercache/typhoon.bo/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_02/
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-21 Thread via GitHub


HyukjinKwon commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401465887


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
   logInfo(s"Added file $path at $key with timestamp $timestamp")
   // Fetch the file locally so that closures which are run on the driver 
can still use the
   // SparkFiles API to access files.
-  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = false)
+  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = true)

Review Comment:
   Can you point out which code we assign the temp name? `doFetchFile(url, 
targetDir, fileName, conf, hadoopConf)` should preserve the file name



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-21 Thread via GitHub


AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401448409


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1836,7 +1836,7 @@ class SparkContext(config: SparkConf) extends Logging {
   val uriToUse = if (!isLocal && scheme == "file") uri else new URI(key)
   val uriToDownload = UriBuilder.fromUri(uriToUse).fragment(null).build()
   val source = Utils.fetchFile(uriToDownload.toString, 
Utils.createTempDir(), conf,
-hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+hadoopConfiguration, timestamp, useCache = true, shouldUntar = false)

Review Comment:
   Seems already unpacked, remove this change



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-21 Thread via GitHub


AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401447578


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
   logInfo(s"Added file $path at $key with timestamp $timestamp")
   // Fetch the file locally so that closures which are run on the driver 
can still use the
   // SparkFiles API to access files.
-  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = false)
+  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = true)

Review Comment:
   Here is the log



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-21 Thread via GitHub


AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401447858


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1836,7 +1836,7 @@ class SparkContext(config: SparkConf) extends Logging {
   val uriToUse = if (!isLocal && scheme == "file") uri else new URI(key)
   val uriToDownload = UriBuilder.fromUri(uriToUse).fragment(null).build()
   val source = Utils.fetchFile(uriToDownload.toString, 
Utils.createTempDir(), conf,
-hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+hadoopConfiguration, timestamp, useCache = true, shouldUntar = false)

Review Comment:
   Let me check about this.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-21 Thread via GitHub


AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401447419


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
   logInfo(s"Added file $path at $key with timestamp $timestamp")
   // Fetch the file locally so that closures which are run on the driver 
can still use the
   // SparkFiles API to access files.
-  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = false)
+  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = true)

Review Comment:
   > So are you saying that it doesn't fetch the file to `root` directory?
   
   didn't copy as same filename



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-21 Thread via GitHub


HyukjinKwon commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401445511


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
   logInfo(s"Added file $path at $key with timestamp $timestamp")
   // Fetch the file locally so that closures which are run on the driver 
can still use the
   // SparkFiles API to access files.
-  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = false)
+  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = true)

Review Comment:
   So are you saying that it doesn't fetch the file to `root` directory?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-21 Thread via GitHub


HyukjinKwon commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401443129


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1836,7 +1836,7 @@ class SparkContext(config: SparkConf) extends Logging {
   val uriToUse = if (!isLocal && scheme == "file") uri else new URI(key)
   val uriToDownload = UriBuilder.fromUri(uriToUse).fragment(null).build()
   val source = Utils.fetchFile(uriToDownload.toString, 
Utils.createTempDir(), conf,
-hadoopConfiguration, timestamp, useCache = false, shouldUntar = false)
+hadoopConfiguration, timestamp, useCache = true, shouldUntar = false)

Review Comment:
   We don't need this archive file under root directory - the archive file 
itself is removed. we only need the unpacked directory



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-46034][CORE] SparkContext add file should also copy file to local root path [spark]

2023-11-21 Thread via GitHub


AngersZh commented on code in PR #43936:
URL: https://github.com/apache/spark/pull/43936#discussion_r1401427353


##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
   logInfo(s"Added file $path at $key with timestamp $timestamp")
   // Fetch the file locally so that closures which are run on the driver 
can still use the
   // SparkFiles API to access files.
-  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = false)
+  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = true)

Review Comment:
   Executor log when `updateDependencies`
   ```
   23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to 
/mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
   23/11/21 17:44:55 INFO Utils: Copying 
/mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache
 to 
/mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
   ```
   
   In executor side, pass `useCache = true` when is not local mode, then 
executor will fetch the file to cache then copy cache file to root dir with 
filename.
   
   For sparkcontext dirver, current code pass `useCache=false` only fetch file 
as  file temp
   ```
   23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file 
hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 
1700559593903
   23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching 
hdfs://path/feature_map.txt to 
/mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
   ```
   
   So the added file won't exist under root dir with it's filename.
   The code of `Utils.fetchFile()` as below
   https://github.com/apache/spark/assets/46485123/68f6e2f9-a6e2-493d-bd65-d7b2cc88fadd;>
   
   
   It's clear that executor is local should pass `useCache=false` since in 
local mode, it should use file fetched by sc.
   But current code, sc won't add this file with it's file name.
   
   So I think should be like
   
   1. SC add file should also copy file to root dir with the file name, then 
driver side also can get the file with file name then can run local task in 
driver
   2. For non-local mode executor will also update the dependencies and work 
well
   3. For local mode executor, it was started in driver process. It can use the 
file downloaded by `SC.addFile()`
   
   



##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -1822,7 +1822,7 @@ class SparkContext(config: SparkConf) extends Logging {
   logInfo(s"Added file $path at $key with timestamp $timestamp")
   // Fetch the file locally so that closures which are run on the driver 
can still use the
   // SparkFiles API to access files.
-  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = false)
+  Utils.fetchFile(uri.toString, root, conf, hadoopConfiguration, 
timestamp, useCache = true)

Review Comment:
   Executor log when `updateDependencies`
   ```
   23/11/21 17:44:55 INFO Utils: Fetching hdfs://path/feature_map.txt to 
/mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/fetchFileTemp5380393885914736245.tmp
   23/11/21 17:44:55 INFO Utils: Copying 
/mnt/ssd/2/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-e5d383fd-0064-44e8-850b-c2c1934a0ddf/-17061381181700559593903_cache
 to 
/mnt/ssd/1/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/container_e59_1698132018785_8173703_01_000683/./feature_map.txt
   ```
   
   In executor side, pass `useCache = true` when is not local mode, then 
executor will fetch the file to cache then copy cache file to root dir with 
filename.
   
   For sparkcontext dirver, current code pass `useCache=false` only fetch file 
as  file temp
   ```
   23/11/21 17:39:53 INFO [pool-3-thread-2] SparkContext: Added file 
hdfs://path/feature_map.txt at hdfs://path/feature_map.txt with timestamp 
1700559593903
   23/11/21 17:39:54 INFO [pool-3-thread-2] Utils: Fetching 
hdfs://path/feature_map.txt to 
/mnt/ssd/0/yarn/nm-local-dir/usercache/user/appcache/application_1698132018785_8173703/spark-21bedef6-1c5e-464e-9cb0-bb6903b3d84c/userFiles-a4929fdb-b634-4829-a7e3-00d82b0d521b/fetchFileTemp8739978227963911629.tmp
   ```
   
   So the added file won't exist under root dir with it's filename.
   The code of `Utils.fetchFile()`