[GitHub] spark pull request #15131: [SPARK-17577][SparkR][Core] SparkR support add fi...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/15131 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79843763

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
--- End diff --

ok thanks!
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79755939

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
--- End diff --

I agree with @HyukjinKwon. Thanks!
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79743521

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
--- End diff --

I just checked the documentation and tests. It seems Windows paths are being tested in https://github.com/apache/hadoop/blob/branch-2.7.2/hadoop-common-project/hadoop-common/src/test/java/org/apache/hadoop/fs/TestPath.java. As `toString` is based on the `URI` instance inside `Path`, which `toUri` returns directly, I guess it'd be safe. I could not find an explicit mention of Windows paths in the documentation: https://hadoop.apache.org/docs/stable2/api/index.html
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79735513

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
--- End diff --

Meanwhile, we could alternatively use `Utils.resolveURI`, which we already test within Spark. However, that util does not seem to handle the `C:/a/b/c` case (as opposed to `C:\a\b\c`, which we should fix), so I suggested `new Path(...).toUri` instead. If you feel strongly about it, we could use that. I will try to find and share the docs and tests for `Path` once I get home, though.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79732092

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1457,8 +1457,8 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   logInfo(s"Added file $path at $key with timestamp $timestamp")
   // Fetch the file locally so that closures which are run on the driver can still use the
   // SparkFiles API to access files.
-  Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
-    hadoopConfiguration, timestamp, useCache = false)
+  Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
--- End diff --

In my personal opinion (I thought about this for a while), it's okay, because `Utils.fetchFile` expects its first argument to be a URL, not a path. So this might be a valid correction even without tests, since we fixed the argument to what the function wants in the first place. But no strong opinion.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79731142

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
--- End diff --

I do understand your concern, @felixcheung. However, IMHO, it'd be okay not to test the Hadoop library within Spark. I will try to find some tests/documentation related to Windows paths and share them to make sure. FWIW, this case was verified for Windows paths by one of the committers before, so it should be okay.
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79723200

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
--- End diff --

should there be some tests we can add for this change?
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79723221

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1457,8 +1457,8 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
   logInfo(s"Added file $path at $key with timestamp $timestamp")
   // Fetch the file locally so that closures which are run on the driver can still use the
   // SparkFiles API to access files.
-  Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
-    hadoopConfiguration, timestamp, useCache = false)
+  Utils.fetchFile(uri.toString, new File(SparkFiles.getRootDirectory()), conf,
--- End diff --

should there be some tests we can add for this change?
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79723039

--- Diff: R/pkg/R/context.R ---
@@ -225,6 +225,51 @@ setCheckpointDir <- function(sc, dirName) {
   invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(dirName
 }

+#' Add a file or directory to be downloaded with this Spark job on every node.
+#'
+#' The path passed can be either a local file, a file in HDFS (or other Hadoop-supported
+#' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
+#' use spark.getSparkFiles(fileName) to find its download location.
+#'
+#' @rdname spark.addFile
+#' @param path The path of the file to be added
+#' @examples
+#'\dontrun{
--- End diff --

add @export
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79722919

--- Diff: R/pkg/R/context.R ---
@@ -225,6 +225,51 @@ setCheckpointDir <- function(sc, dirName) {
   invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(dirName
 }

+#' Add a file or directory to be downloaded with this Spark job on every node.
+#'
+#' The path passed can be either a local file, a file in HDFS (or other Hadoop-supported
+#' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
+#' use spark.getSparkFiles(fileName) to find its download location.
+#'
+#' @rdname spark.addFile
+#' @param path The path of the file to be added
+#' @examples
+#'\dontrun{
+#' spark.addFile("~/myfile")
+#'}
+#' @note spark.addFile since 2.1.0
+spark.addFile <- function(path) {
+  sc <- getSparkContext()
+  invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path
+}
+
+#' Get the root directory that contains files added through spark.addFile.
+#'
+#' @rdname spark.getSparkFilesRootDirectory
+#' @return the root directory that contains files added through spark.addFile
+#' @examples
+#'\dontrun{
+#' spark.getSparkFilesRootDirectory()
+#'}
+#' @note spark.getSparkFilesRootDirectory since 2.1.0
+spark.getSparkFilesRootDirectory <- function() {
+  callJStatic("org.apache.spark.SparkFiles", "getRootDirectory")
+}
+
+#' Get the absolute path of a file added through spark.addFile.
+#'
+#' @rdname spark.getSparkFiles
+#' @param fileName The name of the file added through spark.addFile
+#' @return the absolute path of a file added through spark.addFile.
+#' @examples
--- End diff --

add @export
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79722962

--- Diff: R/pkg/R/context.R ---
@@ -225,6 +225,51 @@ setCheckpointDir <- function(sc, dirName) {
   invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(dirName
 }

+#' Add a file or directory to be downloaded with this Spark job on every node.
+#'
+#' The path passed can be either a local file, a file in HDFS (or other Hadoop-supported
+#' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
+#' use spark.getSparkFiles(fileName) to find its download location.
+#'
+#' @rdname spark.addFile
+#' @param path The path of the file to be added
+#' @examples
+#'\dontrun{
+#' spark.addFile("~/myfile")
+#'}
+#' @note spark.addFile since 2.1.0
+spark.addFile <- function(path) {
+  sc <- getSparkContext()
+  invisible(callJMethod(sc, "addFile", suppressWarnings(normalizePath(path
+}
+
+#' Get the root directory that contains files added through spark.addFile.
+#'
+#' @rdname spark.getSparkFilesRootDirectory
+#' @return the root directory that contains files added through spark.addFile
+#' @examples
--- End diff --

add @export
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79307283

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
   val schemeCorrectedPath = uri.getScheme match {
     case null | "local" => new File(path).getCanonicalFile.toURI.toString
     case _ => path
--- End diff --

Updated, thanks!
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79306433

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
   val schemeCorrectedPath = uri.getScheme match {
     case null | "local" => new File(path).getCanonicalFile.toURI.toString
     case _ => path
--- End diff --

@yanboliang Yeap, it passes the tests at least.
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79305494

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
   val schemeCorrectedPath = uri.getScheme match {
     case null | "local" => new File(path).getCanonicalFile.toURI.toString
     case _ => path
--- End diff --

Just in case, I ran the tests after manually fixing this. Maybe we can wait for the result: https://ci.appveyor.com/project/HyukjinKwon/spark/build/108-pr-15131-path
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79305450

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,7 +1426,7 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
+  val uri = new Path(path).toUri
   val schemeCorrectedPath = uri.getScheme match {
     case null | "local" => new File(path).getCanonicalFile.toURI.toString
     case _ => path
--- End diff --

In `Utils.fetchFile(path, ...)` below, it seems we can't pass `path` as it is, because `new URI(path)` is called internally, which fails to parse Windows paths. Could I please ask you to change this to `uri.toString`? It'd work fine as far as I know.

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

scala> val a = new Path("C:\\a\\b\\c").toUri
a: java.net.URI = /C:/a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = /C:/a/b/c

scala> val a = new Path("C:/a/b/c").toUri
a: java.net.URI = /C:/a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = /C:/a/b/c

scala> val a = new Path("/a/b/c").toUri
a: java.net.URI = /a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = /a/b/c

scala> val a = new Path("file:///a/b/c").toUri
a: java.net.URI = file:///a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = file:///a/b/c

scala> val a = new Path("http://localhost/a/b/c").toUri
a: java.net.URI = http://localhost/a/b/c

scala> val b = new URI(a.toString)
b: java.net.URI = http://localhost/a/b/c
```
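As an aside, the failure mode discussed in this comment can be reproduced with the plain JDK. The sketch below (not Spark or Hadoop code; the class name is illustrative) shows why raw `java.net.URI` is fragile for Windows-style paths: forward-slash drive paths are mis-parsed and backslash paths do not parse at all, while the `/C:/a/b/c` form that `new Path(...).toUri` produces is handled cleanly.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Illustrative JDK-only demo of the Windows-path pitfalls discussed above.
public class WindowsPathUri {
    public static void main(String[] args) {
        // "C:/a/b/c" parses, but "C:" is mistaken for a URI scheme.
        URI forwardSlashes = URI.create("C:/a/b/c");
        System.out.println(forwardSlashes.getScheme()); // prints "C", not null

        // "C:\a\b\c" does not parse at all: '\' is illegal in a URI.
        try {
            new URI("C:\\a\\b\\c");
            System.out.println("parsed");
        } catch (URISyntaxException e) {
            System.out.println("URISyntaxException"); // this branch is taken
        }

        // The normalized form that Hadoop's Path produces ("/C:/a/b/c")
        // parses as a scheme-less URI with the drive letter in the path.
        URI normalized = URI.create("/C:/a/b/c");
        System.out.println(normalized.getScheme()); // prints "null"
        System.out.println(normalized.getPath());   // prints "/C:/a/b/c"
    }
}
```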
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79303180

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,41 +1426,53 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
-  val schemeCorrectedPath = uri.getScheme match {
-    case null | "local" => new File(path).getCanonicalFile.toURI.toString
-    case _ => path
-  }
-
-  val hadoopPath = new Path(schemeCorrectedPath)
-  val scheme = new URI(schemeCorrectedPath).getScheme
-  if (!Array("http", "https", "ftp").contains(scheme)) {
-    val fs = hadoopPath.getFileSystem(hadoopConfiguration)
-    val isDir = fs.getFileStatus(hadoopPath).isDirectory
-    if (!isLocal && scheme == "file" && isDir) {
-      throw new SparkException(s"addFile does not support local directories when not running " +
-        "local mode.")
+  if (path == null) {
+    logWarning("null specified as parameter to addFile")
+  } else {
+    var key = ""
+    if (path.contains("\\")) {
+      // For local paths with backslashes on Windows, URI throws an exception
+      key = env.rpcEnv.fileServer.addFile(new File(path))
+    } else {
+      val uri = new URI(path)
+      val schemeCorrectedPath = uri.getScheme match {
+        case null | "local" => new File(path).getCanonicalFile.toURI.toString
+        case _ => path
+      }
+
+      val hadoopPath = new Path(schemeCorrectedPath)
+      val scheme = new URI(schemeCorrectedPath).getScheme
+      if (!Array("http", "https", "ftp").contains(scheme)) {
+        val fs = hadoopPath.getFileSystem(hadoopConfiguration)
+        val isDir = fs.getFileStatus(hadoopPath).isDirectory
+        if (!isLocal && scheme == "file" && isDir) {
+          throw new SparkException(s"addFile does not support local directories when not " +
+            "running local mode.")
+        }
+        if (!recursive && isDir) {
+          throw new SparkException(s"Added file $hadoopPath is a directory and recursive " +
+            "is not turned on.")
+        }
+      }
+
+      key = if (!isLocal && scheme == "file") {
+        env.rpcEnv.fileServer.addFile(new File(uri.getPath))
+      } else {
+        schemeCorrectedPath
+      }
     }
-  if (!recursive && isDir) {
-    throw new SparkException(s"Added file $hadoopPath is a directory and recursive is not " +
-      "turned on.")
+    if (key != null) {
+      val timestamp = System.currentTimeMillis
+      if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
+        logInfo(s"Added file $path at $key with timestamp $timestamp")
+        // Fetch the file locally so that closures which are run on the driver can still use the
+        // SparkFiles API to access files.
+        Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
--- End diff --

@HyukjinKwon Thanks for your suggestion, will update it soon.
Github user felixcheung commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79303139

--- Diff: R/pkg/R/context.R ---
@@ -225,6 +225,42 @@ setCheckpointDir <- function(sc, dirName) {
   invisible(callJMethod(sc, "setCheckpointDir", suppressWarnings(normalizePath(dirName
 }

+#' Add a file or directory to be downloaded with this Spark job on every node.
+#'
+#' The path passed can be either a local file, a file in HDFS (or other Hadoop-supported
+#' filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
+#' use sparkFiles.get(fileName) to find its download location.
+#'
+#' A directory can be given if the recursive option is set to true.
+#' Currently directories are only supported for Hadoop-supported filesystems.
+#'
+#' @param path The path of the file to be added
+#' @param recursive Recursive or not if the path is directory. Default is FALSE.
--- End diff --

this param is not in below?
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79302542

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,41 +1426,53 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
-  val schemeCorrectedPath = uri.getScheme match {
-    case null | "local" => new File(path).getCanonicalFile.toURI.toString
-    case _ => path
-  }
-
-  val hadoopPath = new Path(schemeCorrectedPath)
-  val scheme = new URI(schemeCorrectedPath).getScheme
-  if (!Array("http", "https", "ftp").contains(scheme)) {
-    val fs = hadoopPath.getFileSystem(hadoopConfiguration)
-    val isDir = fs.getFileStatus(hadoopPath).isDirectory
-    if (!isLocal && scheme == "file" && isDir) {
-      throw new SparkException(s"addFile does not support local directories when not running " +
-        "local mode.")
+  if (path == null) {
+    logWarning("null specified as parameter to addFile")
+  } else {
+    var key = ""
+    if (path.contains("\\")) {
+      // For local paths with backslashes on Windows, URI throws an exception
+      key = env.rpcEnv.fileServer.addFile(new File(path))
+    } else {
+      val uri = new URI(path)
+      val schemeCorrectedPath = uri.getScheme match {
+        case null | "local" => new File(path).getCanonicalFile.toURI.toString
+        case _ => path
+      }
+
+      val hadoopPath = new Path(schemeCorrectedPath)
+      val scheme = new URI(schemeCorrectedPath).getScheme
+      if (!Array("http", "https", "ftp").contains(scheme)) {
+        val fs = hadoopPath.getFileSystem(hadoopConfiguration)
+        val isDir = fs.getFileStatus(hadoopPath).isDirectory
+        if (!isLocal && scheme == "file" && isDir) {
+          throw new SparkException(s"addFile does not support local directories when not " +
+            "running local mode.")
+        }
+        if (!recursive && isDir) {
+          throw new SparkException(s"Added file $hadoopPath is a directory and recursive " +
+            "is not turned on.")
+        }
+      }
+
+      key = if (!isLocal && scheme == "file") {
+        env.rpcEnv.fileServer.addFile(new File(uri.getPath))
+      } else {
+        schemeCorrectedPath
+      }
     }
-  if (!recursive && isDir) {
-    throw new SparkException(s"Added file $hadoopPath is a directory and recursive is not " +
-      "turned on.")
+    if (key != null) {
+      val timestamp = System.currentTimeMillis
+      if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
+        logInfo(s"Added file $path at $key with timestamp $timestamp")
+        // Fetch the file locally so that closures which are run on the driver can still use the
+        // SparkFiles API to access files.
+        Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
--- End diff --

(@yanboliang BTW, I don't mind triggering builds manually. Please feel free to submit more commits for test purposes if you will.)
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/15131#discussion_r79302302

--- Diff: core/src/main/scala/org/apache/spark/SparkContext.scala ---
@@ -1426,41 +1426,53 @@ class SparkContext(config: SparkConf) extends Logging with ExecutorAllocationCli
  * supported for Hadoop-supported filesystems.
  */
 def addFile(path: String, recursive: Boolean): Unit = {
-  val uri = new URI(path)
-  val schemeCorrectedPath = uri.getScheme match {
-    case null | "local" => new File(path).getCanonicalFile.toURI.toString
-    case _ => path
-  }
-
-  val hadoopPath = new Path(schemeCorrectedPath)
-  val scheme = new URI(schemeCorrectedPath).getScheme
-  if (!Array("http", "https", "ftp").contains(scheme)) {
-    val fs = hadoopPath.getFileSystem(hadoopConfiguration)
-    val isDir = fs.getFileStatus(hadoopPath).isDirectory
-    if (!isLocal && scheme == "file" && isDir) {
-      throw new SparkException(s"addFile does not support local directories when not running " +
-        "local mode.")
+  if (path == null) {
+    logWarning("null specified as parameter to addFile")
+  } else {
+    var key = ""
+    if (path.contains("\\")) {
+      // For local paths with backslashes on Windows, URI throws an exception
+      key = env.rpcEnv.fileServer.addFile(new File(path))
+    } else {
+      val uri = new URI(path)
+      val schemeCorrectedPath = uri.getScheme match {
+        case null | "local" => new File(path).getCanonicalFile.toURI.toString
+        case _ => path
+      }
+
+      val hadoopPath = new Path(schemeCorrectedPath)
+      val scheme = new URI(schemeCorrectedPath).getScheme
+      if (!Array("http", "https", "ftp").contains(scheme)) {
+        val fs = hadoopPath.getFileSystem(hadoopConfiguration)
+        val isDir = fs.getFileStatus(hadoopPath).isDirectory
+        if (!isLocal && scheme == "file" && isDir) {
+          throw new SparkException(s"addFile does not support local directories when not " +
+            "running local mode.")
+        }
+        if (!recursive && isDir) {
+          throw new SparkException(s"Added file $hadoopPath is a directory and recursive " +
+            "is not turned on.")
+        }
+      }
+
+      key = if (!isLocal && scheme == "file") {
+        env.rpcEnv.fileServer.addFile(new File(uri.getPath))
+      } else {
+        schemeCorrectedPath
+      }
    }
-  if (!recursive && isDir) {
-    throw new SparkException(s"Added file $hadoopPath is a directory and recursive is not " +
-      "turned on.")
+    if (key != null) {
+      val timestamp = System.currentTimeMillis
+      if (addedFiles.putIfAbsent(key, timestamp).isEmpty) {
+        logInfo(s"Added file $path at $key with timestamp $timestamp")
+        // Fetch the file locally so that closures which are run on the driver can still use the
+        // SparkFiles API to access files.
+        Utils.fetchFile(path, new File(SparkFiles.getRootDirectory()), conf, env.securityManager,
--- End diff --

Ah, sorry, I should've checked the code further. It seems `Utils.fetchFile` expects a URL, not a path, as its first argument. How about changing `val uri = new URI(path)` to `val uri = new Path(path).toUri` rather than resembling `addJar`? I remember we discussed this problem in https://github.com/apache/spark/pull/14960. If comma-separated multiple paths were allowed in `path` it would be problematic, but it seems only a single path is allowed.

```scala
import java.net.URI
import org.apache.hadoop.fs.Path

scala> new Path("C:\\a\\b\\c").toUri
res1: java.net.URI = /C:/a/b/c

scala> new Path("C:/a/b/c").toUri
res2: java.net.URI = /C:/a/b/c

scala> new Path("/a/b/c").toUri
res3: java.net.URI = /a/b/c

scala> new Path("file:///a/b/c").toUri
res4: java.net.URI = file:///a/b/c

scala> new Path("http://localhost/a/b/c").toUri
res5: java.net.URI = http://localhost/a/b/c
```
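The round-trip that this suggestion relies on can be checked with the plain JDK, no Hadoop needed. The sketch below (illustrative, not Spark code) feeds the normalized strings shown in the REPL session above back through `new URI(...)`, which is what `Utils.fetchFile` does internally, and confirms each one re-parses unchanged.

```java
import java.net.URI;

// Verify that the normalized URI strings produced by `new Path(...).toUri`
// re-parse cleanly and round-trip through java.net.URI unchanged.
public class UriRoundTrip {
    public static void main(String[] args) throws Exception {
        String[] normalized = {
            "/C:/a/b/c",              // Windows drive path, normalized form
            "/a/b/c",                 // plain absolute path
            "file:///a/b/c",          // file scheme
            "http://localhost/a/b/c"  // http scheme
        };
        for (String s : normalized) {
            URI u = new URI(s); // would throw URISyntaxException if invalid
            System.out.println(s + " round-trips: " + u.toString().equals(s));
        }
    }
}
```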