[GitHub] liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249990090

File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala

```diff
@@ -71,17 +75,64 @@ private[spark] object KubernetesUtils extends Logging {
    * - File URIs with scheme local:// resolve to just the path of the URI.
    * - Otherwise, the URIs are returned as-is.
    */
-  def resolveFileUrisAndPath(fileUris: Iterable[String]): Iterable[String] = {
+  def resolveFileUrisAndPath(fileUris: Iterable[String], conf: Option[SparkConf] = None)
+    : Iterable[String] = {
     fileUris.map { uri =>
-      resolveFileUri(uri)
+      resolveFileUri(uri, conf)
     }
   }
 
-  def resolveFileUri(uri: String): String = {
+  /**
+   * Get the final path for a client file, if not return the uri as is.
+   *
+   */
+  def getDestPathIfClientFile(uri: String, conf: SparkConf): String = {
+    val fileUri = Utils.resolveURI(uri)
+    val fileScheme = Option(fileUri.getScheme).getOrElse("file")
+    if (fileScheme == "client") {
+      if (conf.get(KUBERNETES_FILE_UPLOAD_PATH).isDefined) {
+        val uploadPath = conf.get(KUBERNETES_FILE_UPLOAD_PATH).get
+        s"${uploadPath}/${fileUri.getPath.split("/").last}"
```

Review comment:
Please document this clearly, i.e., all client-side dependencies will be uploaded to the given path with a flat directory structure.

---
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249989440

File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

```diff
@@ -373,6 +407,49 @@ private[spark] class SparkSubmit extends Logging {
     localPyFiles = Option(args.pyFiles).map {
       downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr)
     }.orNull
+
+    if (isKubernetesClient &&
+      sparkConf.getBoolean("spark.kubernetes.submitInDriver", false)) {
```

Review comment:
Got it.
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249117347

File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala

```diff
@@ -71,17 +75,64 @@ private[spark] object KubernetesUtils extends Logging {
    * - File URIs with scheme local:// resolve to just the path of the URI.
    * - Otherwise, the URIs are returned as-is.
    */
-  def resolveFileUrisAndPath(fileUris: Iterable[String]): Iterable[String] = {
+  def resolveFileUrisAndPath(fileUris: Iterable[String], conf: Option[SparkConf] = None)
+    : Iterable[String] = {
     fileUris.map { uri =>
-      resolveFileUri(uri)
+      resolveFileUri(uri, conf)
     }
   }
 
-  def resolveFileUri(uri: String): String = {
+  /**
+   * Get the final path for a client file, if not return the uri as is.
+   *
+   */
+  def getDestPathIfClientFile(uri: String, conf: SparkConf): String = {
+    val fileUri = Utils.resolveURI(uri)
+    val fileScheme = Option(fileUri.getScheme).getOrElse("file")
+    if (fileScheme == "client") {
+      if (conf.get(KUBERNETES_FILE_UPLOAD_PATH).isDefined) {
+        val uploadPath = conf.get(KUBERNETES_FILE_UPLOAD_PATH).get
+        s"${uploadPath}/${fileUri.getPath.split("/").last}"
+      } else {
+        throw new SparkException("Please specify " +
+          "spark.kubernetes.file.upload.path property...")
+      }
+    } else {
+      uri
+    }
+  }
+
+  /**
+   * Resolves a uri according to its scheme. If scheme is client
+   * then uploads the file to the HCFS.
+   */
+  def resolveFileUri(uri: String, conf: Option[SparkConf] = None): String = {
     val fileUri = Utils.resolveURI(uri)
     val fileScheme = Option(fileUri.getScheme).getOrElse("file")
     fileScheme match {
       case "local" => fileUri.getPath
+      case KUBERNETES_FILE_UPLOAD_SCHEME =>
+        conf match {
+          case Some(sConf) =>
+            if (sConf.get(KUBERNETES_FILE_UPLOAD_PATH).isDefined) {
+              try {
+                val hadoopConf = SparkHadoopUtil.get.newConfiguration(sConf)
+                val uploadPath = sConf.get(KUBERNETES_FILE_UPLOAD_PATH).get
+                val fs = getHadoopFileSystem(Utils.resolveURI(uploadPath), hadoopConf)
+                val storePath = new Path(s"${uploadPath}/${fileUri.getPath.split("/").last}")
+                log.info(s"Uploading file: ${fileUri.getPath}...")
```

Review comment:
We should also mention the destination path in the log message.
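As a sketch of the suggestion, the log line could interpolate both the source file and the resolved destination. The helper below is hypothetical and Spark-free: `uploadLogLine` and the sample `uploadPath` value are illustrative stand-ins, not code from the PR.

```scala
// Hypothetical sketch: build the upload log message so it names both the
// source file and the destination path, as the review suggests.
// uploadPath stands in for the value of spark.kubernetes.file.upload.path.
def uploadLogLine(srcPath: String, uploadPath: String): String = {
  // Flat layout, as in the diff: only the last path segment is kept.
  val storePath = s"$uploadPath/${srcPath.split("/").last}"
  s"Uploading file: $srcPath to dest: $storePath..."
}

println(uploadLogLine("/opt/jobs/app.jar", "hdfs://nn:8020/spark-uploads"))
```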
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249120565

File path: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

```diff
@@ -373,6 +407,49 @@ private[spark] class SparkSubmit extends Logging {
     localPyFiles = Option(args.pyFiles).map {
       downloadFileList(_, targetDir, sparkConf, hadoopConf, secMgr)
     }.orNull
+
+    if (isKubernetesClient &&
+      sparkConf.getBoolean("spark.kubernetes.submitInDriver", false)) {
```

Review comment:
What's the purpose of checking `spark.kubernetes.submitInDriver`, which AFAIK is used to indicate cluster mode?
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249114166

File path: core/src/main/scala/org/apache/spark/util/Utils.scala

```diff
@@ -714,6 +714,24 @@ private[spark] object Utils extends Logging {
     }
   }
 
+  /**
+   * Upload a file to a Hadoop-compatible filesystem.
+   *
+   */
+  def uploadHcfsFile(
```

Review comment:
https://wiki.apache.org/hadoop/HCFS, which seems to have become obsolete.
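The `uploadHcfsFile` helper under review copies a client-local file to a Hadoop-compatible filesystem. Since Hadoop is not available here, the sketch below mimics the same copy semantics with `java.nio.file`; treat it as a shape illustration only, with `uploadFile` as a hypothetical stand-in for the real helper, which would go through `org.apache.hadoop.fs.FileSystem`.

```scala
import java.nio.file.{Files, Path, StandardCopyOption}

// Hypothetical stand-in for uploadHcfsFile: copy a source file into a target
// directory, keeping only its file name (flat structure), and return the
// destination path. java.nio is used here only because Hadoop is not on the
// classpath; it is not what the PR uses.
def uploadFile(src: Path, targetDir: Path): Path = {
  Files.createDirectories(targetDir)
  val dest = targetDir.resolve(src.getFileName)
  Files.copy(src, dest, StandardCopyOption.REPLACE_EXISTING)
  dest
}

val tmp = Files.createTempDirectory("hcfs-demo")
val src = Files.write(tmp.resolve("app.jar"), "payload".getBytes("UTF-8"))
val dest = uploadFile(src, tmp.resolve("uploads"))
println(dest.getFileName)
```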
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249118870

File path: docs/running-on-kubernetes.md

```diff
@@ -1010,6 +1024,15 @@ See the below table for the full list of pod specifications that will be overwri
 Spark will add additional labels specified by the spark configuration.
+
```

Review comment:
This should be added to the table under the section `Spark Properties`.
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249114851

File path: docs/running-on-kubernetes.md

````diff
@@ -193,8 +193,23 @@ If your application's dependencies are all hosted in remote locations like HDFS
 by their appropriate remote URIs. Also, application dependencies can be pre-mounted into custom-built Docker images.
 Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. The `local://` scheme is also required when referring to
-dependencies in custom-built Docker images in `spark-submit`. Note that using application dependencies from the submission
-client's local file system is currently not yet supported.
+dependencies in custom-built Docker images in `spark-submit`. We support dependencies from the submission
+client's local file system using the scheme `client://` where the destination should be a Hadoop compatibile fs.
+A typical example of this using S3 is via passing the following options:
+
+```
+...
+--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6
+--conf spark.hadoop.fs.s3a.access.key=...
+--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
+--conf spark.hadoop.fs.s3a.fast.upload=true
+--conf spark.hadoop.fs.s3a.secret.key=
+  client:///full/path/to/app.jar
+```
````

Review comment:
How does the submission client know that the user's intention is to upload to S3 instead of, say, an HDFS cluster? I don't think this can be determined with 100% certainty based only on the presence of those `s3a` options.
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249118375

File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala

```diff
@@ -71,17 +75,64 @@ private[spark] object KubernetesUtils extends Logging {
    * - File URIs with scheme local:// resolve to just the path of the URI.
    * - Otherwise, the URIs are returned as-is.
    */
-  def resolveFileUrisAndPath(fileUris: Iterable[String]): Iterable[String] = {
+  def resolveFileUrisAndPath(fileUris: Iterable[String], conf: Option[SparkConf] = None)
+    : Iterable[String] = {
     fileUris.map { uri =>
-      resolveFileUri(uri)
+      resolveFileUri(uri, conf)
     }
   }
 
-  def resolveFileUri(uri: String): String = {
+  /**
+   * Get the final path for a client file, if not return the uri as is.
+   *
```

Review comment:
This empty comment line can be removed.
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249115464

File path: docs/running-on-kubernetes.md

````diff
@@ -193,8 +193,23 @@ If your application's dependencies are all hosted in remote locations like HDFS
 by their appropriate remote URIs. Also, application dependencies can be pre-mounted into custom-built Docker images.
 Those dependencies can be added to the classpath by referencing them with `local://` URIs and/or setting the
 `SPARK_EXTRA_CLASSPATH` environment variable in your Dockerfiles. The `local://` scheme is also required when referring to
-dependencies in custom-built Docker images in `spark-submit`. Note that using application dependencies from the submission
-client's local file system is currently not yet supported.
+dependencies in custom-built Docker images in `spark-submit`. We support dependencies from the submission
+client's local file system using the scheme `client://` where the destination should be a Hadoop compatibile fs.
+A typical example of this using S3 is via passing the following options:
+
+```
+...
+--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.6
+--conf spark.hadoop.fs.s3a.access.key=...
+--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
+--conf spark.hadoop.fs.s3a.fast.upload=true
+--conf spark.hadoop.fs.s3a.secret.key=
+  client:///full/path/to/app.jar
+```
````

Review comment:
I saw you have `spark.kubernetes.file.upload.path` below, which should also be added here as an example.
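A hedged sketch of how the doc example might look once `spark.kubernetes.file.upload.path` is added as the review asks. The bucket name is a placeholder, and the surrounding `s3a` options are assumed to be the ones already shown in the diff:

```shell
# Sketch only -- placeholder bucket/path values, not taken from the PR.
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem
--conf spark.kubernetes.file.upload.path=s3a://<your-bucket>/spark-uploads
  client:///full/path/to/app.jar
```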
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249115818

File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala

```diff
@@ -289,6 +289,12 @@ private[spark] object Config extends Logging {
       .booleanConf
       .createWithDefault(true)
 
+  val KUBERNETES_FILE_UPLOAD_PATH =
+    ConfigBuilder("spark.kubernetes.file.upload.path")
+      .doc("HCFS path to upload files to, using the client scheme:// in cluster mode.")
```

Review comment:
`HCFS path where files with the client:// scheme will be uploaded to in cluster mode.`
liyinan926 commented on a change in pull request #23546: [SPARK-23153][K8s] Support client dependencies with a HCFS
URL: https://github.com/apache/spark/pull/23546#discussion_r249116345

File path: resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/KubernetesUtils.scala

```diff
@@ -71,17 +75,64 @@ private[spark] object KubernetesUtils extends Logging {
    * - File URIs with scheme local:// resolve to just the path of the URI.
    * - Otherwise, the URIs are returned as-is.
    */
-  def resolveFileUrisAndPath(fileUris: Iterable[String]): Iterable[String] = {
+  def resolveFileUrisAndPath(fileUris: Iterable[String], conf: Option[SparkConf] = None)
+    : Iterable[String] = {
     fileUris.map { uri =>
-      resolveFileUri(uri)
+      resolveFileUri(uri, conf)
     }
   }
 
-  def resolveFileUri(uri: String): String = {
+  /**
+   * Get the final path for a client file, if not return the uri as is.
+   *
+   */
+  def getDestPathIfClientFile(uri: String, conf: SparkConf): String = {
+    val fileUri = Utils.resolveURI(uri)
+    val fileScheme = Option(fileUri.getScheme).getOrElse("file")
+    if (fileScheme == "client") {
+      if (conf.get(KUBERNETES_FILE_UPLOAD_PATH).isDefined) {
+        val uploadPath = conf.get(KUBERNETES_FILE_UPLOAD_PATH).get
+        s"${uploadPath}/${fileUri.getPath.split("/").last}"
```

Review comment:
So a file `client://path/to/app1.jar` will be uploaded to `${uploadPath}/app1.jar`? What if two client-local files at different local paths have the same file name?
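The collision the review describes can be made concrete: with a flat destination layout, two different client-local paths that share a file name resolve to the same destination, so the second upload would overwrite the first. A minimal Spark-free sketch, where `destFor` is a hypothetical helper mirroring the path logic in the diff:

```scala
import java.net.URI

// Hypothetical helper mirroring the flat layout in the diff: only the last
// segment of the client:// path survives, so distinct local paths with the
// same file name collide at the destination.
def destFor(uri: String, uploadPath: String): String =
  s"$uploadPath/${new URI(uri).getPath.split("/").last}"

val a = destFor("client:///jobs/team-a/app1.jar", "hdfs://nn/uploads")
val b = destFor("client:///jobs/team-b/app1.jar", "hdfs://nn/uploads")
println(s"$a collides with second upload: ${a == b}")
```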