[GitHub] spark pull request #21420: [SPARK-24377][Spark Submit] make --py-files work ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/21420

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #21420: [SPARK-24377][Spark Submit] make --py-files work ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/21420#discussion_r191011438

Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

```diff
@@ -430,18 +430,15 @@ private[spark] class SparkSubmit extends Logging {
         // Usage: PythonAppRunner <main python file> [app arguments]
         args.mainClass = "org.apache.spark.deploy.PythonRunner"
         args.childArgs = ArrayBuffer(localPrimaryResource, localPyFiles) ++ args.childArgs
-        if (clusterManager != YARN) {
-          // The YARN backend distributes the primary file differently, so don't merge it.
-          args.files = mergeFileLists(args.files, args.primaryResource)
-        }
       }
       if (clusterManager != YARN) {
         // The YARN backend handles python files differently, so don't merge the lists.
         args.files = mergeFileLists(args.files, args.pyFiles)
       }
-      if (localPyFiles != null) {
+    }
+
+    if (localPyFiles != null) {
       sparkConf.set("spark.submit.pyFiles", localPyFiles)
```

End diff:

Looks indented too far now.
[GitHub] spark pull request #21420: [SPARK-24377][Spark Submit] make --py-files work ...
Github user vanzin commented on a diff in the pull request: https://github.com/apache/spark/pull/21420#discussion_r191011981

Diff: core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala

```diff
@@ -1093,6 +1097,44 @@ class SparkSubmitSuite
     assert(exception.getMessage() === "hello")
   }

+  test("support --py-files/spark.submit.pyFiles in non pyspark application") {
+    val hadoopConf = new Configuration()
+    updateConfWithFakeS3Fs(hadoopConf)
+
+    val tmpDir = Utils.createTempDir()
+    val pyFile = File.createTempFile("tmpPy", ".egg", tmpDir)
+
+    val args = Seq(
+      "--class", UserClasspathFirstTest.getClass.getName.stripPrefix("$"),
+      "--name", "testApp",
+      "--master", "yarn",
+      "--deploy-mode", "client",
+      "--py-files", s"s3a://${pyFile.getAbsolutePath}",
+      "spark-internal"
+    )
+
+    val appArgs = new SparkSubmitArguments(args)
+    val (_, _, conf, _) = submit.prepareSubmitEnvironment(appArgs, conf = Some(hadoopConf))
+
+    conf.get("spark.yarn.dist.pyFiles") should be (s"s3a://${pyFile.getAbsolutePath}")
+    conf.get("spark.submit.pyFiles") should (startWith("/"))
+
+    // Verify "spark.submit.pyFiles"
+    val args1 = Seq(
+      "--class", UserClasspathFirstTest.getClass.getName.stripPrefix("$"),
+      "--name", "testApp",
+      "--master", "yarn",
+      "--deploy-mode", "client",
+      "--conf", s"spark.submit.pyFiles=s3a://${pyFile.getAbsolutePath}",
+      "spark-internal"
+    )
+
+    val appArgs1 = new SparkSubmitArguments(args1)
+    val (_, _, conf1, _) = submit.prepareSubmitEnvironment(appArgs1, conf = Some(hadoopConf))
+
+    conf1.get("spark.yarn.dist.pyFiles") should be (s"s3a://${pyFile.getAbsolutePath}")
```

End diff:

use `PY_FILES.key`, also in other places.
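For readers unfamiliar with the suggestion: vanzin is asking the test to reference the config constant's `key` rather than repeating the `"spark.yarn.dist.pyFiles"` string literal. A minimal self-contained sketch of that pattern; the `ConfigEntry` stand-in and the `PY_FILES` constant here are simplified assumptions, not Spark's actual internal config classes:

```scala
// Simplified stand-in for Spark's config-entry pattern. In Spark the real
// entries live in the internal config objects; this only illustrates why
// `PY_FILES.key` is preferred over a repeated string literal in tests.
final case class ConfigEntry(key: String)

object ConfigSketch {
  // Hypothetical constant mirroring the string used by the test in the diff.
  val PY_FILES: ConfigEntry = ConfigEntry("spark.yarn.dist.pyFiles")

  def main(args: Array[String]): Unit = {
    // Instead of conf.get("spark.yarn.dist.pyFiles"), reference the constant:
    println(PY_FILES.key) // prints spark.yarn.dist.pyFiles
  }
}
```

Centralizing the key in one constant means a rename only has to happen in one place, and a typo in a test becomes a compile error instead of a silently passing lookup of the wrong key.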
[GitHub] spark pull request #21420: [SPARK-24377][Spark Submit] make --py-files work ...
Github user jerryshao commented on a diff in the pull request: https://github.com/apache/spark/pull/21420#discussion_r190783462

Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

```diff
@@ -430,18 +430,15 @@ private[spark] class SparkSubmit extends Logging {
         // Usage: PythonAppRunner <main python file> [app arguments]
         args.mainClass = "org.apache.spark.deploy.PythonRunner"
         args.childArgs = ArrayBuffer(localPrimaryResource, localPyFiles) ++ args.childArgs
-        if (clusterManager != YARN) {
-          // The YARN backend distributes the primary file differently, so don't merge it.
-          args.files = mergeFileLists(args.files, args.primaryResource)
```

End diff:

It duplicates the code below; you can check the original code.
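For context on the duplication being discussed: `mergeFileLists` appears to comma-join non-blank file lists into one distribution list. A minimal sketch of that contract, inferred from how the helper is called in the diff; this is an assumption, not Spark's actual implementation:

```scala
object MergeSketch {
  // Sketch of a mergeFileLists-style helper: comma-joins the non-blank
  // inputs and returns null when nothing remains. Mirrors the call sites
  // in the diff above (args.files = mergeFileLists(args.files, args.pyFiles)).
  def mergeFileLists(lists: String*): String = {
    val merged = lists
      .filter(s => s != null && s.trim.nonEmpty)
      .flatMap(_.split(","))
      .mkString(",")
    if (merged.isEmpty) null else merged
  }

  def main(args: Array[String]): Unit = {
    println(mergeFileLists("a.py,b.egg", null, "c.zip")) // prints a.py,b.egg,c.zip
  }
}
```

Because merging is idempotent only per input, calling it twice on the same resource (once for `args.primaryResource`, once via the block below) would list the file twice, which is the duplication jerryshao points out.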
[GitHub] spark pull request #21420: [SPARK-24377][Spark Submit] make --py-files work ...
Github user HyukjinKwon commented on a diff in the pull request: https://github.com/apache/spark/pull/21420#discussion_r190783213

Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala

```diff
@@ -430,18 +430,15 @@ private[spark] class SparkSubmit extends Logging {
         // Usage: PythonAppRunner <main python file> [app arguments]
         args.mainClass = "org.apache.spark.deploy.PythonRunner"
         args.childArgs = ArrayBuffer(localPrimaryResource, localPyFiles) ++ args.childArgs
-        if (clusterManager != YARN) {
-          // The YARN backend distributes the primary file differently, so don't merge it.
-          args.files = mergeFileLists(args.files, args.primaryResource)
```

End diff:

Eh @jerryshao why did we remove this?
[GitHub] spark pull request #21420: [SPARK-24377][Spark Submit] make --py-files work ...
GitHub user jerryshao opened a pull request: https://github.com/apache/spark/pull/21420

[SPARK-24377][Spark Submit] make --py-files work in non pyspark application

## What changes were proposed in this pull request?

Some Spark applications, although written as Java programs, require not only jar dependencies but also Python dependencies. One example is the Livy remote SparkContext application: it is an embedded REPL for Scala/Python/R, so it loads not only jar dependencies but also Python and R dependencies, which means specifying "--py-files" in addition to "--jars".

Currently "--py-files" only works for a PySpark application, so it does not cover the case above. This PR proposes to remove that restriction. We also found that "spark.submit.pyFiles" supports only a quite limited scenario (client mode with local deps), so this PR also expands "spark.submit.pyFiles" to be a full alternative to --py-files.

## How was this patch tested?

UT added.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jerryshao/apache-spark SPARK-24377

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/21420.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #21420

commit a41c99bf311aa8f4e0c2e07c1288f5a11e057ea4
Author: jerryshao
Date: 2018-05-24T06:53:23Z

    make --py-files work in non pyspark application
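To make the use case concrete, a hedged sketch of the kind of submission this PR enables; the class name, jar, and file names here are hypothetical, and this is a CLI fragment rather than a runnable example:

```shell
# Hypothetical: a Scala/Java main class that embeds a Python interpreter
# still needs its Python deps shipped. Before this PR, spark-submit only
# honored --py-files for PySpark applications.
spark-submit \
  --class com.example.EmbeddedReplApp \
  --master yarn \
  --deploy-mode client \
  --jars deps.jar \
  --py-files deps.egg,utils.zip \
  app.jar

# Equivalent via configuration, which this PR expands to cover the same cases:
spark-submit \
  --class com.example.EmbeddedReplApp \
  --master yarn \
  --deploy-mode client \
  --conf spark.submit.pyFiles=deps.egg,utils.zip \
  app.jar
```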