[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-07-06 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/18235


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-07-06 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r125828051
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -520,7 +520,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
         |                              (Default: client).
         |  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
         |  --name NAME                 A name of your application.
-        |  --jars JARS                 Comma-separated list of local jars to include on the driver
--- End diff --

Thanks @cloud-fan for your review. Support for remote jars was addressed in 
#18078, but that change left some code and docs that still need to be updated, 
and it does not support downloading jars over HTTP(S) or FTP; this PR adds 
that support as well.

So basically this PR addresses the remaining issues from #18078 and adds glob 
support.
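
To illustrate, glob expansion over a comma-separated resource list looks 
roughly like this (a sketch only; `expandGlobs` and the pass-through behavior 
for non-matching patterns are illustrative, not the exact patch, and remote 
schemes such as http/ftp would be special-cased before reaching this point):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

// Expand each comma-separated entry via Hadoop's glob API; entries whose
// pattern matches nothing are kept as-is.
def expandGlobs(paths: String, hadoopConf: Configuration): String = {
  paths.split(",").map(_.trim).filter(_.nonEmpty).flatMap { entry =>
    val path = new Path(entry)
    val fs = path.getFileSystem(hadoopConf)
    Option(fs.globStatus(path)) match {
      case Some(statuses) if statuses.nonEmpty => statuses.map(_.getPath.toString).toSeq
      case _ => Seq(entry)
    }
  }.mkString(",")
}
```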


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r125825762
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -520,7 +520,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
         |                              (Default: client).
         |  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
         |  --name NAME                 A name of your application.
-        |  --jars JARS                 Comma-separated list of local jars to include on the driver
--- End diff --

and we should also add a test case for it.


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-07-06 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r125825030
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala ---
@@ -520,7 +520,7 @@ private[deploy] class SparkSubmitArguments(args: Seq[String], env: Map[String, S
         |                              (Default: client).
         |  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
         |  --name NAME                 A name of your application.
-        |  --jars JARS                 Comma-separated list of local jars to include on the driver
--- End diff --

This is a new feature (support for non-local jars); shall we create a separate 
JIRA ticket?


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-29 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r124950848
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -841,36 +845,132 @@ object SparkSubmit extends CommandLineUtils {
    * Download a list of remote files to temp local files. If the file is local, the original file
    * will be returned.
    * @param fileList A comma separated file list.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
    * @return A comma separated local files list.
    */
   private[deploy] def downloadFileList(
       fileList: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
       hadoopConf: HadoopConfiguration): String = {
     require(fileList != null, "fileList cannot be null.")
-    fileList.split(",").map(downloadFile(_, hadoopConf)).mkString(",")
+    fileList.split(",")
+      .map(downloadFile(_, targetDir, sparkProperties, hadoopConf))
+      .mkString(",")
   }

   /**
    * Download a file from the remote to a local temporary directory. If the input path points to
    * a local path, returns it with no operation.
+   * @param path A file path from where the files will be downloaded.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
+   * @return A comma separated local files list.
    */
-  private[deploy] def downloadFile(path: String, hadoopConf: HadoopConfiguration): String = {
+  private[deploy] def downloadFile(
+      path: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
+      hadoopConf: HadoopConfiguration): String = {
     require(path != null, "path cannot be null.")
     val uri = Utils.resolveURI(path)
     uri.getScheme match {
-      case "file" | "local" =>
-        path
+      case "file" | "local" => path
+      case "http" | "https" | "ftp" =>
+        val uc = uri.toURL.openConnection()
+        uc match {
+          case https: HttpsURLConnection =>
+            val trustStore = sparkProperties.get("spark.ssl.fs.trustStore")
+              .orElse(sparkProperties.get("spark.ssl.trustStore"))
--- End diff --

Makes sense, let's keep it this way.


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-29 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r124949735
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -841,36 +845,132 @@ object SparkSubmit extends CommandLineUtils {
    * Download a list of remote files to temp local files. If the file is local, the original file
    * will be returned.
    * @param fileList A comma separated file list.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
    * @return A comma separated local files list.
    */
   private[deploy] def downloadFileList(
       fileList: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
       hadoopConf: HadoopConfiguration): String = {
     require(fileList != null, "fileList cannot be null.")
-    fileList.split(",").map(downloadFile(_, hadoopConf)).mkString(",")
+    fileList.split(",")
+      .map(downloadFile(_, targetDir, sparkProperties, hadoopConf))
+      .mkString(",")
   }

   /**
    * Download a file from the remote to a local temporary directory. If the input path points to
    * a local path, returns it with no operation.
+   * @param path A file path from where the files will be downloaded.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
+   * @return A comma separated local files list.
    */
-  private[deploy] def downloadFile(path: String, hadoopConf: HadoopConfiguration): String = {
+  private[deploy] def downloadFile(
+      path: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
+      hadoopConf: HadoopConfiguration): String = {
     require(path != null, "path cannot be null.")
     val uri = Utils.resolveURI(path)
     uri.getScheme match {
-      case "file" | "local" =>
-        path
+      case "file" | "local" => path
+      case "http" | "https" | "ftp" =>
+        val uc = uri.toURL.openConnection()
+        uc match {
+          case https: HttpsURLConnection =>
+            val trustStore = sparkProperties.get("spark.ssl.fs.trustStore")
+              .orElse(sparkProperties.get("spark.ssl.trustStore"))
--- End diff --

These are dynamic configurations, namespaced per component 
(`spark.ssl.${ns}.xxx`), so it is not easy to leverage `internal/config` for 
them. Let me think about it.
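
For readers of the thread, the namespaced fallback under discussion looks 
roughly like this (a sketch; the helper name `sslOption` is illustrative):

```scala
// Look up an SSL option for the "fs" namespace first, then fall back to the
// global spark.ssl.* key, mirroring how spark.ssl.${ns}.xxx overrides work.
def sslOption(sparkProperties: Map[String, String], key: String): Option[String] =
  sparkProperties.get(s"spark.ssl.fs.$key")
    .orElse(sparkProperties.get(s"spark.ssl.$key"))

// sslOption(props, "trustStore") checks "spark.ssl.fs.trustStore",
// then "spark.ssl.trustStore".
```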


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-29 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r124949549
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -841,36 +845,132 @@ object SparkSubmit extends CommandLineUtils {
    * Download a list of remote files to temp local files. If the file is local, the original file
    * will be returned.
    * @param fileList A comma separated file list.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
    * @return A comma separated local files list.
    */
   private[deploy] def downloadFileList(
       fileList: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
       hadoopConf: HadoopConfiguration): String = {
     require(fileList != null, "fileList cannot be null.")
-    fileList.split(",").map(downloadFile(_, hadoopConf)).mkString(",")
+    fileList.split(",")
+      .map(downloadFile(_, targetDir, sparkProperties, hadoopConf))
+      .mkString(",")
   }

   /**
    * Download a file from the remote to a local temporary directory. If the input path points to
    * a local path, returns it with no operation.
+   * @param path A file path from where the files will be downloaded.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
+   * @return A comma separated local files list.
    */
-  private[deploy] def downloadFile(path: String, hadoopConf: HadoopConfiguration): String = {
+  private[deploy] def downloadFile(
+      path: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
+      hadoopConf: HadoopConfiguration): String = {
     require(path != null, "path cannot be null.")
     val uri = Utils.resolveURI(path)
     uri.getScheme match {
-      case "file" | "local" =>
-        path
+      case "file" | "local" => path
+      case "http" | "https" | "ftp" =>
+        val uc = uri.toURL.openConnection()
+        uc match {
+          case https: HttpsURLConnection =>
+            val trustStore = sparkProperties.get("spark.ssl.fs.trustStore")
--- End diff --

There is a common util method in `Utils`, but using it as I did before 
initializes `Logging` prematurely, which causes the issues @vanzin mentioned; 
that's why I reimplemented it here.

Do you have any thoughts?
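
For context, the reimplemented lookup-plus-wiring amounts to something like 
the following (a sketch using only JDK APIs, so nothing touches Spark's 
`Logging`; the method name is illustrative):

```scala
import java.io.FileInputStream
import java.security.KeyStore
import javax.net.ssl.{HttpsURLConnection, SSLContext, TrustManagerFactory}

// Install a custom truststore on an HTTPS connection without going through
// Utils, so the Logging system is not initialized as a side effect.
def configureTrustStore(
    conn: HttpsURLConnection,
    trustStorePath: String,
    trustStorePassword: Option[String]): Unit = {
  val ks = KeyStore.getInstance(KeyStore.getDefaultType)
  val in = new FileInputStream(trustStorePath)
  try ks.load(in, trustStorePassword.map(_.toCharArray).orNull) finally in.close()

  val tmf = TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm)
  tmf.init(ks)

  val sslContext = SSLContext.getInstance("TLS")
  sslContext.init(null, tmf.getTrustManagers, null)
  conn.setSSLSocketFactory(sslContext.getSocketFactory)
}
```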


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-29 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r124949216
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -841,36 +845,132 @@ object SparkSubmit extends CommandLineUtils {
    * Download a list of remote files to temp local files. If the file is local, the original file
    * will be returned.
    * @param fileList A comma separated file list.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
    * @return A comma separated local files list.
    */
   private[deploy] def downloadFileList(
       fileList: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
       hadoopConf: HadoopConfiguration): String = {
     require(fileList != null, "fileList cannot be null.")
-    fileList.split(",").map(downloadFile(_, hadoopConf)).mkString(",")
+    fileList.split(",")
+      .map(downloadFile(_, targetDir, sparkProperties, hadoopConf))
+      .mkString(",")
   }

   /**
    * Download a file from the remote to a local temporary directory. If the input path points to
    * a local path, returns it with no operation.
+   * @param path A file path from where the files will be downloaded.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
+   * @return A comma separated local files list.
    */
-  private[deploy] def downloadFile(path: String, hadoopConf: HadoopConfiguration): String = {
+  private[deploy] def downloadFile(
+      path: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
+      hadoopConf: HadoopConfiguration): String = {
     require(path != null, "path cannot be null.")
     val uri = Utils.resolveURI(path)
     uri.getScheme match {
-      case "file" | "local" =>
-        path
+      case "file" | "local" => path
+      case "http" | "https" | "ftp" =>
+        val uc = uri.toURL.openConnection()
+        uc match {
+          case https: HttpsURLConnection =>
+            val trustStore = sparkProperties.get("spark.ssl.fs.trustStore")
--- End diff --

Also, should we test against this?


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-29 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r124949124
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -841,36 +845,132 @@ object SparkSubmit extends CommandLineUtils {
    * Download a list of remote files to temp local files. If the file is local, the original file
    * will be returned.
    * @param fileList A comma separated file list.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
    * @return A comma separated local files list.
    */
   private[deploy] def downloadFileList(
       fileList: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
       hadoopConf: HadoopConfiguration): String = {
     require(fileList != null, "fileList cannot be null.")
-    fileList.split(",").map(downloadFile(_, hadoopConf)).mkString(",")
+    fileList.split(",")
+      .map(downloadFile(_, targetDir, sparkProperties, hadoopConf))
+      .mkString(",")
   }

   /**
    * Download a file from the remote to a local temporary directory. If the input path points to
    * a local path, returns it with no operation.
+   * @param path A file path from where the files will be downloaded.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
+   * @return A comma separated local files list.
    */
-  private[deploy] def downloadFile(path: String, hadoopConf: HadoopConfiguration): String = {
+  private[deploy] def downloadFile(
+      path: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
+      hadoopConf: HadoopConfiguration): String = {
     require(path != null, "path cannot be null.")
     val uri = Utils.resolveURI(path)
     uri.getScheme match {
-      case "file" | "local" =>
-        path
+      case "file" | "local" => path
+      case "http" | "https" | "ftp" =>
+        val uc = uri.toURL.openConnection()
+        uc match {
+          case https: HttpsURLConnection =>
+            val trustStore = sparkProperties.get("spark.ssl.fs.trustStore")
+              .orElse(sparkProperties.get("spark.ssl.trustStore"))
--- End diff --

Should we move these properties to internal/config?


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-29 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r124948820
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -841,36 +845,132 @@ object SparkSubmit extends CommandLineUtils {
    * Download a list of remote files to temp local files. If the file is local, the original file
    * will be returned.
    * @param fileList A comma separated file list.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
    * @return A comma separated local files list.
    */
   private[deploy] def downloadFileList(
       fileList: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
       hadoopConf: HadoopConfiguration): String = {
     require(fileList != null, "fileList cannot be null.")
-    fileList.split(",").map(downloadFile(_, hadoopConf)).mkString(",")
+    fileList.split(",")
+      .map(downloadFile(_, targetDir, sparkProperties, hadoopConf))
+      .mkString(",")
   }

   /**
    * Download a file from the remote to a local temporary directory. If the input path points to
    * a local path, returns it with no operation.
+   * @param path A file path from where the files will be downloaded.
+   * @param targetDir A temporary directory for which downloaded files
+   * @param sparkProperties Spark properties
+   * @return A comma separated local files list.
    */
-  private[deploy] def downloadFile(path: String, hadoopConf: HadoopConfiguration): String = {
+  private[deploy] def downloadFile(
+      path: String,
+      targetDir: File,
+      sparkProperties: Map[String, String],
+      hadoopConf: HadoopConfiguration): String = {
     require(path != null, "path cannot be null.")
     val uri = Utils.resolveURI(path)
     uri.getScheme match {
-      case "file" | "local" =>
-        path
+      case "file" | "local" => path
+      case "http" | "https" | "ftp" =>
+        val uc = uri.toURL.openConnection()
+        uc match {
+          case https: HttpsURLConnection =>
+            val trustStore = sparkProperties.get("spark.ssl.fs.trustStore")
--- End diff --

Should we make this a common util?


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-26 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r123932628
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -310,33 +310,28 @@ object SparkSubmit extends CommandLineUtils {
       RPackageUtils.checkAndBuildRPackage(args.jars, printStream, args.verbose)
     }

-    // In client mode, download remote files.
-    if (deployMode == CLIENT) {
-      val hadoopConf = new HadoopConfiguration()
-      args.primaryResource = Option(args.primaryResource).map(downloadFile(_, hadoopConf)).orNull
-      args.jars = Option(args.jars).map(downloadFileList(_, hadoopConf)).orNull
-      args.pyFiles = Option(args.pyFiles).map(downloadFileList(_, hadoopConf)).orNull
-      args.files = Option(args.files).map(downloadFileList(_, hadoopConf)).orNull
-    }
+    val hadoopConf = new HadoopConfiguration()
+    val targetDir = Files.createTempDirectory("tmp").toFile
+    val sparkConf = new SparkConf(loadDefaults = false).setAll(args.sparkProperties)
--- End diff --

@vanzin yes, your concern is correct: the `Logging` module will be initialized 
by another class, so the default log level constraint is broken.

The purpose here is to reuse `Utils.doFetchFile` without duplication. One 
solution I can think of is to add similar code here so that logging is not 
initialized.


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-25 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r123922136
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -310,33 +310,28 @@ object SparkSubmit extends CommandLineUtils {
       RPackageUtils.checkAndBuildRPackage(args.jars, printStream, args.verbose)
     }

-    // In client mode, download remote files.
-    if (deployMode == CLIENT) {
-      val hadoopConf = new HadoopConfiguration()
-      args.primaryResource = Option(args.primaryResource).map(downloadFile(_, hadoopConf)).orNull
-      args.jars = Option(args.jars).map(downloadFileList(_, hadoopConf)).orNull
-      args.pyFiles = Option(args.pyFiles).map(downloadFileList(_, hadoopConf)).orNull
-      args.files = Option(args.files).map(downloadFileList(_, hadoopConf)).orNull
-    }
+    val hadoopConf = new HadoopConfiguration()
+    val targetDir = Files.createTempDirectory("tmp").toFile
--- End diff --

From my understanding, no code is currently responsible for deleting it; let 
me check the code.
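
If nothing does, a minimal fix would be a shutdown hook that removes the 
directory recursively (a sketch using only JDK APIs; Spark's own 
`ShutdownHookManager` may be the better home for this):

```scala
import java.io.File

// Recursively delete the temp download directory when the JVM exits.
def deleteOnExit(dir: File): Unit = {
  Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
    override def run(): Unit = {
      def delete(f: File): Unit = {
        if (f.isDirectory) Option(f.listFiles).getOrElse(Array.empty[File]).foreach(delete)
        f.delete()
      }
      delete(dir)
    }
  }))
}
```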


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-23 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r123870610
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -845,32 +840,54 @@ object SparkSubmit extends CommandLineUtils {
    */
   private[deploy] def downloadFileList(
       fileList: String,
+      targetDir: File,
+      sparkConf: SparkConf,
+      securityManager: SecurityManager,
       hadoopConf: HadoopConfiguration): String = {
     require(fileList != null, "fileList cannot be null.")
-    fileList.split(",").map(downloadFile(_, hadoopConf)).mkString(",")
+    fileList.split(",")
+      .map(downloadFile(_, targetDir, sparkConf, securityManager, hadoopConf))
+      .mkString(",")
   }

   /**
    * Download a file from the remote to a local temporary directory. If the input path points to
    * a local path, returns it with no operation.
    */
-  private[deploy] def downloadFile(path: String, hadoopConf: HadoopConfiguration): String = {
+  private[deploy] def downloadFile(
+      path: String,
+      targetDir: File,
--- End diff --

ditto


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-23 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r123870606
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -845,32 +840,54 @@ object SparkSubmit extends CommandLineUtils {
    */
   private[deploy] def downloadFileList(
       fileList: String,
+      targetDir: File,
--- End diff --

Let's explain the meaning of each param.


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-23 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r123870719
  
--- Diff: core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala ---
@@ -752,25 +793,35 @@ class SparkSubmitSuite

   test("downloadFile - invalid url") {
     intercept[IOException] {
-      SparkSubmit.downloadFile("abc:/my/file", new Configuration())
+      val sparkConf = new SparkConf()
--- End diff --

Should we create a `testDownloadFile()` function and merge the duplicated 
test code?
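
Something along these lines could fold the duplicated setup into one place (a 
sketch inside the suite, written against the signatures visible in this 
thread; the exact parameters depend on the final `downloadFile` signature):

```scala
import java.nio.file.Files
import org.apache.hadoop.conf.Configuration
import org.apache.spark.SparkConf

// Shared helper for the downloadFile tests: builds the common fixtures and
// delegates to SparkSubmit.downloadFile.
private def testDownloadFile(path: String): String = {
  val sparkConf = new SparkConf(loadDefaults = false)
  val targetDir = Files.createTempDirectory("test-download").toFile
  SparkSubmit.downloadFile(path, targetDir, sparkConf.getAll.toMap, new Configuration())
}
```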


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-23 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r123800801
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -310,33 +310,28 @@ object SparkSubmit extends CommandLineUtils {
       RPackageUtils.checkAndBuildRPackage(args.jars, printStream, args.verbose)
     }

-    // In client mode, download remote files.
-    if (deployMode == CLIENT) {
-      val hadoopConf = new HadoopConfiguration()
-      args.primaryResource = Option(args.primaryResource).map(downloadFile(_, hadoopConf)).orNull
-      args.jars = Option(args.jars).map(downloadFileList(_, hadoopConf)).orNull
-      args.pyFiles = Option(args.pyFiles).map(downloadFileList(_, hadoopConf)).orNull
-      args.files = Option(args.files).map(downloadFileList(_, hadoopConf)).orNull
-    }
+    val hadoopConf = new HadoopConfiguration()
+    val targetDir = Files.createTempDirectory("tmp").toFile
--- End diff --

I see this comes from the previous code, but who's deleting this directory?


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-23 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r123802866
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -310,33 +310,28 @@ object SparkSubmit extends CommandLineUtils {
       RPackageUtils.checkAndBuildRPackage(args.jars, printStream, args.verbose)
     }

-    // In client mode, download remote files.
-    if (deployMode == CLIENT) {
-      val hadoopConf = new HadoopConfiguration()
-      args.primaryResource = Option(args.primaryResource).map(downloadFile(_, hadoopConf)).orNull
-      args.jars = Option(args.jars).map(downloadFileList(_, hadoopConf)).orNull
-      args.pyFiles = Option(args.pyFiles).map(downloadFileList(_, hadoopConf)).orNull
-      args.files = Option(args.files).map(downloadFileList(_, hadoopConf)).orNull
-    }
+    val hadoopConf = new HadoopConfiguration()
+    val targetDir = Files.createTempDirectory("tmp").toFile
+    val sparkConf = new SparkConf(loadDefaults = false).setAll(args.sparkProperties)
--- End diff --

Hmm... I think there might be a problem with logging when calling these 
classes here. This code runs before the app's main class is called, and it may 
initialize the logging system (e.g. `Utils.doFetchFile` later calls logging 
methods in all code paths, initializing the logging system, and so does 
`SecurityManager`).

This will be a problem, for example, for the spark-shell, which triggers a 
slightly different code path when initializing logging: because the logging 
system would now be initialized by a class that is not the REPL's `Main` 
class, that behavior would be lost.

Long story short, with the code you're adding, the shell should still be 
initialized with `WARN` as the default log level. Can you run `spark-shell` 
with a remote jar in the jars list and verify that it still maintains that?


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-19 Thread jerryshao
Github user jerryshao commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r122707282
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -858,19 +844,33 @@ object SparkSubmit extends CommandLineUtils {
     require(path != null, "path cannot be null.")
     val uri = Utils.resolveURI(path)
     uri.getScheme match {
-      case "file" | "local" =>
-        path
-
+      case "file" | "local" => path
+      case "https" | "http" | "ftp" => path
--- End diff --

My original thinking was that the `FileSystem` API doesn't support these 
schemes, so I treated them separately, without downloading. I will verify it.
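
For reference, a stock Hadoop 2.x configuration registers no `FileSystem` for 
these schemes, so the generic branch fails before any download happens (a 
quick check; the exact exception message varies across Hadoop versions):

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.FileSystem

// Throws java.io.IOException ("No FileSystem for scheme: http") on a default
// Hadoop 2.x configuration, which is why http/https/ftp need a separate path.
FileSystem.get(new URI("http://example.com/some.jar"), new Configuration())
```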


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-16 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r122503215
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -858,19 +844,33 @@ object SparkSubmit extends CommandLineUtils {
     require(path != null, "path cannot be null.")
     val uri = Utils.resolveURI(path)
     uri.getScheme match {
-      case "file" | "local" =>
-        path
-
+      case "file" | "local" => path
+      case "https" | "http" | "ftp" => path
       case _ =>
         val fs = FileSystem.get(uri, hadoopConf)
-        val tmpFile = new File(Files.createTempDirectory("tmp").toFile, uri.getPath)
+        val tmpFile = new File(Files.createTempDirectory("tmp").toFile, new Path(uri).getName)
         // scalastyle:off println
         printStream.println(s"Downloading ${uri.toString} to ${tmpFile.getAbsolutePath}.")
         // scalastyle:on println
         fs.copyToLocalFile(new Path(uri), new Path(tmpFile.getAbsolutePath))
         Utils.resolveURI(tmpFile.getAbsolutePath).toString
     }
   }
+
+  private def resolveGlobPaths(paths: String, hadoopConf: HadoopConfiguration): String = {
+    require(paths != null, "paths cannot be null.")
+    paths.split(",").filter(_.trim.nonEmpty).flatMap { path =>
--- End diff --

Minor, but you should actually trim the strings, not just trim them for the 
filter check.
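
That is, roughly (a sketch of the suggested change):

```scala
val paths = "a.jar, b.jar ,, c.jar"
// Trim first so downstream consumers see clean entries, rather than trimming
// only inside the filter predicate.
val entries = paths.split(",").map(_.trim).filter(_.nonEmpty)
// entries: Array(a.jar, b.jar, c.jar)
```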


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-16 Thread vanzin
Github user vanzin commented on a diff in the pull request:

https://github.com/apache/spark/pull/18235#discussion_r122502830
  
--- Diff: core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala ---
@@ -858,19 +844,33 @@ object SparkSubmit extends CommandLineUtils {
     require(path != null, "path cannot be null.")
     val uri = Utils.resolveURI(path)
     uri.getScheme match {
-      case "file" | "local" =>
-        path
-
+      case "file" | "local" => path
+      case "https" | "http" | "ftp" => path
--- End diff --

This is wrong, right? These need to be downloaded, and now they won't be.

(Whether the `FileSystem` API can download those is a separate question... 
if it doesn't, you'll end up with another implementation of 
`Utils.doFetchFile`.)


[GitHub] spark pull request #18235: [SPARK-21012][Submit] Add glob support for resour...

2017-06-07 Thread jerryshao
GitHub user jerryshao opened a pull request:

https://github.com/apache/spark/pull/18235

[SPARK-21012][Submit] Add glob support for resources adding to Spark

Currently `--jars` (`spark.jars`), `--files` (`spark.files`), `--py-files` 
(`spark.submit.pyFiles`), and `--archives` (`spark.yarn.dist.archives`) only 
support non-glob paths. This is fine for most cases, but when a user needs to 
add many jars or files to Spark, listing them one by one is too verbose; for 
example, `--jars /path/lib/*.jar` is much easier to write than enumerating 
every jar under `/path/lib`. So this PR proposes adding glob path support for 
resources.

It also improves the resource-downloading code.

## How was this patch tested?

UTs added; also verified manually in a local cluster.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jerryshao/apache-spark SPARK-21012

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/18235.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #18235


commit 339888964e04e1097b0c522a7eeb13a280c22ddc
Author: jerryshao 
Date:   2017-06-08T02:58:09Z

Add glob support for resources adding to Spark

Change-Id: I219c1c9b7b53b7fb724cfbdc76cc03038f656bfb



