[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r80030485
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
--- End diff --

OK, if that requires the Java context, I get it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r80030192
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
--- End diff --

I don't think we need to undo this: I will soon send a PR to support 
recursively adding files under a directory in SparkR, and it will leverage 
this API. Thanks.





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-22 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r80028815
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
--- End diff --

Yes, but can we undo this change? It doesn't seem like we need to duplicate 
this method in the Java API.





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-22 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r80028421
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
--- End diff --

@srowen After investigating the code, I found that it's not straightforward 
to clean up the interfaces in ```JavaSparkContext```, since they are called from 
Python and R. On the Python side we can use ```_jsc.sc()``` in some cases, but 
it gets messy on the R side if we use both ```JavaSparkContext``` and 
```JavaSparkContext.sc```. So I think we should leave it as it is. Do you have 
any other suggestions? Thanks.





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r79791116
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
--- End diff --

Sounds good. There may be a reason the Java context is needed for some 
calls. I suppose that where the SparkContext could be used ... yeah that's 
simpler but doesn't really save anything because we wouldn't be able to take 
methods out of JavaSparkContext. That's why I was hoping to avoid adding a 
method to it.





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-21 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r79790335
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
--- End diff --

Oh, I see. I found that ```_jsc.sc()``` and ```_jsc``` are both used, 
inconsistently, in ```context.py```. I will clean this up and unify them in 
follow-up work. Thanks for your comments.





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r79784786
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
--- End diff --

You're calling the method on `SparkContext`: 

```
self._jsc.sc().addFile(path, recursive)
```

I don't think this needed to be exposed?
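
To illustrate the point, here is a minimal sketch that uses hypothetical stand-in classes (`FakeSparkContext` and `FakeJavaSparkContext` are invented for this example and are not the real JVM objects reached through Py4J). It assumes the wrapper exposes only the one-argument `addFile`. Because the caller unwraps with `sc()` first, the two-argument call lands on the underlying context and the wrapper needs no overload of its own:

```python
class FakeSparkContext:
    """Hypothetical stand-in for the Scala SparkContext."""

    def __init__(self):
        self.added = []

    def addFile(self, path, recursive=False):
        # Record the call instead of distributing a file.
        self.added.append((path, recursive))


class FakeJavaSparkContext:
    """Hypothetical stand-in for JavaSparkContext: wraps a context, exposes it via sc()."""

    def __init__(self, sc):
        self._sc = sc

    def sc(self):
        return self._sc

    def addFile(self, path):
        # Assumption for this sketch: only the one-argument form lives on the wrapper.
        self._sc.addFile(path)


jsc = FakeJavaSparkContext(FakeSparkContext())
# Unwrapping first reaches the two-argument method directly.
jsc.sc().addFile("/tmp/hello", True)
# The wrapper's own method still covers the non-recursive case.
jsc.addFile("/tmp/hello.txt")
calls = jsc.sc().added
```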





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-21 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/15140





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-21 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r79783756
  
--- Diff: python/pyspark/context.py ---
@@ -762,7 +762,7 @@ def accumulator(self, value, accum_param=None):
         SparkContext._next_accum_id += 1
         return Accumulator(SparkContext._next_accum_id - 1, value, accum_param)
 
-    def addFile(self, path):
+    def addFile(self, path, recursive=False):
--- End diff --

Yes, it does not change the existing API.





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-21 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r79783577
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
--- End diff --

Because ```JavaSparkContext``` provides the Java stubs that PySpark 
calls.





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r79782388
  
--- Diff: python/pyspark/context.py ---
@@ -762,7 +762,7 @@ def accumulator(self, value, accum_param=None):
         SparkContext._next_accum_id += 1
         return Accumulator(SparkContext._next_accum_id - 1, value, accum_param)
 
-    def addFile(self, path):
+    def addFile(self, path, recursive=False):
--- End diff --

This basically doesn't change the API, right? You can still call it as 
before with the same behavior.

It seems reasonable to me overall because it adds parity between the APIs, 
isn't complex and doesn't change behavior.
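
A minimal sketch of why the default value keeps existing callers working. `StubContext` is a hypothetical stand-in that only mirrors the new signature from the diff; it is not the real `SparkContext` and distributes nothing:

```python
class StubContext:
    """Hypothetical stand-in; only mirrors the new addFile signature."""

    def __init__(self):
        self.calls = []

    def addFile(self, path, recursive=False):
        # Record the arguments as the method received them.
        self.calls.append((path, recursive))


ctx = StubContext()
ctx.addFile("hello.txt")              # existing call style, recursive stays False
ctx.addFile("hello", True)            # new positional form
ctx.addFile("hello", recursive=True)  # new keyword form
```

Only callers that pass the extra argument see any difference; the one-argument call is indistinguishable from before.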





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-21 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r79782153
  
--- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
@@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
   }
 
   /**
+   * Add a file to be downloaded with this Spark job on every node.
+   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
+   * filesystems), or an HTTP, HTTPS or FTP URI.  To access the file in Spark jobs,
+   * use `SparkFiles.get(fileName)` to find its download location.
+   *
+   * A directory can be given if the recursive option is set to true. Currently directories are only
+   * supported for Hadoop-supported filesystems.
+   */
+  def addFile(path: String, recursive: Boolean): Unit = {
--- End diff --

(Why do we need this?)





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-19 Thread BryanCutler
Github user BryanCutler commented on a diff in the pull request:

https://github.com/apache/spark/pull/15140#discussion_r79454288
  
--- Diff: python/pyspark/tests.py ---
@@ -409,13 +409,22 @@ def func(x):
         self.assertEqual("Hello World!", res)
 
     def test_add_file_locally(self):
-        path = os.path.join(SPARK_HOME, "python/test_support/hello.txt")
+        path = os.path.join(SPARK_HOME, "python/test_support/hello/hello.txt")
         self.sc.addFile(path)
         download_path = SparkFiles.get("hello.txt")
         self.assertNotEqual(path, download_path)
         with open(download_path) as test_file:
             self.assertEqual("Hello World!\n", test_file.readline())
 
+        path = os.path.join(SPARK_HOME, "python/test_support/hello")
+        self.sc.addFile(path, True)
+        download_path = SparkFiles.get("hello")
+        self.assertNotEqual(path, download_path)
+        with open(download_path + "/hello.txt") as test_file:
+            self.assertEqual("Hello World!\n", test_file.readline())
+        with open(download_path + "/sub_hello/sub_hello.txt") as test_file:
+            self.assertEqual("Sub Hello World!\n", test_file.readline())
+
--- End diff --

minor:  maybe the above block should be in a separate test like `def 
test_add_file_locally_recursive`?
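
A rough sketch of what that split could look like. It runs without Spark by using a hypothetical `FakeContext` that copies files into a temporary directory; the class, its `get` method (standing in for `SparkFiles.get`), and the error raised for a non-recursive directory are all assumptions made for this example, not PySpark behavior:

```python
import os
import shutil
import tempfile
import unittest


class FakeContext:
    """Hypothetical stand-in for SparkContext: copies into a local 'download' dir."""

    def __init__(self, root):
        self.root = root

    def addFile(self, path, recursive=False):
        target = os.path.join(self.root, os.path.basename(path))
        if os.path.isdir(path):
            if not recursive:
                # This sketch's own choice of error; not taken from Spark.
                raise ValueError("directory given but recursive is False")
            shutil.copytree(path, target)
        else:
            shutil.copy(path, target)

    def get(self, name):
        # Plays the role of SparkFiles.get in this sketch.
        return os.path.join(self.root, name)


class AddFileTests(unittest.TestCase):
    def setUp(self):
        # Build hello/hello.txt and hello/sub_hello/sub_hello.txt as fixtures.
        self.src = tempfile.mkdtemp()
        self.sc = FakeContext(tempfile.mkdtemp())
        sub_dir = os.path.join(self.src, "hello", "sub_hello")
        os.makedirs(sub_dir)
        with open(os.path.join(self.src, "hello", "hello.txt"), "w") as f:
            f.write("Hello World!\n")
        with open(os.path.join(sub_dir, "sub_hello.txt"), "w") as f:
            f.write("Sub Hello World!\n")

    def tearDown(self):
        shutil.rmtree(self.src)
        shutil.rmtree(self.sc.root)

    def test_add_file_locally(self):
        self.sc.addFile(os.path.join(self.src, "hello", "hello.txt"))
        with open(self.sc.get("hello.txt")) as test_file:
            self.assertEqual("Hello World!\n", test_file.readline())

    def test_add_file_locally_recursive(self):
        self.sc.addFile(os.path.join(self.src, "hello"), True)
        download_path = self.sc.get("hello")
        with open(os.path.join(download_path, "hello.txt")) as test_file:
            self.assertEqual("Hello World!\n", test_file.readline())
        sub_path = os.path.join(download_path, "sub_hello", "sub_hello.txt")
        with open(sub_path) as test_file:
            self.assertEqual("Sub Hello World!\n", test_file.readline())


suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddFileTests)
result = unittest.TextTestRunner().run(suite)
```

With the two cases separated, a failure in the recursive path is reported under its own test name instead of being folded into `test_add_file_locally`.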





[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...

2016-09-18 Thread yanboliang
GitHub user yanboliang opened a pull request:

https://github.com/apache/spark/pull/15140

[SPARK-17585][PySpark][Core] PySpark SparkContext.addFile supports adding 
files recursively

## What changes were proposed in this pull request?
PySpark ```SparkContext.addFile``` should support adding files recursively 
under a directory, as the Scala API does.

## How was this patch tested?
Unit test.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/yanboliang/spark spark-17585

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15140.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15140


commit c277a2cd755df908b8d0adc9863a3a30eb94784c
Author: Yanbo Liang 
Date:   2016-09-18T14:28:01Z

PySpark SparkContext.addFile supports adding files recursively



