[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r80030485

    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
       }

       /**
    +   * Add a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   * A directory can be given if the recursive option is set to true. Currently directories are only
    +   * supported for Hadoop-supported filesystems.
    +   */
    +  def addFile(path: String, recursive: Boolean): Unit = {
    --- End diff --

    OK, if that requires the Java context, I get it.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r80030192

    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
       }

       /**
    +   * Add a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   * A directory can be given if the recursive option is set to true. Currently directories are only
    +   * supported for Hadoop-supported filesystems.
    +   */
    +  def addFile(path: String, recursive: Boolean): Unit = {
    --- End diff --

    I don't think we need to undo this, since I will soon send a PR that adds support for recursively adding files under a directory in SparkR, and it will leverage this API. Thanks.
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r80028815

    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
       }

       /**
    +   * Add a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   * A directory can be given if the recursive option is set to true. Currently directories are only
    +   * supported for Hadoop-supported filesystems.
    +   */
    +  def addFile(path: String, recursive: Boolean): Unit = {
    --- End diff --

    Yes, but can we undo this change? It doesn't seem like we need to duplicate this method in the Java API.
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r80028421

    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
       }

       /**
    +   * Add a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   * A directory can be given if the recursive option is set to true. Currently directories are only
    +   * supported for Hadoop-supported filesystems.
    +   */
    +  def addFile(path: String, recursive: Boolean): Unit = {
    --- End diff --

    @srowen After investigating the code, I found that it's not straightforward to clean up the interfaces on ```JavaSparkContext```, since they are called from Python and R. On the Python side we can use ```_jsc.sc()``` in some cases, but it gets messy if we use both ```JavaSparkContext``` and ```JavaSparkContext.sc``` on the R side. So I think we should leave it as it is. Do you have any other suggestions? Thanks.
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r79791116

    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
       }

       /**
    +   * Add a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   * A directory can be given if the recursive option is set to true. Currently directories are only
    +   * supported for Hadoop-supported filesystems.
    +   */
    +  def addFile(path: String, recursive: Boolean): Unit = {
    --- End diff --

    Sounds good. There may be a reason the Java context is needed for some calls. I suppose that where the SparkContext could be used ... yeah that's simpler but doesn't really save anything because we wouldn't be able to take methods out of JavaSparkContext. That's why I was hoping to avoid adding a method to it.
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r79790335

    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
       }

       /**
    +   * Add a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   * A directory can be given if the recursive option is set to true. Currently directories are only
    +   * supported for Hadoop-supported filesystems.
    +   */
    +  def addFile(path: String, recursive: Boolean): Unit = {
    --- End diff --

    Oh, I see. I found that usage of ```_jsc.sc()``` and ```_jsc``` is mixed in ```context.py```. I will clean this up and unify them in follow-up work. Thanks for your comments.
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r79784786

    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
       }

       /**
    +   * Add a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   * A directory can be given if the recursive option is set to true. Currently directories are only
    +   * supported for Hadoop-supported filesystems.
    +   */
    +  def addFile(path: String, recursive: Boolean): Unit = {
    --- End diff --

    You're calling the method on `SparkContext`:

    ```
    self._jsc.sc().addFile(path, recursive)
    ```

    I don't think this needed to be exposed?
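The point in the comment above, that the Python side already unwraps `_jsc` to reach the underlying context, can be sketched without a JVM. This is only an illustration under stated assumptions: `StandInSparkContext`, `StandInJavaSparkContext` and `StandInPyContext` are hypothetical stand-ins invented here, not Spark classes, and the wrapper is modelled without the two-argument overload that the diff adds.

```python
class StandInSparkContext:
    """Hypothetical stand-in for the Scala SparkContext; records calls only."""

    def __init__(self):
        self.added = []

    def addFile(self, path, recursive=False):
        self.added.append((path, recursive))


class StandInJavaSparkContext:
    """Hypothetical stand-in for JavaSparkContext, a wrapper around a context."""

    def __init__(self, sc):
        self._sc = sc

    def sc(self):
        # Expose the wrapped context, mirroring the `sc()` accessor in the quote.
        return self._sc

    def addFile(self, path):
        # Modelled with the one-argument form only (an assumption of this sketch).
        self._sc.addFile(path)


class StandInPyContext:
    """Hypothetical stand-in for the Python-side context that holds `_jsc`."""

    def __init__(self):
        self._jsc = StandInJavaSparkContext(StandInSparkContext())

    def addFile(self, path, recursive=False):
        # Same shape as the quoted line: unwrap, then call the underlying context.
        self._jsc.sc().addFile(path, recursive)


ctx = StandInPyContext()
ctx.addFile("hello", True)
```

In this sketch the recursive call reaches the underlying context even though the wrapper has no two-argument method, which is the situation the comment describes.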
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/15140
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r79783756

    --- Diff: python/pyspark/context.py ---
    @@ -762,7 +762,7 @@ def accumulator(self, value, accum_param=None):
             SparkContext._next_accum_id += 1
             return Accumulator(SparkContext._next_accum_id - 1, value, accum_param)

    -    def addFile(self, path):
    +    def addFile(self, path, recursive=False):
    --- End diff --

    Yes, it does not change the existing API.
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user yanboliang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r79783577

    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
       }

       /**
    +   * Add a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   * A directory can be given if the recursive option is set to true. Currently directories are only
    +   * supported for Hadoop-supported filesystems.
    +   */
    +  def addFile(path: String, recursive: Boolean): Unit = {
    --- End diff --

    Because ```JavaSparkContext``` provides the Java stubs that PySpark calls.
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r79782388

    --- Diff: python/pyspark/context.py ---
    @@ -762,7 +762,7 @@ def accumulator(self, value, accum_param=None):
             SparkContext._next_accum_id += 1
             return Accumulator(SparkContext._next_accum_id - 1, value, accum_param)

    -    def addFile(self, path):
    +    def addFile(self, path, recursive=False):
    --- End diff --

    This basically doesn't change the API, right? You can still call it as before with the same behavior. It seems reasonable to me overall because it adds parity between the APIs, isn't complex and doesn't change behavior.
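The compatibility argument in the comment above rests on the new parameter having a default value. A minimal sketch, assuming a hypothetical `SketchContext` class that only records its arguments (it is not `pyspark.SparkContext` and starts no JVM):

```python
class SketchContext:
    """Hypothetical stand-in that records addFile calls instead of shipping files."""

    def __init__(self):
        self.calls = []

    def addFile(self, path, recursive=False):
        # The default keeps one-argument callers on the old, non-recursive path.
        self.calls.append((path, recursive))


ctx = SketchContext()
ctx.addFile("hello.txt")              # existing call style, unchanged
ctx.addFile("hello", True)            # positional form, as used in the PR's test
ctx.addFile("hello", recursive=True)  # keyword form
```

Existing callers that pass only `path` keep working and get `recursive=False`, which is why the signature change reads as additive rather than breaking.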
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user srowen commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r79782153

    --- Diff: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala ---
    @@ -670,6 +670,19 @@ class JavaSparkContext(val sc: SparkContext)
       }

       /**
    +   * Add a file to be downloaded with this Spark job on every node.
    +   * The `path` passed can be either a local file, a file in HDFS (or other Hadoop-supported
    +   * filesystems), or an HTTP, HTTPS or FTP URI. To access the file in Spark jobs,
    +   * use `SparkFiles.get(fileName)` to find its download location.
    +   *
    +   * A directory can be given if the recursive option is set to true. Currently directories are only
    +   * supported for Hadoop-supported filesystems.
    +   */
    +  def addFile(path: String, recursive: Boolean): Unit = {
    --- End diff --

    (Why do we need this?)
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
Github user BryanCutler commented on a diff in the pull request:

    https://github.com/apache/spark/pull/15140#discussion_r79454288

    --- Diff: python/pyspark/tests.py ---
    @@ -409,13 +409,22 @@ def func(x):
             self.assertEqual("Hello World!", res)

         def test_add_file_locally(self):
    -        path = os.path.join(SPARK_HOME, "python/test_support/hello.txt")
    +        path = os.path.join(SPARK_HOME, "python/test_support/hello/hello.txt")
             self.sc.addFile(path)
             download_path = SparkFiles.get("hello.txt")
             self.assertNotEqual(path, download_path)
             with open(download_path) as test_file:
                 self.assertEqual("Hello World!\n", test_file.readline())

    +        path = os.path.join(SPARK_HOME, "python/test_support/hello")
    +        self.sc.addFile(path, True)
    +        download_path = SparkFiles.get("hello")
    +        self.assertNotEqual(path, download_path)
    +        with open(download_path + "/hello.txt") as test_file:
    +            self.assertEqual("Hello World!\n", test_file.readline())
    +        with open(download_path + "/sub_hello/sub_hello.txt") as test_file:
    +            self.assertEqual("Sub Hello World!\n", test_file.readline())
    +
    --- End diff --

    minor: maybe the above block should be in a separate test like `def test_add_file_locally_recursive`?
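One way the suggested split could look is sketched below. To keep it runnable without a Spark installation, it swaps in a hypothetical `StandInContext` that copies files into a temporary directory in place of `SparkContext.addFile` and `SparkFiles.get`; the fixture layout (`hello/hello.txt`, `hello/sub_hello/sub_hello.txt`) follows the diff, and everything else, including `run_sketch`, is an assumption of this sketch.

```python
import os
import shutil
import tempfile
import unittest


class StandInContext:
    """Hypothetical stand-in for SparkContext plus SparkFiles (no Spark involved)."""

    def __init__(self):
        self.root = tempfile.mkdtemp()

    def addFile(self, path, recursive=False):
        target = os.path.join(self.root, os.path.basename(path))
        if os.path.isdir(path):
            if not recursive:
                raise ValueError("a directory requires recursive=True")
            shutil.copytree(path, target)
        else:
            shutil.copy(path, target)

    def get(self, name):
        return os.path.join(self.root, name)


class AddFileTests(unittest.TestCase):
    def setUp(self):
        # Build the fixture tree named in the diff under a temporary directory.
        self.src = tempfile.mkdtemp()
        sub = os.path.join(self.src, "hello", "sub_hello")
        os.makedirs(sub)
        with open(os.path.join(self.src, "hello", "hello.txt"), "w") as f:
            f.write("Hello World!\n")
        with open(os.path.join(sub, "sub_hello.txt"), "w") as f:
            f.write("Sub Hello World!\n")
        self.sc = StandInContext()

    def tearDown(self):
        shutil.rmtree(self.src)
        shutil.rmtree(self.sc.root)

    def test_add_file_locally(self):
        path = os.path.join(self.src, "hello", "hello.txt")
        self.sc.addFile(path)
        download_path = self.sc.get("hello.txt")
        self.assertNotEqual(path, download_path)
        with open(download_path) as test_file:
            self.assertEqual("Hello World!\n", test_file.readline())

    def test_add_file_locally_recursive(self):
        path = os.path.join(self.src, "hello")
        self.sc.addFile(path, True)
        download_path = self.sc.get("hello")
        self.assertNotEqual(path, download_path)
        with open(os.path.join(download_path, "hello.txt")) as test_file:
            self.assertEqual("Hello World!\n", test_file.readline())
        nested = os.path.join(download_path, "sub_hello", "sub_hello.txt")
        with open(nested) as test_file:
            self.assertEqual("Sub Hello World!\n", test_file.readline())


def run_sketch():
    """Run both tests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddFileTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

With the recursive case in its own method, a failure in directory handling is reported separately from the single-file case.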
[GitHub] spark pull request #15140: [SPARK-17585][PySpark][Core] PySpark SparkContext...
GitHub user yanboliang opened a pull request:

    https://github.com/apache/spark/pull/15140

    [SPARK-17585][PySpark][Core] PySpark SparkContext.addFile supports adding files recursively

    ## What changes were proposed in this pull request?
    PySpark ```SparkContext.addFile``` should support adding files recursively under a directory, similar to Scala.

    ## How was this patch tested?
    Unit test.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/yanboliang/spark spark-17585

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15140.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15140

commit c277a2cd755df908b8d0adc9863a3a30eb94784c
Author: Yanbo Liang
Date:   2016-09-18T14:28:01Z

    PySpark SparkContext.addFile supports adding files recursively