[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651023#comment-14651023 ]

Michael Smith commented on SPARK-7230:

I support Antonio's request to bring back this functionality in version 1.5 so that plyrmr can continue to be used with the Spark backend as before.

Make RDD API private in SparkR for Spark 1.4

Key: SPARK-7230
URL: https://issues.apache.org/jira/browse/SPARK-7230
Project: Spark
Issue Type: Sub-task
Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Critical
Fix For: 1.4.0

This ticket proposes making the RDD API in SparkR private for the 1.4 release. The motivation for doing so is discussed in a larger design document aimed at a more top-down design of the SparkR APIs. A first cut that discusses motivation and proposed changes can be found at http://goo.gl/GLHKZI

The main points in that document that relate to this ticket are:
- The RDD API requires knowledge of the distributed system and is pretty low level. This is not very suitable for a number of R users who are used to more high-level packages that work out of the box.
- The RDD implementation in SparkR is not fully robust right now: we are missing features like spilling for aggregation, handling partitions which don't fit in memory etc. There are further limitations like lack of hashCode for non-native types etc. which might affect user experience.

The only change we will make for now is to not export the RDD functions as public methods in the SparkR package, and I will create another ticket for discussing the public API for 1.5 in more detail.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564943#comment-14564943 ]

Vincent Warmerdam commented on SPARK-7230:

[~shivaram] On it. You should see some pull requests soon. I'll also add a pull request for the master GitHub branch, but only to edit the README file so that it lists a similar message.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563593#comment-14563593 ]

Shivaram Venkataraman commented on SPARK-7230:

[~cantdutchthis] Thanks both for the write-up and for the heads-up that the website at amplab-extras/SparkR-pkg is out of date. The old SparkR website lives at https://github.com/amplab-extras/SparkR-pkg/tree/gh-pages -- Let me know if you are interested in making this change; if so, you can just open a PR.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563563#comment-14563563 ]

Antonio Piccolboni commented on SPARK-7230:

And if I may add a side note, I highly recommend Vincent's post (http://blog.rstudio.org/2015/05/28/sparkr-preview-by-vincent-warmerdam/). I think it's a fitting epitaph for the RDD API.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563554#comment-14563554 ]

Vincent Warmerdam commented on SPARK-7230:

If this decision is now final, it might be good to explicitly communicate this API change on the old SparkR GitHub page. Although I should have been reading the SparkR news here, I was using https://github.com/amplab-extras/SparkR-pkg as a reference until now because the documentation was better. Communicating the change there would spare new developers some pain.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532101#comment-14532101 ]

Sun Rui commented on SPARK-7230:

One question here: there are still some basic RDD API methods provided in DataFrame, like map(), flatMap(), mapPartitions() and foreach(). What's our policy on these methods? Will we also make them private for 1.4, or will we support them for the long term?
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532106#comment-14532106 ]

Reynold Xin commented on SPARK-7230:

We should hide them for now. As a matter of fact, I think those shouldn't even exist in the Scala/Python version of DataFrames, but those are hard to remove now.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533728#comment-14533728 ]

Sun Rui commented on SPARK-7230:

[~shivaram], got it. thanks.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532913#comment-14532913 ]

Shivaram Venkataraman commented on SPARK-7230:

Actually, with the namespace change in PR 5895 these have been made private. We no longer export `map`, `flatMap` etc. in SparkR's namespace, and this applies to both RDD and DataFrame. They are still available as a private API, so we can use them if required for implementing newer APIs, ML algorithms etc.
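As a side illustration of the namespace change described above: in R, a function that is not exported from a package's namespace stays reachable through the `:::` operator, which is why un-exporting hides an API without deleting it. The sketch below uses the base `stats` package rather than SparkR so that it runs without a Spark installation. The choice of `t.test.default` as the unexported function is ours, and the commented SparkR line is an assumed analogue, not something taken from the ticket.

```r
# Exported functions are listed in the package namespace's export table.
exported <- getNamespaceExports("stats")

is_exported <- function(name) name %in% exported

is_exported("median")          # an exported function
is_exported("t.test.default")  # an S3 method that stats does not export

# `::` only resolves exported objects (it goes through getExportedValue),
# so this lookup raises an error, which we capture as a message string
# instead of letting it abort the script.
public_lookup <- tryCatch(
  getExportedValue("stats", "t.test.default"),
  error = function(e) conditionMessage(e)
)

# `:::` reaches into the namespace and finds the unexported function.
private_fn <- stats:::t.test.default
result <- private_fn(c(5.1, 4.9, 5.3, 5.0, 5.2), mu = 5)

# Assumed SparkR 1.4 analogue (not run here): a call such as
#   SparkR:::lapply(rdd, function(x) x * 2)
# would need the `:::` prefix once the RDD functions are no longer exported.
```

Code that relies on `:::` depends on package internals, so it can break whenever those internals change.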
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527402#comment-14527402 ]

Apache Spark commented on SPARK-7230:

User 'shivaram' has created a pull request for this issue: https://github.com/apache/spark/pull/5895
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520058#comment-14520058 ]

Patrick Wendell commented on SPARK-7230:

I think this is a good idea. We should expose a narrower, higher-level API here and then look at user feedback to understand whether we want to support something lower level. From my experience with PySpark, it was a huge effort (probably more than 5X the original contribution) to actually implement everything in the lowest-level Spark APIs. And for the R community I don't think those low-level ETL APIs are that useful. So I'd be inclined to keep it simple at the beginning and then add complexity if we see new user demand.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520424#comment-14520424 ]

Antonio Piccolboni commented on SPARK-7230:

plyrmr on spark depends on the RDD API and has hundreds of downloads per month. We also have an experimental doParallelSpark that interfaces spark with foreach, hence with 50+ R packages including mainstream ones like CARET. I know the current API wasn't meant to be stable, but retiring the whole thing, I think that's a declaration of war on people developing on top of it. Sure you are going to have more mainstream appeal with the proposed changes and dataframe API, no discussion, but as far as appealing to developers, that's an unambiguous F you directed to them. rmr2 is a package that interfaces R with mapreduce at a similar level of abstraction as sparkR, and has thousands of downloads per month and a commercial product based on it.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520465#comment-14520465 ]

Shivaram Venkataraman commented on SPARK-7230:

[~piccolbo] Thanks for your input. I just want to clarify a couple of things that I might not have made clear in the ticket description.

1. The proposal is not to retire the API but just to make it private for the Spark 1.4 release while we figure out what the API should be. The main problem is that the RDD API is very verbose, with a large number of low-level ETL functions. Additionally, it opens up a number of failure modes which we don't want to expose to end users in a release. In fact, some of the closure cleaning bugs that you have found are examples of things which are not well specified in the RDD API at this point.

2. Supporting packages like plyrmr on top of SparkR is definitely something we would like to do. If you take a look at the first cut design doc linked above (http://goo.gl/GLHKZI), the proposal is to go towards an API similar to snow or `parallel`, which are successful, existing R packages. And as far as I know `rmr2` also has a simple API with only a few functions. I also took a look at the plyrmr codebase, and from what I can see the main SparkR functions currently used in plyrmr are `lapply` or `lapplyPartition`. This is pretty similar to the API required for the use cases described in the doc above.

3. Finally, we can continue the discussion of what functions should be a part of the API and what the contracts for UDFs should be in https://issues.apache.org/jira/browse/SPARK-7264 -- The goal is to have that for the Spark 1.5 release, and we'd definitely like your input on it.
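For readers who have not used the `parallel` package mentioned in point 2, the sketch below shows how small that style of API is: start a set of workers, apply a function over a list, and collect the results. It uses only base R's `parallel` package on a local two-worker cluster. It is not SparkR code, and the worker count and the squaring function are arbitrary choices for illustration.

```r
library(parallel)

# Start two local worker processes (an arbitrary size for this sketch).
cl <- makeCluster(2)

# Apply a function to each element on the workers; results come back
# as a list in the same order as the input.
squares <- parLapply(cl, 1:4, function(x) x^2)

# Release the workers once the work is done.
stopCluster(cl)

unlist(squares)
```

The comment above suggests that `lapply` and `lapplyPartition` would play roughly the role that `parLapply` plays here, though the exact shape of the eventual SparkR API was still to be decided in SPARK-7264.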
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520486#comment-14520486 ]

Reynold Xin commented on SPARK-7230:

The existing SparkR package is still out there that you can use, can't you? https://github.com/amplab-extras/SparkR-pkg Note that we are not removing anything, since this was never released ...
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520479#comment-14520479 ]

Antonio Piccolboni commented on SPARK-7230:

If you make a call private, you break every single package that uses it. It's the same as erasing the code. What do you want me to do, to sprinkle my code with SparkR::: to make it work? Fork SparkR to maintain a different NAMESPACE file? The reasonable course here is to

a) announce big changes in the API and call for proposals, discussion (check)
b) implement new API, say 1.5
c) retire old API.

It looks to me you are applying an arbitrary permutation of those steps.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520478#comment-14520478 ] Patrick Wendell commented on SPARK-7230:
Yeah, the goal is absolutely to support higher-level apps; this has always been a goal in every part of Spark. In fact, from the beginning we've made sure R is first class in the way we think about things like our package ecosystem (spark packages). The issue with what is there now is that there is this huge RDD API exposed, way beyond the basic parallelization primitives, and that needs more vetting before we ship it in a public API.
[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4
[ https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520635#comment-14520635 ] Patrick Wendell commented on SPARK-7230:
Yes - removing APIs is really difficult for existing users. That's why the proposal here will limit the number of exposed APIs substantially, because otherwise we will never be able to remove them. Part of merging into the upstream project is looking at which APIs the committership is comfortable supporting in the long term. As it stands, there isn't widespread support in the committership for supporting low-level ETL code in R in the long term. We'd rather have narrower and simpler APIs. Of course we'll make a good-faith effort to support APIs that are useful to existing projects.