[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-08-02 Thread Michael Smith (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651023#comment-14651023
 ] 

Michael Smith commented on SPARK-7230:
--

I support Antonio's request to bring back this functionality in version 1.5 so 
that plyrmr can continue to be used with the Spark backend as before. 

 Make RDD API private in SparkR for Spark 1.4
 

 Key: SPARK-7230
 URL: https://issues.apache.org/jira/browse/SPARK-7230
 Project: Spark
  Issue Type: Sub-task
  Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
Assignee: Shivaram Venkataraman
Priority: Critical
 Fix For: 1.4.0


 This ticket proposes making the RDD API in SparkR private for the 1.4 
 release. The motivation for doing so is discussed in a larger design 
 document aimed at a more top-down design of the SparkR APIs. A first cut that 
 discusses the motivation and proposed changes can be found at http://goo.gl/GLHKZI
 The main points in that document that relate to this ticket are:
 - The RDD API requires knowledge of the distributed system and is pretty low 
 level. This is not very suitable for many R users, who are used to 
 high-level packages that work out of the box.
 - The RDD implementation in SparkR is not fully robust right now: we are 
 missing features like spilling for aggregation, handling partitions that 
 don't fit in memory, etc. There are further limitations, like the lack of 
 hashCode for non-native types, which might affect the user experience.
 The only change we will make for now is to not export the RDD functions as 
 public methods in the SparkR package. I will create another ticket for 
 discussing the details of the public API for 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-05-29 Thread Vincent Warmerdam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14564943#comment-14564943
 ] 

Vincent Warmerdam commented on SPARK-7230:
--

[~shivaram] On it; you should see some pull requests soon. I'll also add a pull 
request for the master GitHub branch, but that one only edits the README file 
to carry a similar message. 




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-05-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563593#comment-14563593
 ] 

Shivaram Venkataraman commented on SPARK-7230:
--

[~cantdutchthis] Thanks both for the write-up and for the heads-up that the 
website at amplab-extras/SparkR-pkg is out of date. The old SparkR website 
lives at https://github.com/amplab-extras/SparkR-pkg/tree/gh-pages -- Let me 
know if you are interested in making this change; if so, you can just open a 
PR.  




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-05-28 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563563#comment-14563563
 ] 

Antonio Piccolboni commented on SPARK-7230:
---

And if I may add a side note, I highly recommend Vincent's post 
(http://blog.rstudio.org/2015/05/28/sparkr-preview-by-vincent-warmerdam/). I 
think it's a fitting epitaph for the RDD API. 




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-05-28 Thread Vincent Warmerdam (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14563554#comment-14563554
 ] 

Vincent Warmerdam commented on SPARK-7230:
--

If this decision is now final, it might be good to explicitly communicate this 
API change on the old SparkR GitHub page. 

Although I should have been reading the news on SparkR from here, I was using 
https://github.com/amplab-extras/SparkR-pkg as a reference until now because 
the documentation was better. Communicating the change there would alleviate 
some pain for developers who are just starting out. 




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-05-07 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532101#comment-14532101
 ] 

Sun Rui commented on SPARK-7230:


One question here: there are still some basic RDD API methods provided on 
DataFrame, like map()/flatMap()/mapPartitions() and foreach(). What's our 
policy on these methods? Will we also make them private for 1.4, or will we 
support them for the long term?





[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-05-07 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532106#comment-14532106
 ] 

Reynold Xin commented on SPARK-7230:


We should hide them for now.  As a matter of fact, I think those shouldn't even 
exist in the Scala/Python version of DataFrames, but those are hard to remove 
now.





[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-05-07 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14533728#comment-14533728
 ] 

Sun Rui commented on SPARK-7230:


[~shivaram], got it. Thanks.




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-05-07 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14532913#comment-14532913
 ] 

Shivaram Venkataraman commented on SPARK-7230:
--

Actually, with the namespace change in PR 5895 these have been made private. We 
no longer export `map`, `flatMap`, etc. in SparkR's namespace, and this applies 
to both RDD and DataFrame. They are still available as a private API, so we can 
use them if required for implementing newer APIs, ML algorithms, etc.
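For readers less familiar with R packaging, "private" here just means the functions are dropped from the package's NAMESPACE export list, so they stop being part of the public surface but remain reachable via the `:::` operator. A hypothetical sketch (the actual SparkR NAMESPACE and function names may differ; `lapply`/`lapplyPartition` are the RDD-level functions mentioned elsewhere in this thread):

```r
## NAMESPACE (hypothetical fragment): only DataFrame-level functions
## are exported...
# export(createDataFrame)
# export(select)
# export(filter)
## ...and there are simply no export() entries for map, flatMap,
## lapplyPartition, etc., which makes them private.

## Unexported functions can still be reached with ':::', at the cost
## of depending on internals that may change without notice:
# doubled <- SparkR:::lapply(rdd, function(x) x * 2)  # 'rdd' assumed to exist
```

This is why downstream packages built on the RDD API keep working only if they accept a `SparkR:::` dependency on private internals.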




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-05-04 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14527402#comment-14527402
 ] 

Apache Spark commented on SPARK-7230:
-

User 'shivaram' has created a pull request for this issue:
https://github.com/apache/spark/pull/5895




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520058#comment-14520058
 ] 

Patrick Wendell commented on SPARK-7230:


I think this is a good idea. We should expose a narrower, higher-level API here 
and then look at user feedback to understand whether we want to support 
something lower level. From my experience with PySpark, it was a huge effort 
(probably more than 5X the original contribution) to actually implement 
everything in the lowest-level Spark APIs. And for the R community I don't 
think those low-level ETL APIs are that useful. So I'd be inclined to keep it 
simple at the beginning and then add complexity if we see new user demand.




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520424#comment-14520424
 ] 

Antonio Piccolboni commented on SPARK-7230:
---

plyrmr on Spark depends on the RDD API and has hundreds of downloads per month. 
We also have an experimental doParallelSpark that interfaces Spark with 
foreach, and hence with 50+ R packages, including mainstream ones like caret. I 
know the current API wasn't meant to be stable, but retiring the whole thing 
is, I think, a declaration of war on people developing on top of it. Sure, you 
are going to have more mainstream appeal with the proposed changes and the 
DataFrame API, no discussion, but as far as appealing to developers goes, it's 
an unambiguous F-you directed at them. rmr2 is a package that interfaces R with 
MapReduce at a similar level of abstraction as SparkR, and it has thousands of 
downloads per month and a commercial product based on it. 




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520465#comment-14520465
 ] 

Shivaram Venkataraman commented on SPARK-7230:
--

[~piccolbo] Thanks for your input. I just want to clarify a couple of things 
that I might not have made clear in the ticket description

1. The proposal is not to retire the API but just to make it private for the 
Spark 1.4 release while we figure out what the API should be. The main problem 
is that the RDD API is very verbose, with a large number of low-level ETL 
functions. Additionally, it opens up a number of failure modes which we don't 
want to expose to end users in a release. In fact, some of the closure-cleaning 
bugs that you have found are examples of things which are not well specified in 
the RDD API at this point.

2. Supporting packages like plyrmr on top of SparkR is definitely something we 
would like to do. If you take a look at the first-cut design doc linked above 
(http://goo.gl/GLHKZI), the proposal is to move towards an API similar to snow 
or `parallel`, which are successful existing R packages. And as far as I know, 
`rmr2` also has a simple API with only a few functions. Finally, I took a look 
at the plyrmr codebase as well, and from what I can see the main SparkR 
functions currently used in plyrmr are `lapply` and `lapplyPartition`. This is 
pretty similar to the API required for the use cases described in the doc above.

3. Finally, we can continue the discussion of which functions should be part of 
the API, and what the contracts for UDFs should be, in 
https://issues.apache.org/jira/browse/SPARK-7264 -- The goal is to have that 
for the Spark 1.5 release, and we'd definitely like your input on it. 




[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520486#comment-14520486
 ] 

Reynold Xin commented on SPARK-7230:


The existing SparkR package is still out there, and you can still use it, can't 
you? https://github.com/amplab-extras/SparkR-pkg

Note that we are not removing anything, since this was never released ...





[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14520479#comment-14520479
 ] 

Antonio Piccolboni commented on SPARK-7230:
---

If you make a call private, you break every single package that uses it; it's 
the same as erasing the code. What do you want me to do, sprinkle my code 
with SparkR::: to make it work? Fork SparkR to maintain a different NAMESPACE 
file? The reasonable course here is to a) announce big changes to the API and 
call for proposals and discussion, b) implement the new API, say in 1.5, and 
c) retire the old API. It looks to me like you are applying an arbitrary 
permutation of those steps. 
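
For the record, R's ::: operator does reach a package's non-exported 
objects, so the workaround alluded to above would look roughly like this 
(a sketch only, using illustrative pre-1.4 RDD function names, and not a 
supported usage pattern):

```r
library(SparkR)
sc <- sparkR.init(master = "local")

# Once the RDD functions are no longer exported, downstream code can only
# reach them through the package-internal namespace via ':::'.
rdd     <- SparkR:::parallelize(sc, 1:100, 2)   # previously: parallelize(sc, ...)
doubled <- SparkR:::map(rdd, function(x) x * 2)
SparkR:::collect(doubled)
```

The obvious cost is that ::: bypasses the package's public contract, so 
code written this way can break silently on any SparkR release.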


[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520478#comment-14520478
 ] 

Patrick Wendell commented on SPARK-7230:


Yeah, the goal is absolutely to support higher-level apps; this has always 
been a goal in every part of Spark. In fact, from the beginning we've made 
sure R is first class in the way we think about things like our package 
ecosystem (Spark packages). The issue with what is there now is that a huge 
RDD API is exposed, way beyond the basic parallelization primitives, and that 
needs more vetting before we ship it as a public API.


[jira] [Commented] (SPARK-7230) Make RDD API private in SparkR for Spark 1.4

2015-04-29 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7230?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14520635#comment-14520635
 ] 

Patrick Wendell commented on SPARK-7230:


Yes, removing APIs is really difficult for existing users. That's why the 
proposal here limits the number of exposed APIs substantially; otherwise we 
will never be able to remove them. Part of merging into the upstream project 
is looking at which APIs the committership is comfortable supporting in the 
long term. As it stands, there isn't widespread support in the committership 
for maintaining low-level ETL code in R over the long term. We'd rather have 
narrower and simpler APIs.

Of course, we'll make a good-faith effort to support APIs that are useful to 
existing projects.
