[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337728#comment-15337728 ] Sean Owen commented on SPARK-6817:

No, the best thing is just bulk-changing the issues to stand-alone issues. I can do that.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
> Issue Type: New Feature
> Components: SparkR
> Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after
> merging into Spark.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337525#comment-15337525 ] Shivaram Venkataraman commented on SPARK-6817:

I think all the ones we need for 2.0 are completed here. [~srowen] Is there a clean way to mark the umbrella as complete for 2.0 and retarget the remaining issues for 2.1?
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264954#comment-15264954 ] Shivaram Venkataraman commented on SPARK-6817:

I just merged https://issues.apache.org/jira/browse/SPARK-12919, which contains the main part of the UDF work (`dapply`). I think we'll have a few follow-up PRs during the QA period, but let's leave this targeted at 2.0.
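For readers landing on this thread: a minimal sketch of what the merged `dapply` looks like, based on the SparkR API as it shipped in Spark 2.0; the data and column names are illustrative, not from the PR itself.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(x = 1:6))

# Declare the schema of the rows the R function returns.
schema <- structType(structField("x", "integer"),
                     structField("x2", "integer"))

# dapply passes each partition of the Spark DataFrame to the R function
# as a local R data.frame; the function must return a data.frame that
# matches the declared schema.
doubled <- dapply(df, function(pdf) {
  cbind(pdf, x2 = pdf$x * 2L)
}, schema)

head(collect(doubled))
```

The R function runs once per partition, not once per row, which is the batching design discussed further down in this thread.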
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264836#comment-15264836 ] Michael Armbrust commented on SPARK-6817:

[~shivaram] Still trying to get any of this into Spark 2.0? Or should we retarget?
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110299#comment-15110299 ] Felix Cheung commented on SPARK-6817:

Thanks for putting together the doc, [~sunrui]. In this design, how does one control the partitioning? For instance, suppose one would like to group a census-data DataFrame by a certain column, say MetropolitanArea, and then pass each group to R's kmeans to cluster residents within close-by geographical areas. For the R UDFs to be effective in this and some other cases, one would need to make sure the data is partitioned appropriately, and that mapPartition would produce a local R data.frame (assuming it fits into memory) that has all the relevant data in it?
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110359#comment-15110359 ] Sun Rui commented on SPARK-6817:

For dapply(), the user can call repartition() to set an appropriate number of partitions before calling dapply(). For gapply(), the SQL conf "spark.sql.shuffle.partitions" can be used to tune the number of partitions after the shuffle. I am also hoping that SPARK-9850 (Adaptive execution in Spark) could help.
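To make the grouping question concrete, here is a hedged sketch of the census/kmeans scenario using the `gapply(df, cols, func, schema)` signature as it was eventually added to SparkR; the `census` DataFrame, its columns, and the cluster count are hypothetical, not part of this design doc.

```r
library(SparkR)
# Tune the post-shuffle partition count via the SQL conf mentioned above.
sparkR.session(sparkConfig = list(spark.sql.shuffle.partitions = "8"))

# Hypothetical census DataFrame with one row per resident.
census <- createDataFrame(data.frame(
  MetropolitanArea = c("A", "A", "A", "B", "B", "B"),
  lat = runif(6),
  lon = runif(6)))

schema <- structType(structField("MetropolitanArea", "string"),
                     structField("cluster", "integer"))

# gapply shuffles so that all rows of one group reach a single R worker
# as one local data.frame, which is exactly what the kmeans case needs.
clusters <- gapply(census, "MetropolitanArea", function(key, pdf) {
  km <- kmeans(pdf[, c("lat", "lon")], centers = 2)
  data.frame(MetropolitanArea = key[[1]],
             cluster = as.integer(km$cluster),
             stringsAsFactors = FALSE)
}, schema)

head(collect(clusters))
```

Each group must fit in the R worker's memory, which is the caveat raised in the question above.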
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108097#comment-15108097 ] Sun Rui commented on SPARK-6817:

Moved SQL UDF related stuff to SPARK-12918.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108098#comment-15108098 ] Sun Rui commented on SPARK-6817:

I wrote an implementation document at https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit?usp=sharing; please help review it and give comments.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101805#comment-15101805 ] Sun Rui commented on SPARK-6817:

Spark now supports vectorized execution via columnar batches; see SPARK-12785 and SPARK-12635. I hope this can benefit SparkR UDFs.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095860#comment-15095860 ] Sun Rui commented on SPARK-6817:

I agree that R's efficiency comes from vectorization. Here a UDF is a function that can be invoked in SQL queries, which are row-oriented. But row orientation does not necessarily mean an R UDF must process one row at a time. Actually, the projected rows (according to the input parameters of a UDF) can be batched, or even passed as a whole partition (if OOM is not a concern), into an R worker. The R worker can load the batch of rows into in-memory vectors or lists, and the R UDF can still do vectorized operations. The other point is support for column-oriented UDFs, which are something like UDAFs, but I doubt UDAF is an exact match, because a UDAF returns only one value per column. In R, operations on a column may still return another non-scalar column.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095868#comment-15095868 ] Sun Rui commented on SPARK-6817:

If we think that column-oriented UDFs are more important, I can do them with a higher priority; I just need some help on the technical insights. I doubt UDAF is an exact match, because a UDAF returns only one value per column, while in R, operations on a column may still return another non-scalar column. And in order to seamlessly use existing R functions, the column in question needs to be loaded as a whole into memory as a vector or list, which may cause OOM if the column has too many rows.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096738#comment-15096738 ] Antonio Piccolboni commented on SPARK-6817:

So I am not sure row orientation means anything anymore. Could you please point me to any examples or documentation for "projected rows" and "batching" in UDFs? Sorry, I am writing my first few UDFs; I may just need more education. The mechanism I am using is to extend org.apache.hadoop.hive.ql.exec.UDF and provide an evaluate method. This is called once for each row. I don't have to invoke R at every call, but I have to return something, which is used directly or indirectly to create elements in a new column. Plus, and this may just be ignorance on my part, I don't know of any method that is invoked when all the data has been seen, or any other way to detect from evaluate that I am on the last record. I don't see how I can batch, say, 1000 rows and compute without this information. What happens when I have batched the last, incomplete batch and the final call to evaluate happens with no knowledge that it's the last one?
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097712#comment-15097712 ] Antonio Piccolboni commented on SPARK-6817:

Thanks!
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097723#comment-15097723 ] Sun Rui commented on SPARK-6817:

OK. I will follow the design of the original design doc, though I think the term UDF here is a little confusing. For dapply(), it basically sounds like passing an R function into DataFrame.mapPartitions(). The R function takes a local data.frame as its input parameter, which is converted from a partition of the DataFrame. For gapply(), it makes less sense to depend on UDAF, as: 1. a UDAF returns a single value; 2. a UDAF processes one row at a time, which is not efficient. Basically: convert the DataFrame to an RDD, then call RDD.groupBy(), and then feed the grouped values into the R worker. cc [~shivaram], any comments?
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097458#comment-15097458 ] Sun Rui commented on SPARK-6817:

Projecting and batching rows for UDFs are implementation optimizations that save communication cost between the JVM and a non-JVM interpreter like Python or R, so there is no documentation. Conceptually, a UDF is passed one row at a time. But for R, which typically handles vectors, it is feasible to transform a batch of rows into columns and pass the column vectors into the R UDF as arguments. This would need a clear statement that an R UDF is expected to handle vector arguments built from a batch of rows. The output of the R UDF is still vectors, which can be passed back to the JVM as the result. In this way, the UDF is actually called once per batch of rows. For a UDF you don't need to care about the last row; each row is processed independently.
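Plain base R illustrates the calling convention described above: if the runtime delivers a batch of projected rows as column vectors, one call to a vectorized R function covers the whole batch. Nothing here is SparkR API; the function and data are purely illustrative.

```r
# The UDF is written against vectors, not scalars: each argument holds
# one projected column for the whole batch of rows.
udf <- function(name, age) paste0(toupper(name), ":", age + 1L)

# One R call processes the entire batch; the result vector maps back
# to one output value per input row.
batch_name <- c("ann", "bob", "carol")
batch_age  <- c(30L, 41L, 52L)
udf(batch_name, batch_age)
# "ANN:31" "BOB:42" "CAROL:53"
```

Because each element is computed independently, the batch boundary (including the final, short batch) does not change the result.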
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097560#comment-15097560 ] Sun Rui commented on SPARK-6817:

https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python.scala#L349
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097552#comment-15097552 ] Antonio Piccolboni commented on SPARK-6817:

I need to see the code to understand it. Has this technique been used in PySpark? Thanks
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095588#comment-15095588 ] Sun Rui commented on SPARK-6817:

Attached the first draft of the design doc; please review and give comments.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095590#comment-15095590 ] Sun Rui commented on SPARK-6817:

[~mpollock], this PR will support row-based UDFs. UDFs operating on columns may be supported after R UDAFs are supported.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095600#comment-15095600 ] Sun Rui commented on SPARK-6817:

[~shivaram] I will first focus on the row-based UDF functionality. For high-level APIs like dapply(), I think that needs support for UDAFs, which is not included in this PR yet. I can create a new JIRA for supporting R UDAFs. Any comments?
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095594#comment-15095594 ] Sun Rui commented on SPARK-6817:

[~piccolbo] I am not sure I understand your meaning. This is to support UDFs written in R code; Spark already supports Scala/Python UDFs.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095756#comment-15095756 ] Weiqiang Zhuang commented on SPARK-6817:

We did see both apply use cases, but the block/group/column-oriented apply is more important if we can have it earlier.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095734#comment-15095734 ] Jeff Zhang commented on SPARK-6817:

+1 on a block-based API. A UDF would usually call other R packages, and most R packages are block-based (operating on R data.frames), which leads to a performance gain.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095747#comment-15095747 ] Reynold Xin commented on SPARK-6817:

Please take a look at the original design doc for this: https://docs.google.com/document/d/1xa8gB705QFybQD7qEe-NcZZOtkfA1YY-eVhyaXtAtOM/edit
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095745#comment-15095745 ] Sun Rui commented on SPARK-6817:

[~rxin] The row-oriented R UDF is for SQL and is similar to the Python UDF. I am not making the R UDF depend on RRDD; rather, I am abstracting out the reusable logic that can be shared by RRDD and the R UDF, which is also similar to the Python UDF. I don't know what "block" means in "block-oriented API". Something like GroupedData? I think that depends on UDAF support, which will come after UDF support. Maybe there is something I misunderstand?
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095776#comment-15095776 ] Antonio Piccolboni commented on SPARK-6817:

My question made sense only with respect to the block, or vectorized, design. If you are implementing plain-vanilla UDFs in R, my question is meaningless. The performance implications of calling an R function for each row are ominous, so I am not sure why you are going down this path. Imagine you want to add a column with random numbers from a distribution. You can use a regular UDF on each row or a block UDF on a block of a million rows. That means a single R call vs. a million:

  system.time(rnorm(10^6))
     user  system elapsed
    0.089   0.002   0.092

  z <- rep_len(1, 10^6); system.time(sapply(z, rnorm))
     user  system elapsed
    4.272   0.317   4.588

That's 45 times slower. Plus, R is chock-full of vectorized functions, and there are no built-in scalar types in R. So there are plenty of examples of block UDFs that one can write efficiently in R (no interpreter loops of any sort).
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721 ] Reynold Xin commented on SPARK-6817:

[~sunrui] Why are you focusing on a row-based API? I think the block-oriented API in the original Google Doc makes a lot more sense. I also don't want the UDFs to depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15064664#comment-15064664 ] Matt Pollock commented on SPARK-6817:

Will this only support UDFs that operate on a full DataFrame? A solution that operates on columns would perhaps be more useful, e.g. being able to use R package functions within filter and mutate.
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061420#comment-15061420 ] Antonio Piccolboni commented on SPARK-6817:

Will this form of partition UDF be available only in the R API, or will SQL and Python be supported too? Thanks
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059658#comment-15059658 ] Sun Rui commented on SPARK-6817:

Start working on it
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987101#comment-14987101 ] Michael Armbrust commented on SPARK-6817:

Should we bump this now that we are past code freeze for 1.6?
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720845#comment-14720845 ] Shivaram Venkataraman commented on SPARK-6817:
--
The idea behind having `dapplyCollect` was that it might be easier to implement, as the output doesn't necessarily need to be converted to a valid Spark DataFrame on the JVM and could instead be just any R data frame. But I agree that adding more keywords is confusing for users, and we could avoid this in a couple of ways: (1) implement the type conversion from R to the JVM first, so we wouldn't need this; (2) have a slightly different class on the JVM that only supports `collect` on it (i.e. not a DataFrame) and use that.

Regarding gapply -- SparkR (and dplyr) already have a `group_by` function that does the grouping, and in SparkR this returns a `GroupedData` object. Right now the only function available on the `GroupedData` object is `agg`, which performs aggregations on it. We could instead support `dapply` on `GroupedData` objects, and then the syntax would be something like

grouped_df <- group_by(df, df$city)
collect(dapply(grouped_df, function(group) {}))

cc [~rxin]
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717079#comment-14717079 ] Indrajit commented on SPARK-6817:
--
Here are some suggestions on the proposed API. If the idea is to keep the API close to R's current primitives, we should avoid introducing too many new keywords. E.g., dapplyCollect can be expressed as collect(dapply(...)). Since collect already exists in Spark, and R users are comfortable with the syntax as part of dplyr, we should reuse the keyword instead of introducing a new function dapplyCollect. Relying on existing syntax will reduce the learning curve for users. Was performance the primary motivation for introducing dapplyCollect instead of collect(dapply(...))?

Similarly, can we do away with gapply and gapplyCollect and express them using dapply? In R, the function split provides grouping (https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One should be able to implement split using GroupBy in Spark. gapply can then be expressed in terms of dapply and split, and gapplyCollect would become collect(dapply(..split..)). Here is a simple example that uses split and lapply in R:

df <- data.frame(city = c("A", "B", "A", "D"), age = c(10, 12, 23, 5))
print(df)
s <- split(df$age, df$city)
lapply(s, mean)
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711972#comment-14711972 ] Shivaram Venkataraman commented on SPARK-6817:
--
I've created a design doc for this at https://docs.google.com/document/d/1xa8gB705QFybQD7qEe-NcZZOtkfA1YY-eVhyaXtAtOM/edit#

One thing to note in the design is that the SparkR DataFrame UDFs will be more general than the Scala or Python versions and will allow access to entire partitions as data frames. Along with support for grouped UDFs, this API should be powerful enough to implement many complex operations.
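[Editorial note] The partition-level generality described in the design doc can be sketched in plain Python (no Spark involved); the function and variable names below are hypothetical illustrations of the concept, not the SparkR API:

```python
# Sketch of the difference between a row-level UDF and a partition-level
# UDF. A partition-level UDF receives a whole partition at once, so it can
# compute quantities that depend on the entire chunk (e.g. a local mean),
# which a row-at-a-time UDF cannot do.

def row_udf(row):
    # Row-level: sees one row at a time, transforms it independently.
    return row * 2

def partition_udf(rows):
    # Partition-level: sees the whole partition, here centering each value
    # around the partition's mean.
    m = sum(rows) / len(rows)
    return [r - m for r in rows]

partitions = [[1, 2, 3], [10, 20, 30]]

# Row-level apply: each element handled independently.
mapped = [[row_udf(r) for r in p] for p in partitions]

# Partition-level apply: each partition handled as a unit.
centered = [partition_udf(p) for p in partitions]
```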
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593823#comment-14593823 ] Aleksander Eskilson commented on SPARK-6817:
--
Makes sense, thanks for the clarification.
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591984#comment-14591984 ] Aleksander Eskilson commented on SPARK-6817:
--
It looks like this issue also relates to SPARK-3947. One commenter mentions expanding the existing SparkSQL API to support UDAFs for the GroupedData object as well as DataFrames. Will the scope of this issue also include SparkR's GroupedData implementation?
[jira] [Commented] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592087#comment-14592087 ] Shivaram Venkataraman commented on SPARK-6817:
--
I think there are two separate issues here -- one is to add just UDFs, which will work on rows during Select / Filter etc., and the other is UDAFs, which will work on grouped data. I think UDFs are the first step to do in SparkR, as there is already support for this in Scala and Python.

Supporting UDAFs will be the next step, but I think that will be blocked on SPARK-3947.
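[Editorial note] The row-level-UDF vs. grouped-UDAF distinction above can be sketched in plain Python (no Spark required); the data and names here are hypothetical, not SparkR's API:

```python
# A row-level UDF transforms or filters each row independently (usable in
# select/filter); a UDAF-style function reduces each group of rows to a
# single value. Grouping is emulated here with itertools.groupby.
from itertools import groupby
from operator import itemgetter

rows = [("A", 10), ("B", 12), ("A", 23), ("D", 5)]

# Row-level UDF: a per-row predicate, applied during a filter.
adult = [(city, age) for city, age in rows if age >= 10]

# UDAF-style: mean age per city, one output value per group.
by_city = sorted(rows, key=itemgetter(0))  # groupby needs sorted input
means = {}
for city, grp in groupby(by_city, key=itemgetter(0)):
    ages = [age for _, age in grp]
    means[city] = sum(ages) / len(ages)
```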