[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-06-18 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337728#comment-15337728
 ] 

Sean Owen commented on SPARK-6817:
--

No, the best thing is just bulk-changing the issues to stand-alone issues. I 
can do that.

> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-06-17 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337525#comment-15337525
 ] 

Shivaram Venkataraman commented on SPARK-6817:
--

I think all the ones we need for 2.0 are completed here.

[~srowen] Is there a clean way to mark the umbrella as complete for 2.0 and 
retarget the remaining issues for 2.1?




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-04-29 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264954#comment-15264954
 ] 

Shivaram Venkataraman commented on SPARK-6817:
--

I just merged https://issues.apache.org/jira/browse/SPARK-12919, which contains 
the main part of UDFs (`dapply`). I think we'll have a few follow-up PRs during 
the QA period, but let's keep this targeted at 2.0.






[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-04-29 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264836#comment-15264836
 ] 

Michael Armbrust commented on SPARK-6817:
-

[~shivaram] Still trying to get any of this into Spark 2.0?  Or should we retarget?




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-21 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110299#comment-15110299
 ] 

Felix Cheung commented on SPARK-6817:
-

Thanks for putting together the doc, [~sunrui].
In this design, how does one control the partitioning? For instance, suppose 
one would like to group a census-data DataFrame by a certain column, say 
MetropolitanArea, and then pass each group to R's kmeans to cluster residents 
within close-by geographical areas. For the R UDFs to be effective in this and 
similar cases, one would need to make sure the data is partitioned 
appropriately, and that mapPartitions would produce a local R data.frame 
(assuming it fits into memory) containing all the relevant data?
 




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-21 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15110359#comment-15110359
 ] 

Sun Rui commented on SPARK-6817:


For dapply(), users can call repartition() to set an appropriate number of 
partitions before calling dapply().

For gapply(), the SQL conf "spark.sql.shuffle.partitions" can be used to tune 
the number of partitions after a shuffle. I am also hoping SPARK-9850 (adaptive 
execution in Spark) could help.
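
As an illustration of the two knobs above, a minimal SparkR sketch (assuming 
the SparkR 2.0 API with dapply() available and a working Spark session; the 
dataset, schema, and partition counts here are purely illustrative):

```r
# Sketch only: assumes SparkR 2.0 (sparkR.session, dapply) on a running cluster.
library(SparkR)
sparkR.session(sparkConfig = list(spark.sql.shuffle.partitions = "8"))

df <- createDataFrame(faithful)

# Control dapply() parallelism by repartitioning before the call.
df <- repartition(df, 4L)

schema <- structType(structField("eruptions", "double"),
                     structField("waiting", "double"))

# The function receives each partition as a local R data.frame.
scaled <- dapply(df, function(part) {
  part$waiting <- part$waiting / 60  # e.g. minutes -> hours
  part
}, schema)
head(collect(scaled))
```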




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-19 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108097#comment-15108097
 ] 

Sun Rui commented on SPARK-6817:


Moved SQL UDF related stuff to SPARK-12918.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-19 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15108098#comment-15108098
 ] 

Sun Rui commented on SPARK-6817:


I wrote an implementation document at 
https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit?usp=sharing,
 please review it and give comments.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-15 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15101805#comment-15101805
 ] 

Sun Rui commented on SPARK-6817:


Spark now supports vectorized execution via columnar batches; see 
SPARK-12785 and SPARK-12635. I hope this can benefit SparkR UDFs.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-13 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095860#comment-15095860
 ] 

Sun Rui commented on SPARK-6817:


I agree that R's efficiency comes from vectorization. Here a UDF is a function 
that can be invoked in SQL queries, which is row-oriented. But row orientation 
does not necessarily mean an R UDF will process one row at a time. Actually, 
projected rows (according to the input parameters of a UDF) can be batched, or 
even taken as a whole partition (if OOM is not a concern), and then passed into 
an R worker. The R worker can load the batch of rows into vectors or lists in 
memory, and the R UDF can still do vectorized operations.

The point here is support for column-oriented UDFs, which is something like a 
UDAF. But I doubt a UDAF is an exact match, because a UDAF returns only one 
value for a column, while in R, operations on a column may still return another 
non-scalar column.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-13 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095868#comment-15095868
 ] 

Sun Rui commented on SPARK-6817:


If we think that column-oriented UDFs are more important, I can do them with a 
higher priority; I just need some help on technical insights. I doubt a UDAF is 
an exact match, because a UDAF returns only one value for a column, while in R, 
operations on a column may still return another non-scalar column. And in order 
to seamlessly use existing R functions, the column in question needs to be 
loaded as a whole into memory as a vector or list, which may hit OOM if the 
column has too many rows.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-13 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15096738#comment-15096738
 ] 

Antonio Piccolboni commented on SPARK-6817:
---

So I am not sure row orientation means anything anymore. Could you please point 
me to any examples or documentation for "projected rows" and "batching" UDFs? 
Sorry, I am writing my first few UDFs, so I may just need more education. The 
mechanism I am using is to extend org.apache.hadoop.hive.ql.exec.UDF and 
provide an evaluate method. This is called once for each row. I don't have to 
invoke R at every call, but I have to return something, which is used directly 
or indirectly to create elements in a new column. Plus, and this may just be 
ignorance on my part, I don't know of any method that is invoked when all the 
data has been seen, or any other way to detect from evaluate that I am on the 
last record. I don't see how I can batch, say, 1000 rows and compute without 
this information. What happens when I have batched the last, incomplete batch 
and the final call to evaluate happens with no knowledge that it's the last one?




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-13 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097712#comment-15097712
 ] 

Antonio Piccolboni commented on SPARK-6817:
---

Thanks!




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-13 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097723#comment-15097723
 ] 

Sun Rui commented on SPARK-6817:


OK, I will follow the design of the original design doc, though I think the 
term UDF is a bit confusing here.

For dapply(), it basically sounds like passing an R function into 
DataFrame.mapPartitions(). The R function takes a local data.frame as its input 
parameter, which is converted from a partition of the DataFrame.

For gapply(), it makes less sense to depend on UDAFs, as (1) a UDAF returns a 
single value, and (2) a UDAF processes one row at a time, which is not 
efficient. Basically, gapply() converts the DataFrame to an RDD, calls 
RDD.groupBy(), and then feeds the grouped values into an R worker.

cc [~shivaram], any comments?
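
The dapply()-as-mapPartitions semantics described above can be mimicked in 
plain R (a hypothetical local simulation with made-up names; the real dapply() 
ships the function to executors and handles serialization):

```r
# Local sketch: the user function receives each "partition" as a data.frame
# and returns a data.frame; results are recombined afterwards.
simulate_dapply <- function(df, func, num_partitions = 2) {
  idx <- rep(seq_len(num_partitions), length.out = nrow(df))
  parts <- split(df, idx)               # fake partitions of the data
  do.call(rbind, lapply(parts, func))   # apply func per partition, recombine
}

res <- simulate_dapply(head(iris, 6), function(part) {
  transform(part, Sepal.Sum = Sepal.Length + Sepal.Width)
})
nrow(res)  # 6 rows, with the new Sepal.Sum column
```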




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-13 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097458#comment-15097458
 ] 

Sun Rui commented on SPARK-6817:


Projecting and batching rows for UDFs are implementation optimizations that 
save communication cost between the JVM and a non-JVM interpreter like Python 
or R, so there is no documentation. Conceptually, a UDF is passed one row at a 
time. But for R, which typically handles vectors, it is feasible to transform a 
batch of rows into columns and pass the column vectors into the R UDF as 
arguments. This may need a clear statement that an R UDF is expected to handle 
vector arguments built from a batch of rows. The output of the R UDF is still 
vectors, which can be passed back to the JVM as the result. In this way, the 
UDF is actually called once per batch of rows.

For a UDF, you don't need to care about the last row; each row is processed 
independently.
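
The batching idea might be sketched in plain R as follows (illustrative names 
only; this is not a Spark API):

```r
# A "batch" of projected rows arrives from the JVM side.
rows <- list(list(a = 1, b = 10), list(a = 2, b = 20), list(a = 3, b = 30))

# Pivot the row batch into column vectors.
cols <- list(a = sapply(rows, `[[`, "a"),
             b = sapply(rows, `[[`, "b"))

# A vectorized "UDF": one R call handles the whole batch.
udf <- function(a, b) a + b
result <- do.call(udf, cols)
result  # 11 22 33, one output element per input row
```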




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-13 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097560#comment-15097560
 ] 

Sun Rui commented on SPARK-6817:


https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/python.scala#L349




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-13 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15097552#comment-15097552
 ] 

Antonio Piccolboni commented on SPARK-6817:
---

I need to see the code to understand. Has this technique been used in PySpark? 
Thanks




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095588#comment-15095588
 ] 

Sun Rui commented on SPARK-6817:


Attached the first draft of the design doc; please review and give comments.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095590#comment-15095590
 ] 

Sun Rui commented on SPARK-6817:


[~mpollock], this PR will support row-based UDFs. UDFs operating on columns may 
be supported after R UDAFs are supported.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095600#comment-15095600
 ] 

Sun Rui commented on SPARK-6817:


[~shivaram] I will first focus on the row-based UDF functionality. For 
high-level APIs like dapply(), I think that needs UDAF support, which is not in 
this PR yet. I can create a new JIRA for supporting R UDAFs. Any comments?




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095594#comment-15095594
 ] 

Sun Rui commented on SPARK-6817:


[~piccolbo] I am not sure if I understand your meaning. This is to support UDFs 
written in R code. Spark already supports Scala/Python UDFs.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Weiqiang Zhuang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095756#comment-15095756
 ] 

Weiqiang Zhuang commented on SPARK-6817:


We did see both apply use cases, but the block/group/column-oriented apply is 
more important if we can have it earlier.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095734#comment-15095734
 ] 

Jeff Zhang commented on SPARK-6817:
---

+1 on a block-based API. UDFs usually call other R packages, and most R 
packages operate on blocks (R's data.frame), which leads to a performance gain.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095747#comment-15095747
 ] 

Reynold Xin commented on SPARK-6817:


Please take a look at the original design doc for this: 
https://docs.google.com/document/d/1xa8gB705QFybQD7qEe-NcZZOtkfA1YY-eVhyaXtAtOM/edit






[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095745#comment-15095745
 ] 

Sun Rui commented on SPARK-6817:


[~rxin] The row-oriented R UDF is for SQL and is similar to Python UDFs. I am 
not making the R UDF depend on RRDD, but abstracting the reusable logic that 
can be shared by RRDD and R UDFs, which is also similar to Python UDFs.
I don't know what "block" means in "block oriented API". Something like 
GroupedData? I think that depends on UDAF support, which will come after UDF 
support. Maybe I am misunderstanding something?




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095776#comment-15095776
 ] 

Antonio Piccolboni commented on SPARK-6817:
---

My question made sense only with respect to the block or vectorized design; if 
you are implementing plain-vanilla UDFs in R, my question is meaningless. The 
performance implications of calling an R function for each row are ominous, so 
I am not sure why you are going down this path. Imagine you want to add a 
column with random numbers from a distribution. You can use a regular UDF on 
each row, or a block UDF on a block of a million rows. That means a single R 
call vs. a million:

> system.time(rnorm(10^6))
   user  system elapsed 
  0.089   0.002   0.092 
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed 
  4.272   0.317   4.588 

That's 45 times slower. Plus, R is chock-full of vectorized functions, and 
there are no built-in scalar types in R, so there are plenty of block UDFs that 
one can write efficiently in R (no interpreter loops of any sort).




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15095721#comment-15095721
 ] 

Reynold Xin commented on SPARK-6817:


[~sunrui]

Why are you focusing on a row-based API? I think a block oriented API in the 
original Google Docs makes a lot more sense. I also don't want the UDF to 
depend on RRDD, because we are going to remove RRDD from Spark once the UDFs 
are implemented.






[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-12-18 Thread Matt Pollock (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15064664#comment-15064664
 ] 

Matt Pollock commented on SPARK-6817:
-

Will this only support UDFs that operate on a full DataFrame? A solution that 
operates on columns would perhaps be more useful, e.g. being able to use R 
package functions within filter and mutate.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-12-16 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061420#comment-15061420
 ] 

Antonio Piccolboni commented on SPARK-6817:
---

Will this form of partition UDF be available only as an R API, or will SQL and 
Python be supported? Thanks




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-12-15 Thread Sun Rui (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15059658#comment-15059658
 ] 

Sun Rui commented on SPARK-6817:


Starting to work on it.




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-11-03 Thread Michael Armbrust (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987101#comment-14987101
 ] 

Michael Armbrust commented on SPARK-6817:
-

Should we bump this now that we are past code freeze for 1.6?




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-08-28 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14720845#comment-14720845
 ] 

Shivaram Venkataraman commented on SPARK-6817:
--

The idea behind having `dapplyCollect` was that it might be easier to 
implement, as the output doesn't necessarily need to be converted to a valid 
Spark DataFrame on the JVM and could instead be just any R data frame. But I 
agree that adding more keywords is confusing for users, and we could avoid this 
in a couple of ways:
(1) implement the type conversion from R to the JVM first, so we wouldn't need 
this
(2) have a slightly different class on the JVM that only supports collect on it 
(i.e. not a DataFrame) and use that to implement dapplyCollect

Regarding gapply -- SparkR (and dplyr) already have a `group_by` function that 
does the grouping, and in SparkR this returns a `GroupedData` object. Right now 
the only function available on a `GroupedData` object is `agg`, which performs 
aggregations on it. We could instead support `dapply` on `GroupedData` objects, 
and then the syntax would be something like

grouped_df <- group_by(df, df$city)
collect(dapply(grouped_df, function(group) {}))

cc [~rxin]
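A slightly fuller sketch of that grouped variant, with the UDF body filled in; the exact signature is an assumption for illustration, not a settled API:

```r
# Hedged sketch of the grouped dapply proposed above; signature and
# return-type handling are illustrative assumptions.
grouped_df <- group_by(df, df$city)
result <- dapply(grouped_df, function(group) {
  # group is an R data.frame holding all rows for one city
  data.frame(city = group$city[1], avg_age = mean(group$age))
})
collect(result)
```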




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-08-27 Thread Indrajit (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14717079#comment-14717079
 ] 

Indrajit  commented on SPARK-6817:
--

Here are some suggestions on the proposed API. If the idea is to keep the API 
close to R's current primitives, we should avoid introducing too many new 
keywords. E.g., dapplyCollect can be expressed as collect(dapply(...)). Since 
collect already exists in Spark, and R users are comfortable with the syntax as 
part of dplyr, we should reuse the existing keyword instead of introducing a 
new function dapplyCollect. Relying on existing syntax will reduce the learning 
curve for users. Was performance the primary reason to introduce dapplyCollect 
instead of collect(dapply(...))?

Similarly, can we do away with gapply and gapplyCollect, and express it using 
dapply? In R, the function split provides grouping 
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One 
should be able to implement split using GroupBy in Spark.
gapply can then be expressed in terms of dapply and split, and gapplyCollect 
will become collect(dapply(..split..)). 
Here is a simple example that uses split and lapply in R:

df <- data.frame(city = c("A", "B", "A", "D"), age = c(10, 12, 23, 5))
print(df)
s <- split(df$age, df$city)  # list of age vectors, one per city
lapply(s, mean)              # $A 16.5, $B 12, $D 5




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-08-25 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14711972#comment-14711972
 ] 

Shivaram Venkataraman commented on SPARK-6817:
--

I've created a design doc for this at 
https://docs.google.com/document/d/1xa8gB705QFybQD7qEe-NcZZOtkfA1YY-eVhyaXtAtOM/edit#
 
One thing to note in the design is that the SparkR DataFrame UDFs will be more 
general than the Scala or Python versions and allow access to entire partitions 
as data frames. Along with support for grouped UDFs this API should be powerful 
enough to implement many complex operations.
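To make the "entire partitions as data frames" point concrete, a minimal sketch, assuming a dapply-style entry point as in the design doc; names and signatures are illustrative, not the final API:

```r
# Minimal sketch of partition-level access: the UDF sees each partition
# as a plain R data.frame and can summarize it wholesale. The
# dapply/structType signatures are illustrative assumptions.
counts <- dapply(df,
                 function(part) {
                   # one output row per partition
                   data.frame(n = nrow(part))
                 },
                 structType(structField("n", "integer")))
collect(counts)
```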




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-06-19 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14593823#comment-14593823
 ] 

Aleksander Eskilson commented on SPARK-6817:


Makes sense, thanks for the clarification. 




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-06-18 Thread Aleksander Eskilson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591984#comment-14591984
 ] 

Aleksander Eskilson commented on SPARK-6817:


It looks like this issue also relates to SPARK-3947. One commenter mentions 
expanding the existing Spark SQL API to support UDAFs for the GroupedData 
object as well as DataFrames. Will the scope of this issue also include 
SparkR's GroupedData implementation?




[jira] [Commented] (SPARK-6817) DataFrame UDFs in R

2015-06-18 Thread Shivaram Venkataraman (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592087#comment-14592087
 ] 

Shivaram Venkataraman commented on SPARK-6817:
--

I think there are two separate issues here -- one is to add UDFs, which will 
work on rows during select / filter etc., and the other is UDAFs, which will 
work on grouped data. I think UDFs are the first step to do in SparkR, as there 
is already support for them in Scala and Python.

Supporting UDAFs will be the next step, but I think that will be blocked on 
SPARK-3947.
