[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095721#comment-15095721
 ] 

Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:57 AM:
-

[~sunrui]

Why are you focusing on a row-based API? I think the block-oriented API in the 
original Google Docs makes more sense. I also don't think the UDF should depend 
on RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.




was (Author: rxin):
[~sunrui]

Why are you focusing on a row-based API? I think the block-oriented API in the 
original Google Docs makes more sense. I also don't want the UDF to depend on 
RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.



> DataFrame UDFs in R
> ---
>
> Key: SPARK-6817
> URL: https://issues.apache.org/jira/browse/SPARK-6817
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR, SQL
>Reporter: Shivaram Venkataraman
> Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after 
> merging into Spark.






[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095721#comment-15095721
 ] 

Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:58 AM:
-

[~sunrui]

Why are you focusing on a row-based API? I think the block-oriented API in the 
original Google Docs makes more sense. I also don't think the UDF should depend 
on RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.

In order to support the row-oriented API efficiently, we'd need to replicate 
all the infrastructure built for Python. I don't think that is maintainable in 
the long run.
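
To make the difference concrete, here is a rough sketch of the two styles on 
the R side (hypothetical signatures, not an actual Spark API):

{code}
# Row-oriented (hypothetical): invoked once per row, so the R interpreter
# is entered N times for N rows.
row_udf <- function(row) {
  row$age + 1
}

# Block-oriented (hypothetical): receives a whole partition as a data.frame
# and returns a data.frame, entering the interpreter once per block.
block_udf <- function(block) {
  transform(block, age = age + 1)
}
{code}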



was (Author: rxin):
[~sunrui]

Why are you focusing on a row-based API? I think the block-oriented API in the 
original Google Docs makes more sense. I also don't think the UDF should depend 
on RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.









[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Jeff Zhang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095734#comment-15095734
 ] 

Jeff Zhang edited comment on SPARK-6817 at 1/13/16 7:09 AM:


+1 on a block-based API. UDFs would usually call other R packages, and most R 
packages are block based (they operate on R's data.frame), so this leads to a 
performance gain.
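
For example (illustrative only, with made-up column names), a block UDF can 
hand the whole block to an existing package function in a single call:

{code}
# Illustrative block UDF: one lm() call per block instead of one per row.
block_udf <- function(block) {                # block is an R data.frame
  fit <- lm(age ~ height, data = block)       # hypothetical columns
  data.frame(intercept = coef(fit)[1], slope = coef(fit)[2])
}
{code}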


was (Author: zjffdu):
+1 on a block-based API. UDFs would usually call other R packages, and most R 
packages are for block-based data (R's data.frame), and this leads to a 
performance gain.







[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095721#comment-15095721
 ] 

Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:57 AM:
-

[~sunrui]

Why are you focusing on a row-based API? I think the block-oriented API in the 
original Google Docs makes more sense. I also don't want the UDF to depend on 
RRDD, because we are going to remove RRDD from Spark once the UDFs are 
implemented.




was (Author: rxin):
[~sunrui]

Why are you focusing on a row-based API? I think the block-oriented API in the 
original Google Docs makes a lot more sense. I also don't want the UDF to 
depend on RRDD, because we are going to remove RRDD from Spark once the UDFs 
are implemented.









[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2016-01-12 Thread Antonio Piccolboni (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095776#comment-15095776
 ] 

Antonio Piccolboni edited comment on SPARK-6817 at 1/13/16 7:41 AM:


My question made sense only wrt the block or vectorized design. If you are 
implementing plain-vanilla UDFs in R, my question is meaningless. The 
performance implications of calling an R function for each row are ominous, so 
I am not sure why you are going down this path. Imagine you want to add a 
column with random numbers from a distribution. You can use a regular UDF on 
each row or a block UDF on a block of a million rows. That means a single R 
call vs a million.

{code}
> system.time(rnorm(10^6))
   user  system elapsed 
  0.089   0.002   0.092 
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed 
  4.272   0.317   4.588 
{code}

That's roughly fifty times slower. Plus R is chock-full of vectorized 
functions. There are no built-in scalar types in R. So there are plenty of 
examples of block UDFs that one can write in R efficiently (no interpreter 
loops of any sort).
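
For instance (an illustrative sketch; the column name is made up), 
standardizing a numeric column touches a million rows with a handful of 
vectorized calls and no explicit loop:

{code}
# Illustrative block UDF: vectorized ops only, no per-row interpreter entry.
block_udf <- function(block) {                # block is an R data.frame
  block$age_z <- (block$age - mean(block$age)) / sd(block$age)
  block
}
{code}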


was (Author: piccolbo):
My question made sense only wrt the block or vectorized design. If you are 
implementing plain-vanilla UDFs in R, my question is meaningless. The 
performance implications of calling an R function for each row are ominous, so 
I am not sure why you are going down this path. Imagine you want to add a 
column with random numbers from a distribution. You can use a regular UDF on 
each row or a block UDF on a block of a million rows. That means a single R 
call vs a million.

{code}
> system.time(rnorm(10^6))
   user  system elapsed 
  0.089   0.002   0.092 
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed 
  4.272   0.317   4.588 
{code}

That's roughly fifty times slower. Plus R is chock-full of vectorized 
functions. There are no built-in scalar types in R. So there are plenty of 
examples of block UDFs that one can write in R efficiently (no interpreter 
loops of any sort.







[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R

2015-08-29 Thread Reynold Xin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717079#comment-14717079
 ] 

Reynold Xin edited comment on SPARK-6817 at 8/30/15 1:14 AM:
-

Here are some suggestions on the proposed API. If the idea is to keep the API 
close to R's current primitives, we should avoid introducing too many new 
keywords. E.g., dapplyCollect can be expressed as collect(dapply(...)). Since 
collect already exists in Spark, and R users are comfortable with the syntax 
as part of dplyr, we should reuse the keyword instead of introducing a new 
function dapplyCollect. Relying on existing syntax will reduce the learning 
curve for users. Was performance the primary intent of introducing 
dapplyCollect instead of collect(dapply(...))?
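
In other words (a sketch, using the signatures from the proposal, where 
schema describes the output):

{code}
# Equivalent spellings under the proposal (func and schema as proposed):
res1 <- dapplyCollect(df, func)            # proposed shorthand
res2 <- collect(dapply(df, func, schema))  # composition of existing verbs
{code}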

Similarly, can we do away with gapply and gapplyCollect, and express it using 
dapply? In R, the function split provides grouping 
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One 
should be able to implement split using GroupBy in Spark.
gapply can then be expressed in terms of dapply and split, and gapplyCollect 
will become collect(dapply(..split..)). 
Here is a simple example that uses split and lapply in R:

{code}
df <- data.frame(city=c("A","B","A","D"), age=c(10,12,23,5))
print(df)
s <- split(df$age, df$city)
lapply(s, mean)
{code}
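
Following the same pattern (a hypothetical sketch, not an implemented API; it 
assumes the engine has already grouped rows by the key so that each block 
holds whole groups), gapply could then reduce to dapply plus split:

{code}
# Hypothetical: express a grouped apply via dapply + an in-block split.
gapply_via_dapply <- function(df, f, schema) {
  dapply(df, function(block) {
    parts <- split(block, block$city)     # group rows within the block
    do.call(rbind, lapply(parts, f))      # apply f per group, recombine
  }, schema)
}
{code}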


was (Author: indrajit):
Here are some suggestions on the proposed API. If the idea is to keep the API 
close to R's current primitives, we should avoid introducing too many new 
keywords. E.g., dapplyCollect can be expressed as collect(dapply(...)). Since 
collect already exists in Spark, and R users are comfortable with the syntax 
as part of dplyr, we should reuse the keyword instead of introducing a new 
function dapplyCollect. Relying on existing syntax will reduce the learning 
curve for users. Was performance the primary intent of introducing 
dapplyCollect instead of collect(dapply(...))?

Similarly, can we do away with gapply and gapplyCollect, and express it using 
dapply? In R, the function split provides grouping 
(https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One 
should be able to implement split using GroupBy in Spark.
gapply can then be expressed in terms of dapply and split, and gapplyCollect 
will become collect(dapply(..split..)). 
Here is a simple example that uses split and lapply in R:

df <- data.frame(city=c("A","B","A","D"), age=c(10,12,23,5))
print(df)
s <- split(df$age, df$city)
lapply(s, mean)



