[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095721#comment-15095721 ]

Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:57 AM:
------------------------------------------------------------

[~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't think the UDF should depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented.

was (Author: rxin):
[~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't want the UDF to depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented.


> DataFrame UDFs in R
> -------------------
>
>                 Key: SPARK-6817
>                 URL: https://issues.apache.org/jira/browse/SPARK-6817
>             Project: Spark
>          Issue Type: New Feature
>          Components: SparkR, SQL
>            Reporter: Shivaram Venkataraman
>         Attachments: SparkR UDF Design Documentation v1.pdf
>
>
> This depends on some internal interface of Spark SQL, should be done after
> merging into Spark.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095721#comment-15095721 ]

Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:58 AM:
------------------------------------------------------------

[~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't think the UDF should depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented. In order to support the row-oriented API efficiently, we'd need to replicate all the infrastructure built for Python. I don't think that is maintainable in the long run.

was (Author: rxin):
[~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't think the UDF should depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented.
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095734#comment-15095734 ]

Jeff Zhang edited comment on SPARK-6817 at 1/13/16 7:09 AM:
-----------------------------------------------------------

+1 on the block-based API. A UDF would usually call other R packages, and most R packages are block-based (they operate on R's data.frame), which leads to a performance gain.

was (Author: zjffdu):
+1 on block based API, UDF would usually call other R packages and most of R packages are for block based (R's dataframe), and this lead performance gain.
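Jeff's point can be sketched in plain base R. The sketch below is purely illustrative (the name `block_udf` and the standardisation logic are hypothetical, not part of any proposed SparkR API): a block UDF receives a whole data.frame and can hand entire columns to vectorised functions in one call.

```r
# Hypothetical block UDF: takes a whole data.frame (one block/partition)
# and returns a data.frame; vectorised base-R functions do all the work.
block_udf <- function(block) {
  # standardise a numeric column in one vectorised pass, no per-row calls
  block$age_z <- (block$age - mean(block$age)) / sd(block$age)
  block
}

block <- data.frame(city = c("A", "B", "A", "D"), age = c(10, 12, 23, 5))
out <- block_udf(block)
print(out)
```

A row-based UDF would invoke the R interpreter once per row for the same result; the block form invokes it once per partition.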
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095721#comment-15095721 ]

Reynold Xin edited comment on SPARK-6817 at 1/13/16 6:57 AM:
------------------------------------------------------------

[~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes more sense. I also don't want the UDF to depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented.

was (Author: rxin):
[~sunrui] Why are you focusing on a row-based API? I think a block oriented API in the original Google Docs makes a lot more sense. I also don't want the UDF to depend on RRDD, because we are going to remove RRDD from Spark once the UDFs are implemented.
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15095776#comment-15095776 ]

Antonio Piccolboni edited comment on SPARK-6817 at 1/13/16 7:41 AM:
-------------------------------------------------------------------

My question made sense only wrt the block or vectorized design. If you are implementing plain-vanilla UDFs in R, my question is meaningless. The performance implications of calling an R function for each row are ominous, so I am not sure why you are going down this path. Imagine you want to add a column with random numbers from a distribution. You can use a regular UDF on each row or a block UDF on a block of a million rows. That means a single R call vs a million:

{code}
> system.time(rnorm(10^6))
   user  system elapsed
  0.089   0.002   0.092
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed
  4.272   0.317   4.588
{code}

That's about 45 times slower. Plus R is chock-full of vectorized functions. There are no builtin scalar types in R. So there are plenty of examples of block UDFs that one can write in R efficiently (no interpreter loops of any sort).

was (Author: piccolbo):
My question made sense only wrt the block or vectorized design. If you are implementing plain-vanilla UDFs in R, my questions is meaningless. The performance implications of calling an R function for each row are ominous so I am not sure why you are going down this path. Imagine you want to add a column with random numbers from a distribution. You can use a regular UDF on each row or a block UDF on a block of a million rows. That means a single R call vs a million.

{code}
> system.time(rnorm(10^6))
   user  system elapsed
  0.089   0.002   0.092
> z = rep_len(1, 10^6); system.time(sapply(z, rnorm))
   user  system elapsed
  4.272   0.317   4.588
{code}

That's 45 times slower. Plus R is choke full of vectorized functions. There are no builtin scalar types in R. So there are plenty of examples of block UDF that one can write in R efficiently (no interpreter loops of any sort).
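Antonio's closing claim — that R has no builtin scalar types and that vectorised base functions make interpreter loops unnecessary — can be checked directly with a small base-R sketch (timings are omitted since they vary by machine; this only verifies the semantics):

```r
# A "scalar" in R is just a length-1 vector.
x <- 1
stopifnot(is.vector(x), length(x) == 1)

# One vectorised call replaces an explicit per-element loop
# and produces the same values:
v <- 1:10
vectorised <- sqrt(v)                     # single call, whole vector
looped <- vapply(v, sqrt, numeric(1))     # one interpreter call per element
stopifnot(identical(vectorised, looped))
```

The per-element version pays one R function-call overhead per value, which is exactly the cost Antonio's rnorm benchmark measures at scale.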
[jira] [Comment Edited] (SPARK-6817) DataFrame UDFs in R
[ https://issues.apache.org/jira/browse/SPARK-6817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14717079#comment-14717079 ]

Reynold Xin edited comment on SPARK-6817 at 8/30/15 1:14 AM:
------------------------------------------------------------

Here are some suggestions on the proposed API. If the idea is to keep the API close to R's current primitives, we should avoid introducing too many new keywords. E.g., dapplyCollect can be expressed as collect(dapply(...)). Since collect already exists in Spark, and R users are comfortable with the syntax as part of dplyr, we should reuse the keyword instead of introducing a new function dapplyCollect. Relying on existing syntax will reduce the learning curve for users. Was performance the primary intent to introduce dapplyCollect instead of collect(dapply(...))?

Similarly, can we do away with gapply and gapplyCollect, and express it using dapply? In R, the function split provides grouping (https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One should be able to implement split using GroupBy in Spark. gapply can then be expressed in terms of dapply and split, and gapplyCollect will become collect(dapply(..split..)). Here is a simple example that uses split and lapply in R:

{code}
df <- data.frame(city = c("A", "B", "A", "D"), age = c(10, 12, 23, 5))
print(df)
s <- split(df$age, df$city)
lapply(s, mean)
{code}

was (Author: indrajit):
Here are some suggestions on the proposed API. If the idea is to keep the API close to R's current primitives, we should avoid introducing too many new keywords. E.g., dapplyCollect can be expressed as collect(dapply(...)). Since collect already exists in Spark, and R users are comfortable with the syntax as part of dplyr, we shoud reuse the keyword instead of introducing a new function dapplyCollect. Relying on existing syntax will reduce the learning curve for users. Was performance the primary intent to introduce dapplyCollect instead of collect(dapply(...))?

Similarly, can we do away with gapply and gapplyCollect, and express it using dapply? In R, the function split provides grouping (https://stat.ethz.ch/R-manual/R-devel/library/base/html/split.html). One should be able to implement split using GroupBy in Spark. gapply can then be expressed in terms of dapply and split, and gapplyCollect will become collect(dapply(..split..)). Here is a simple example that uses split and lapply in R:

df <- data.frame(city = c("A", "B", "A", "D"), age = c(10, 12, 23, 5))
print(df)
s <- split(df$age, df$city)
lapply(s, mean)
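The equivalence sketched in that comment — grouped apply as plain apply over split-defined groups — can be demonstrated end to end in base R (no SparkR involved; `vapply` stands in for the distributed apply, and the expected group means follow directly from the data in the example):

```r
df <- data.frame(city = c("A", "B", "A", "D"), age = c(10, 12, 23, 5))

# split() plays the role of GroupBy: one vector of ages per city
grouped <- split(df$age, df$city)

# applying a function per group is then an ordinary apply
means <- vapply(grouped, mean, numeric(1))
print(means)  # A = 16.5, B = 12, D = 5
```

This is why gapply is expressible as dapply over split output: grouping and per-group application are independent steps, so only the grouping needs Spark's GroupBy machinery.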