[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-28 Thread Narine Kokhlikyan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15353142#comment-15353142 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 6/28/16 3:03 PM:


Thank you [~timhunter] for sharing this information with us.
It is a nice idea. I think it could be seen as an extension of the current 
gapply implementation.

In general, whether the keys are useful depends on the use case. Most likely 
the user would want to see the matching key for each group's output, so it 
would make sense to attach/append the keys by default.
If the user doesn't need the keys, he or she can easily drop those columns.
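
To make this concrete, here is a minimal sketch of what attaching the keys by 
default could look like, assuming the Spark 2.0 SparkR API; the data, column 
names, and schema are made up for illustration:

{code}
library(SparkR)
sparkR.session()

# Toy data: an integer grouping key and a numeric value column.
df <- createDataFrame(data.frame(key = c(1L, 1L, 2L), value = c(10, 20, 30)))

# The declared schema must match the data.frame returned by the R function;
# here the grouping key is bound into every output row by hand.
schema <- structType(structField("key", "integer"),
                     structField("total", "double"))

result <- gapply(df, "key",
                 function(key, x) {
                   data.frame(key, total = sum(x$value))
                 },
                 schema)

# If the key column is not wanted, it is easy to drop afterwards:
head(select(result, "total"))
{code}

Attaching the keys by default would simply spare users the data.frame(key, ...) 
step inside the function.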



> Implement gapply() on DataFrame in SparkR
> -----------------------------------------
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function to each group of a DataFrame, grouped by one 
> or more columns, and returns a DataFrame. It is similar to 
> GroupedDataset.flatMapGroups() in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: the grouping key value and a local data.frame of the 
> grouped data.
> R function output: a local data.frame.
> The schema specifies the Row format of the R function's output and must 
> match the actual output.
> Note that map-side combination (partial aggregation) is not supported; the 
> user can do map-side combination via dapply().
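
As a rough illustration of the dapply() note above, the following sketch (with 
made-up data and column names, assuming the Spark 2.0 SparkR API) 
pre-aggregates inside each partition and then combines the partial results 
with an ordinary grouped aggregation:

{code}
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(key = c(1L, 1L, 2L), value = c(10, 20, 30)))

# Hand-rolled map-side combine: dapply() runs the function once per partition,
# so each partition's rows are pre-summed before any shuffle happens.
partial <- dapply(df,
                  function(pdf) {
                    # pdf holds one partition as a local data.frame.
                    if (nrow(pdf) == 0) {
                      return(data.frame(key = integer(0), value = double(0)))
                    }
                    aggregate(value ~ key, data = pdf, FUN = sum)
                  },
                  structType(structField("key", "integer"),
                             structField("value", "double")))

# Reduce side: sum the per-partition partial sums for each key.
totals <- agg(groupBy(partial, "key"), value = "sum")
head(totals)
{code}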






[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-15 Thread Narine Kokhlikyan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15333125#comment-15333125 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 6/16/16 5:25 AM:


FYI, [~olarayej], [~aloknsingh], [~vijayrb] :)









[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-29 Thread Narine Kokhlikyan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264786#comment-15264786 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 4/29/16 10:01 PM:

I think that it is better to use TypedColumns.

Something similar to:
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/KeyValueGroupedDataset.scala#L264
I don't think there is support for typed columns in SparkR, is there?

In that case we could create an encoder similar to:
ExpressionEncoder.tuple(ExpressionEncoder[String], ExpressionEncoder[Int], 
ExpressionEncoder[Double])

Is there a way to access the mapping between Spark SQL types and Scala types?
For example:
IntegerType (Spark) -> Int (Scala)

Thank you!













[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-04-10 Thread Narine Kokhlikyan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15233886#comment-15233886 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 4/10/16 7:23 AM:


Hi [~sunrui],

I have a question regarding your suggestion about adding a new 
"GroupedData.flatMapRGroups" function according to the following document:
https://docs.google.com/presentation/d/1oj17N5JaE8JDjT2as_DUI6LKutLcEHNZB29HsRGL_dM/edit#slide=id.p9

It seems that some changes have happened in Spark SQL. As of 1.6.1 there was 
a Scala class:
https://github.com/apache/spark/blob/v1.6.1/sql/core/src/main/scala/org/apache/spark/sql/GroupedData.scala

It doesn't seem to exist in 2.0.0.

I was thinking of adding the flatMapRGroups helper function to 
org.apache.spark.sql.KeyValueGroupedDataset or 
org.apache.spark.sql.RelationalGroupedDataset. What do you think?

Thank you,
Narine











[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-02-24 Thread Narine Kokhlikyan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15163598#comment-15163598 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 2/24/16 7:48 PM:


Hi [~sunrui],

I looked at the implementation proposal and it looks good to me. But I think 
it would be good to add some details about the aggregation of the data frames 
that we receive from the workers.

I've tried to draw a diagram for the group-apply example in order to 
understand the bigger picture:
https://docs.google.com/document/d/1z-sghU8wYKW-oNOajzFH02X0CP9Vd67cuJ085e93vZ8/edit
Please let me know if I've misunderstood anything.

Thanks,
Narine











[jira] [Comment Edited] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-02-22 Thread Narine Kokhlikyan (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15157373#comment-15157373 ]

Narine Kokhlikyan edited comment on SPARK-12922 at 2/22/16 5:47 PM:


Thanks for creating this JIRA, [~sunrui].
Have you already started to work on this? It most probably depends on 
[https://issues.apache.org/jira/browse/SPARK-12792].
We need this as soon as possible, and I might start working on it.
Do you have an estimate of how long it will take to get 
[https://issues.apache.org/jira/browse/SPARK-12792] reviewed?

cc: [~shivaram]

Thanks,
Narine





