[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...

2016-09-25 Thread citoubest
Github user citoubest commented on the issue:

https://github.com/apache/spark/pull/15135
  
@davies, what do you think about this patch? Can you give me some advice? 
Thanks


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...

2016-09-20 Thread citoubest
Github user citoubest commented on the issue:

https://github.com/apache/spark/pull/15135
  
With pandas, the functions themselves are passed to agg here, not strings (function names):
In [13]: df
Out[13]: 
          a         b         c  d
0  0.068300  0.263883  0.237335  1
1  0.226992  0.573966  0.954791  2
2  0.907550  0.930591  0.886454  1
3  0.178581  0.440734  0.414763  2

In [14]: df.groupby('d').agg([max,min])
Out[14]: 
          a                   b                   c
        max       min       max       min       max       min
d
1  0.907550  0.068300  0.930591  0.263883  0.886454  0.237335
2  0.226992  0.178581  0.573966  0.440734  0.954791  0.414763
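
To make the semantics of the call above concrete, here is a minimal pure-Python sketch of what grouping by d and applying a list of callables to every other column amounts to. The helper name group_agg and the dict-based result layout are assumptions for illustration only; this is not how pandas is implemented.

```python
from collections import defaultdict

# The same four rows as the DataFrame shown above.
rows = [
    {"a": 0.068300, "b": 0.263883, "c": 0.237335, "d": 1},
    {"a": 0.226992, "b": 0.573966, "c": 0.954791, "d": 2},
    {"a": 0.907550, "b": 0.930591, "c": 0.886454, "d": 1},
    {"a": 0.178581, "b": 0.440734, "c": 0.414763, "d": 2},
]


def group_agg(rows, key, funcs):
    """Hypothetical helper: apply every function in funcs to every non-key column, per group."""
    # groups[key value][column name] -> list of values in that group
    groups = defaultdict(lambda: defaultdict(list))
    for row in rows:
        for col, value in row.items():
            if col != key:
                groups[row[key]][col].append(value)
    # One entry per (column, function name) pair, mirroring the two-level header above.
    return {
        k: {(col, f.__name__): f(values) for col, values in cols.items() for f in funcs}
        for k, cols in groups.items()
    }


result = group_agg(rows, "d", [max, min])
print(result[1][("a", "max")], result[1][("a", "min")])
```

The functions are passed as objects (max, min), and their names are only used to label the output columns.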






[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...

2016-09-20 Thread citoubest
Github user citoubest commented on the issue:

https://github.com/apache/spark/pull/15135
  
OK. Because pandas DataFrames support this way of calling agg, I assumed 
Spark DataFrames should support it too, but they do not, so I tried to add 
it with this patch. If you think this patch is not necessary, I will close 
this request later. @rxin





[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...

2016-09-19 Thread citoubest
Github user citoubest commented on the issue:

https://github.com/apache/spark/pull/15135
  
  @rxin @davies @srowen 





[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...

2016-09-18 Thread citoubest
Github user citoubest commented on the issue:

https://github.com/apache/spark/pull/15135
  
@petermaxlee 
In my opinion, a list comprehension can shorten the code to some extent, 
but it would be better if the agg method supported this shorthand at the API level.





[GitHub] spark pull request #15135: [pyspark][group]pyspark GroupedData can't apply a...

2016-09-18 Thread citoubest
GitHub user citoubest opened a pull request:

https://github.com/apache/spark/pull/15135

[pyspark][group]pyspark GroupedData can't apply agg functions on all 
remaining numeric columns.

## What changes were proposed in this pull request?
With a PySpark DataFrame, the agg method supports only two calling styles: 
passing a dict that maps column names to aggregate function names, or passing 
aggregate expressions built with the functions in pyspark.sql.functions for 
specific columns. Both styles require naming a function for each specific 
column, so to apply several aggregates to all the other numeric columns I have 
to list every function-column combination. For example, suppose df has the 
columns province, age and income, and I want to group by province and 
calculate the min, max and average of age and income. Before this change I 
have two approaches:

df.groupby('province').agg({'age':'max','age':'min','age':'avg','income':'max','income':'min','income':'avg'})

df.groupby('province').agg(F.min('age'),F.max('age'),F.avg('age'),F.min('income'),F.max('income'),F.avg('income'))

Both are redundant, and the dict form does not even work as intended: a Python 
dict literal keeps only the last value for a repeated key, which leaves just 
'avg' for each column.

With this change, we can simply replace the code with 
df.groupby('province').agg('max','min','avg')
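
The function-column combinations that the shorthand stands for can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name agg_pairs is hypothetical and is not part of PySpark or of this patch.

```python
from itertools import product


def agg_pairs(columns, funcs):
    """Hypothetical helper: every (function name, column name) combination, grouped by column."""
    return [(func, col) for col, func in product(columns, funcs)]


# The example from the description: min, max and avg over age and income.
pairs = agg_pairs(["age", "income"], ["min", "max", "avg"])
for func, col in pairs:
    print(f"{func}({col})")
```

Assuming the usual `from pyspark.sql import functions as F` import, the same pairs could be turned into column expressions without the patch, for example `df.groupby('province').agg(*[getattr(F, func)(col) for func, col in pairs])`, which is the list-comprehension workaround discussed in this thread.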

## How was this patch tested?
manual tests


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/citoubest/spark master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/15135.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #15135


commit 67e75a201bd85058e2b5a843b061e00cb3856087
Author: citoubest <1206539...@qq.com>
Date:   2016-09-18T05:37:40Z

pyspark dataframe agg not support multiple functions for all columns, 
change group.agg to support df.groupby(name).agg(max,min) like pandas

commit 7407bc84b0376c78a7816de349a607cecac1d6f4
Author: citoubest <1206539...@qq.com>
Date:   2016-09-18T05:39:25Z

add comment for last change



