GitHub user citoubest opened a pull request:
https://github.com/apache/spark/pull/15135
[pyspark][group]pyspark GroupedData can't apply agg functions on all left
numeric columns.
## What changes were proposed in this pull request?
With a PySpark DataFrame, the agg method supports only two forms: a dict that
maps column names to aggregate function names, or aggregate functions from the
pyspark.sql.functions package applied to specific column names. Both forms
require us to assign a function to each named column, so applying several
aggregates to all remaining numeric columns means listing every
function-column combination. For example, suppose df has three columns,
province, age, and income, and I want to group by province and calculate the
min, max, and average of age and income. Before this change the options are:
df.groupby('province').agg({'age':'max','age':'min','age':'avg','income':'max','income':'min','income':'avg'})
df.groupby('province').agg(F.min('age'),F.max('age'),F.avg('age'),F.min('income'),F.max('income'),F.avg('income'))
Both are redundant. The dict form also cannot really express this case:
duplicate keys in a Python dict literal collapse to the last value, so only
avg would be computed for each column.
With this change, the code can simply be replaced with
df.groupby('province').agg('max','min','avg')
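For comparison, the same result can be reached with the existing API by generating the expressions in a loop instead of writing them out. The following is only a minimal sketch under stated assumptions, not the code in this patch: the helper names expand_aggs and agg_all and the output column naming (max_age and so on) are hypothetical.

```python
def expand_aggs(columns, funcs):
    """Build one SQL expression string per (column, function) pair.

    Hypothetical helper: the `func_column` alias scheme is an assumption
    made for this sketch, not something PySpark produces by default.
    """
    return [
        "{f}(`{c}`) AS `{f}_{c}`".format(f=f, c=c)
        for c in columns
        for f in funcs
    ]


def agg_all(df, key, funcs):
    """Apply every function in `funcs` to all numeric columns except `key`.

    Hypothetical helper built only on existing PySpark calls.
    """
    # Imported lazily so expand_aggs can be used without Spark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import NumericType

    # Pick the numeric columns from the schema, skipping the grouping key.
    numeric = [
        field.name
        for field in df.schema.fields
        if isinstance(field.dataType, NumericType) and field.name != key
    ]
    exprs = [F.expr(e) for e in expand_aggs(numeric, funcs)]
    return df.groupBy(key).agg(*exprs)
```

With the example DataFrame above, agg_all(df, 'province', ['max', 'min', 'avg']) would group by province and produce six aggregate columns, which is the behavior the proposed agg('max','min','avg') call is meant to provide directly.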
## How was this patch tested?
manual tests
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/citoubest/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15135.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15135
commit 67e75a201bd85058e2b5a843b061e00cb3856087
Author: citoubest <1206539...@qq.com>
Date: 2016-09-18T05:37:40Z
pyspark dataframe agg not support multiple functions for all columns,
change group.agg to support df.groupby(name).agg(max,min) like pandas
commit 7407bc84b0376c78a7816de349a607cecac1d6f4
Author: citoubest <1206539...@qq.com>
Date: 2016-09-18T05:39:25Z
add comment for last change