[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...
Github user citoubest commented on the issue: https://github.com/apache/spark/pull/15135 @davies, what do you think about this patch? Can you give me some advice? Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...
Github user citoubest commented on the issue: https://github.com/apache/spark/pull/15135

With pandas, the argument to agg is the function itself, not a string (the function name).

    In [13]: df
    Out[13]:
              a         b         c  d
    0  0.068300  0.263883  0.237335  1
    1  0.226992  0.573966  0.954791  2
    2  0.907550  0.930591  0.886454  1
    3  0.178581  0.440734  0.414763  2

    In [14]: df.groupby('d').agg([max,min])
    Out[14]:
              a                   b                   c
            max       min       max       min       max       min
    d
    1  0.907550  0.068300  0.930591  0.263883  0.886454  0.237335
    2  0.226992  0.178581  0.573966  0.440734  0.954791  0.414763
[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...
Github user citoubest commented on the issue: https://github.com/apache/spark/pull/15135 OK. Because the pandas DataFrame supports this approach to agg, I assumed the Spark DataFrame should support it too, but it does not, so I wrote this patch. If you think this patch is not necessary, I will close this request later. @rxin
[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...
Github user citoubest commented on the issue: https://github.com/apache/spark/pull/15135 @rxin @davies @srowen
[GitHub] spark issue #15135: [pyspark][group]pyspark GroupedData can't apply agg func...
Github user citoubest commented on the issue: https://github.com/apache/spark/pull/15135 @petermaxlee In my opinion, a list comprehension reduces the code length only to some extent. It would be better if the agg method supported the simpler form at the API level.
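For reference, the list-comprehension workaround mentioned here might look like the sketch below. The helper names `agg_pairs` and `agg_all` and the `fn_col` alias scheme are hypothetical and are not part of the patch or of PySpark; the sketch assumes the aggregate names passed in (`min`, `max`, `avg`) exist as functions in `pyspark.sql.functions`.

```python
def agg_pairs(columns, functions):
    """Enumerate every (function name, column name) combination."""
    return [(fn, col) for col in columns for fn in functions]


def agg_all(df, key, columns, functions):
    """Hypothetical helper: apply each named aggregate to each column.

    pyspark is imported lazily so that agg_pairs can be used without Spark.
    """
    from pyspark.sql import functions as F

    exprs = [
        getattr(F, fn)(col).alias(f"{fn}_{col}")  # e.g. F.max("age") as max_age
        for fn, col in agg_pairs(columns, functions)
    ]
    return df.groupBy(key).agg(*exprs)


pairs = agg_pairs(["age", "income"], ["min", "max", "avg"])
print(pairs)
```

With a DataFrame `df` that has `province`, `age`, and `income` columns, the call would be `agg_all(df, "province", ["age", "income"], ["min", "max", "avg"])`. The caller still has to name the columns, which is the verbosity the comment objects to.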
[GitHub] spark pull request #15135: [pyspark][group]pyspark GroupedData can't apply a...
GitHub user citoubest opened a pull request: https://github.com/apache/spark/pull/15135

[pyspark][group]pyspark GroupedData can't apply agg functions on all left numeric columns.

## What changes were proposed in this pull request?

With a PySpark DataFrame, the agg method supports only two forms: one takes a map from column names to aggregate function names, and the other takes aggregate functions from the functions package applied to specific column names. Both forms require us to assign a function to each column by name. So if I want to apply the aggregate functions to all the other numeric columns, I have to list every function-column combination.

For example, suppose df has three columns, province, age, and income, and I want to group by province and calculate the min, max, and average values of age and income. Before this change there are two approaches:

df.groupby('province').agg({'age':'max','age':'min','age':'avg','income':'max','income':'min','income':'avg'})

df.groupby('province').agg(F.min('age'),F.max('age'),F.avg('age'),F.min('income'),F.max('income'),F.avg('income'))

Both are redundant. With this change, we can simply replace the code with:

df.groupby('province').agg('max','min','avg')

## How was this patch tested?

manual tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/citoubest/spark master

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/15135.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #15135

commit 67e75a201bd85058e2b5a843b061e00cb3856087
Author: citoubest <1206539...@qq.com>
Date: 2016-09-18T05:37:40Z
pyspark dataframe agg not support multiple functions for all columns, change group.agg to support df.groupby(name).agg(max,min) like pandas

commit 7407bc84b0376c78a7816de349a607cecac1d6f4
Author: citoubest <1206539...@qq.com>
Date: 2016-09-18T05:39:25Z
add comment for last change
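One caveat about the dict form in the description above: a Python dict literal keeps only the last value for a repeated key, so a spec written that way cannot carry three functions per column. The sketch below uses plain Python and no Spark; the variable names `spec` and `wanted` are illustrative only.

```python
# The column-to-function map from the description, written out literally.
# Repeated keys are legal in a dict literal, but the last value wins.
spec = {
    "age": "max", "age": "min", "age": "avg",
    "income": "max", "income": "min", "income": "avg",
}
print(spec)  # {'age': 'avg', 'income': 'avg'}

# A list of (column, function) pairs keeps all six combinations.
wanted = [(col, fn) for col in ("age", "income") for fn in ("max", "min", "avg")]
print(len(wanted))  # 6
```

So in practice the map form gives one aggregate per column, and listing column expressions is the only existing form that can express every combination.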