[
https://issues.apache.org/jira/browse/SPARK-3312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14341727#comment-14341727
]
Sean Owen commented on SPARK-3312:
--
Interesting, is the reduce / max / min in question here by key? We already have
the {{stats()}} method for RDDs of {{Double}} to take care of this for a whole
RDD. Rather than add an API method for the by-key case, it's possible to use
{{StatCounter}} to compute all of these at once over the values collected for
each key. Does that do the trick, or is this something more?
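The idea behind {{StatCounter}} is a mergeable one-pass summary: count, sum, min, and max are all updated as each value streams by, so no separate scan per statistic is needed. A minimal pure-Python analogue of that pattern applied by key (this is an illustrative sketch, not Spark's actual {{StatCounter}} API):

```python
import math
from dataclasses import dataclass


@dataclass
class Stats:
    # Mergeable one-pass summary, loosely modeled on Spark's StatCounter.
    count: int = 0
    total: float = 0.0
    lo: float = math.inf
    hi: float = -math.inf

    def merge(self, x: float) -> "Stats":
        # Fold a single value into the running summary.
        self.count += 1
        self.total += x
        self.lo = min(self.lo, x)
        self.hi = max(self.hi, x)
        return self

    @property
    def mean(self) -> float:
        return self.total / self.count


def stats_by_key(pairs):
    # One pass over (key, value) pairs; every statistic comes out at
    # once, so no second scan is needed for min or max.
    out = {}
    for k, v in pairs:
        out.setdefault(k, Stats()).merge(v)
    return out
```

In Spark itself the same shape would be expressed with a combiner-style by-key aggregation, merging per-partition summaries rather than a plain dict.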
Add a groupByKey which returns a special GroupBy object like in pandas
--
Key: SPARK-3312
URL: https://issues.apache.org/jira/browse/SPARK-3312
Project: Spark
Issue Type: Improvement
Components: Spark Core
Reporter: holdenk
Priority: Minor
A common pattern that causes problems for new Spark users is calling
groupByKey followed by a reduce. I'd like to make a special version of
groupByKey that returns a GroupBy object (like the pandas groupby object).
The resulting class would have a number of functions (min, max, stats, reduce)
that could all be implemented efficiently.
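The proposed object might look roughly like the following pure-Python sketch; the class and method names here are hypothetical illustrations of the idea, not the actual proposal, and a real Spark version would push each aggregation down to an efficient combiner rather than materializing the grouped values:

```python
from functools import reduce as _reduce


class GroupedValues:
    # Hypothetical stand-in for the proposed "GroupBy object": holds
    # values per key and exposes aggregations that, in Spark, could be
    # implemented with combiners instead of a full groupByKey.
    def __init__(self, pairs):
        self._groups = {}
        for k, v in pairs:
            self._groups.setdefault(k, []).append(v)

    def min(self):
        return {k: min(vs) for k, vs in self._groups.items()}

    def max(self):
        return {k: max(vs) for k, vs in self._groups.items()}

    def reduce(self, fn):
        # Per-key reduction with a user-supplied binary function.
        return {k: _reduce(fn, vs) for k, vs in self._groups.items()}
```

For example, `GroupedValues([("a", 1), ("a", 3), ("b", 2)]).reduce(lambda x, y: x + y)` yields `{"a": 4, "b": 2}`.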
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org