Re: countDistinct, partial aggregates and Spark 2.0

Lee Becker Fri, 12 Aug 2016 11:14:38 -0700

On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker <lee.bec...@hapara.com> wrote:


> val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c",
> "a"))).toDF("x", "y")
> val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))
>

This workaround executes with no exceptions:
val grouped = df.groupBy($"x").agg(size(collect_set($"y")),
collect_set($"y"))

In this example countDistinct and collect_set run on the same column, so
the result of countDistinct is redundant. If they ran on different columns
(say there were also a column 'z'), would there be any computational
difference between countDistinct and size(collect_set(...))?
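
One way to look at this empirically is to run both forms separately on a
second column and compare the results and the physical plans. The sketch
below is only an illustration: the column 'z', the sample rows, the local
SparkSession, and the names byCountDistinct and bySetSize are all
assumptions, not anything from the original job.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, countDistinct, size}

// Local session for illustration only; in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("countDistinct-vs-collect_set")
  .getOrCreate()
import spark.implicits._

// Hypothetical data: the "x" and "y" columns plus an assumed extra column "z".
val df = Seq(
  ("a", "a", 1), ("a", "b", 1),
  ("b", "c", 2),
  ("c", "a", 3), ("c", "a", 4)
).toDF("x", "y", "z")

// Each form runs in its own aggregation, so the two are never mixed in one agg().
val byCountDistinct = df.groupBy($"x").agg(countDistinct($"z").as("n"))
val bySetSize       = df.groupBy($"x").agg(size(collect_set($"z")).as("n"))

// countDistinct yields a Long and size(...) an Int, so normalize before comparing.
val countsA = byCountDistinct.collect().map(r => r.getString(0) -> r.getLong(1)).toMap
val countsB = bySetSize.collect().map(r => r.getString(0) -> r.getInt(1).toLong).toMap

// Print the physical plans to see how each aggregation is actually executed.
byCountDistinct.explain()
bySetSize.explain()
```

On non-null data like this the two count maps should agree, so any
difference would show up in the plans printed by explain() rather than in
the results.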

-- 
*hapara* ● Making Learning Visible
1877 Broadway Street, Boulder, CO 80302
(Google Voice): +1 720 335 5332
www.hapara.com   Twitter: @hapara_team <http://twitter.com/hapara_team>
