On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker <lee.bec...@hapara.com> wrote:

> val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c",
> "a"))).toDF("x", "y")
> val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))
>

This workaround executes with no exceptions:
val grouped = df.groupBy($"x").agg(size(collect_set($"y")),
collect_set($"y"))

In this example countDistinct and collect_set are applied to the same
column, so the result of countDistinct is essentially redundant.
Assuming they were applied to different columns (say there were also a
column 'z'), is there any computational difference between countDistinct
and size(collect_set(...))?
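
One way to look into this is to build the two aggregations separately and
compare their physical plans. The following is only a minimal sketch: it
assumes Spark 2.0 or later (for SparkSession), and the column 'z', its
values, and the object name are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, countDistinct, size}

// Hypothetical driver object, used only for this illustration.
object CompareDistinctCounts {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("compare-distinct-counts")
      .getOrCreate()
    import spark.implicits._

    // Same rows as the quoted example, plus an invented column "z".
    val df = Seq(("a", "a", "p"), ("b", "c", "q"), ("c", "a", "p"))
      .toDF("x", "y", "z")

    // Distinct count of "z" per group, expressed both ways.
    val viaCountDistinct = df.groupBy($"x").agg(countDistinct($"z"))
    val viaSetSize = df.groupBy($"x").agg(size(collect_set($"z")))

    // Print the physical plans so they can be compared side by side.
    viaCountDistinct.explain()
    viaSetSize.explain()

    spark.stop()
  }
}
```

Each aggregation is kept in its own agg() call so that the sketch does not
depend on combining countDistinct with collect_set in a single call.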
