On Fri, Aug 12, 2016 at 11:55 AM, Lee Becker <lee.bec...@hapara.com> wrote:
> val df = sc.parallelize(Array(("a", "a"), ("b", "c"), ("c", "a"))).toDF("x", "y")
> val grouped = df.groupBy($"x").agg(countDistinct($"y"), collect_set($"y"))

This workaround executes with no exceptions:

val grouped = df.groupBy($"x").agg(size(collect_set($"y")), collect_set($"y"))

In this example, countDistinct and collect_set run on the same column, so the result of countDistinct is essentially redundant. Assuming they ran on different columns (say there was a column 'z' too), is there any computational difference between countDistinct and size(collect_set(...))?
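One way to explore the question is to run both forms against a second column and compare the results and the query plans. The following is a minimal sketch, not a statement about how Spark implements either aggregate. It assumes Spark 2.x with a local SparkSession; the object name DistinctCountSketch, the helper distinctCounts, the column 'z', and the sample rows are all hypothetical and chosen only for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{collect_set, countDistinct, size}

// Hypothetical helper object, not part of Spark.
object DistinctCountSketch {

  // Computes the per-group distinct count of column z in two ways
  // and returns both results as maps keyed by x.
  def distinctCounts(spark: SparkSession): (Map[String, Long], Map[String, Long]) = {
    import spark.implicits._

    // Assumed sample data: the thread's x and y columns plus an extra column z.
    val df = Seq(
      ("a", "a", "p"), ("a", "a", "q"), ("a", "b", "p"),
      ("b", "c", "p"), ("c", "a", "r")
    ).toDF("x", "y", "z")

    // Form 1: the built-in distinct count.
    val viaCountDistinct = df.groupBy($"x")
      .agg(countDistinct($"z").as("n"))
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
      .toMap

    // Form 2: collect the distinct values per group, then take the set's size.
    // size returns an integer, so it is cast to long to match Form 1.
    val viaSize = df.groupBy($"x")
      .agg(size(collect_set($"z")).cast("long").as("n"))
      .collect()
      .map(r => r.getString(0) -> r.getLong(1))
      .toMap

    (viaCountDistinct, viaSize)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-count-sketch")
      .getOrCreate()

    val (a, b) = distinctCounts(spark)
    println(s"countDistinct:     $a")
    println(s"size(collect_set): $b")

    spark.stop()
  }
}
```

Both forms skip nulls, so on a single column the counts should agree. Whether they differ computationally is easiest to judge by calling explain() on each aggregated DataFrame and comparing the physical plans for the Spark version in use. One difference that holds regardless of plan: collect_set has to materialize every distinct value for each group, which can matter for high-cardinality columns.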