[SparkSQL] Count Distinct issue

Daniele Foroni Fri, 14 Sep 2018 11:54:34 -0700

Hi all,

I am having trouble doing a count distinct over multiple columns.
This is an example of my data:
+----+----+----+---+
|a   |b   |c   |d  |
+----+----+----+---+
|null|null|null|1  |
|null|null|null|2  |
|null|null|null|3  |
|null|null|null|4  |
|null|null|null|5  |
|null|null|null|6  |
|null|null|null|7  |
+----+----+----+---+
And my code:
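import org.apache.spark.sql.{Column, Dataset, Row}
import org.apache.spark.sql.functions.{col, countDistinct}
val df: Dataset[Row] = …
// Turn every column name into a Column, then count distinct combinations over all of them
val cols: List[Column] = df.columns.map(col).toList
df.agg(countDistinct(cols.head, cols.tail: _*))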


So, in the example above, if I count the distinct “rows” I obtain 7 as the result,
as expected (since the “d” column changes for every row).
However, with more columns (16) in EXACTLY the same situation (one incremental 
column and 15 columns filled with nulls) the result is 0.
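
For reference, here is a minimal self-contained sketch that should set up both situations side by side. It rests on a few assumptions that are not part of my real job: a local SparkSession, null columns typed as strings, and made-up names (CountDistinctRepro, build, n1..n15). It also prints df.distinct().count() as a row-level distinct count for comparison.

```scala
import org.apache.spark.sql.{Column, DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, countDistinct, lit}

object CountDistinctRepro {
  // Hypothetical helper: `rows` rows with `nullCols` all-null string columns
  // (assumed type) followed by one incremental column "d" holding 1..rows.
  def build(spark: SparkSession, nullCols: Int, rows: Int): DataFrame = {
    val base = spark.range(1, rows + 1).toDF("d")
    val nulls = (1 to nullCols).map(i => lit(null).cast("string").as(s"n$i"))
    base.select(nulls :+ col("d"): _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[1]") // assumption: local run, just for the reproduction
      .appName("count-distinct-repro")
      .getOrCreate()

    // 3 null columns mirrors the small example, 15 mirrors the failing case
    for (n <- Seq(3, 15)) {
      val df = build(spark, n, 7)
      val cols: List[Column] = df.columns.map(col).toList
      // Same aggregation as in my code above
      val viaAgg = df.agg(countDistinct(cols.head, cols.tail: _*)).first().getLong(0)
      // Row-level distinct count, for comparison
      val viaDistinct = df.distinct().count()
      println(s"null columns=$n countDistinct=$viaAgg distinct().count()=$viaDistinct")
    }

    spark.stop()
  }
}
```

I have left the printed values out on purpose, since they may depend on the Spark version and on how the data is actually loaded.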

I don’t understand why I am experiencing this problem.
Any solution?

Thanks,
---
Daniele

