I pasted this straight from the Spark shell (Spark 2.1.0):

scala> spark.sql("SELECT count(distinct col) FROM Table").show()
+-------------------+
|count(DISTINCT col)|
+-------------------+
|               4697|
+-------------------+

scala> spark.sql("SELECT distinct col FROM Table").count()
res8: Long = 4698

That is, `dataframe.count()` on the `SELECT distinct` result returns one more than the in-query `COUNT(DISTINCT col)` function.

Any explanations?

Cheers,
Mohamed