I can't exactly reproduce this. Here is what I tried quickly:
import uuid
import findspark
findspark.init() # noqa
import pyspark
from pyspark.sql import functions as F # noqa: N812
spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [[str(uuid.uuid4())] for _ in range(100000)], ['col1'])
df.agg(F.countDistinct(F.col('col1'))).show()  # stable across runs here
Thanks Abdeali! Please find details below:
df.agg(countDistinct(col('col1'))).show() --> 450089
df.agg(countDistinct(col('col1'))).show() --> 450076
df.filter(col('col1').isNull()).count() --> 0
df.filter(col('col1').isNotNull()).count() --> 450063
col1 is a string
Spark version 2.4.0
datasize:
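One thing worth checking from these numbers: counts that differ run to run usually point at a nondeterministic input rather than at countDistinct itself. Spark DataFrames are lazy, so every action re-evaluates the lineage, and if anything upstream is nondeterministic (rand(), sampling, a source that changes between jobs), each action can see different rows. A pure-Python sketch of the effect (illustrative only, not Spark code; sizes are made up):

```python
import random

def nondeterministic_rows():
    # stands in for a DataFrame whose lineage is nondeterministic
    # (e.g. rand(), sampling, or a source changing between jobs):
    # every evaluation produces different rows
    return [random.randrange(500_000) for _ in range(450_000)]

def seeded_rows():
    # stands in for a cached/checkpointed DataFrame:
    # every evaluation sees the same rows
    rng = random.Random(42)
    return [rng.randrange(500_000) for _ in range(450_000)]

# each "action" re-evaluates the source, so distinct counts can differ:
a = len(set(nondeterministic_rows()))
b = len(set(nondeterministic_rows()))

# with a fixed source the count is stable on every evaluation:
assert len(set(seeded_rows())) == len(set(seeded_rows()))
```

If that is the cause, calling df.cache() (and materializing it with an action) or checkpointing before the aggregations should make the counts agree.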
How large is the data frame and what data type are you counting distinct
for?
I use count distinct quite a bit and haven't noticed anything peculiar.
Also, which exact version in 2.3.x?
And, are you performing any operations on the DF before the countDistinct?
I recall there was a bug when I did
Hi All,
Just wanted to check in to see if anyone has any insight about this
behavior. Any pointers would help.
Thanks,
Rishi
On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah wrote:
> Hi All,
>
> Recently we noticed that countDistinct on a larger dataframe doesn't
> always return the same value. Any idea? If this is the case then what is
> the difference between countDistinct & approx_count_distinct?
Hi All,
Recently we noticed that countDistinct on a larger dataframe doesn't always
return the same value. Any idea? If this is the case then what is the
difference between countDistinct & approx_count_distinct?
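On the second question: countDistinct computes the exact number of distinct values via a full distinct aggregation, while approx_count_distinct estimates it with a HyperLogLog++ sketch, trading a bounded relative error (the rsd parameter, 5% by default) for far less memory and shuffle. A rough pure-Python sketch of the HyperLogLog idea, heavily simplified and not Spark's actual implementation:

```python
import hashlib
import math

def hll_estimate(values, p=12):
    """Estimate a distinct count with a basic HyperLogLog sketch.

    p index bits give m = 2**p registers; typical error ~ 1.04/sqrt(m).
    """
    m = 1 << p
    registers = [0] * m
    for v in values:
        # 64-bit hash of the value
        h = int.from_bytes(hashlib.sha1(str(v).encode()).digest()[:8], "big")
        idx = h >> (64 - p)                       # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)          # remaining 64-p bits
        rank = (64 - p) - rest.bit_length() + 1   # leading zeros + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:                  # small-range correction
        raw = m * math.log(m / zeros)
    return int(raw)

vals = [str(i) for i in range(100_000)]
print(len(set(vals)), hll_estimate(vals))  # exact vs. approximate
```

The point being: the exact count should be deterministic on fixed data, while only the approximate one is allowed to be off by a few percent, so seeing countDistinct itself vary is surprising and suggests the input differs between runs.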
--
Regards,
Rishi Shah