Re: [pyspark 2.3+] CountDistinct

2019-07-01 Thread Abdeali Kothari
I can't exactly reproduce this. Here is what I tried quickly:

import uuid
import findspark
findspark.init()  # noqa
import pyspark
from pyspark.sql import functions as F  # noqa: N812

spark = pyspark.sql.SparkSession.builder.getOrCreate()
df = spark.createDataFrame([
    [str(uuid.uuid4()) for

Re: [pyspark 2.3+] CountDistinct

2019-06-29 Thread Rishi Shah
Thanks Abdeali! Please find details below:

df.agg(countDistinct(col('col1'))).show() --> 450089
df.agg(countDistinct(col('col1'))).show() --> 450076
df.filter(col('col1').isNull()).count() --> 0
df.filter(col('col1').isNotNull()).count() --> 450063

col1 is a string. Spark version 2.4.0. datasize:

Re: [pyspark 2.3+] CountDistinct

2019-06-29 Thread Abdeali Kothari
How large is the DataFrame, and what data type are you counting distinct values for? I use countDistinct quite a bit and haven't noticed anything peculiar. Also, which exact version in the 2.3.x line? And are you performing any operations on the DataFrame before the countDistinct? I recall there was a bug when I did

Re: [pyspark 2.3+] CountDistinct

2019-06-28 Thread Rishi Shah
Hi All, Just wanted to check in to see if anyone has any insight about this behavior. Any pointers would help. Thanks, Rishi On Fri, Jun 14, 2019 at 7:05 AM Rishi Shah wrote: > Hi All, > > Recently we noticed that countDistinct on a larger dataframe doesn't > always return the same value. Any

[pyspark 2.3+] CountDistinct

2019-06-14 Thread Rishi Shah
Hi All, Recently we noticed that countDistinct on a larger DataFrame doesn't always return the same value. Any idea? If this is expected behavior, then what is the difference between countDistinct & approx_count_distinct? -- Regards, Rishi Shah