Hey Diederik,

The data in rdd._jrdd.rdd() is serialized by pickle in batch mode by default, so the number of rows in it is much smaller than in rdd. For example:

>>> size = 100
>>> d = [i % size for i in range(1, 100000)]
>>> rdd = sc.parallelize(d)
>>> rdd.count()
99999
>>> rdd._jrdd.rdd().count()
98L
>>> rdd._jrdd.rdd().countApproxDistinct(4, 0)
29L
>>> rdd._jrdd.rdd().countApproxDistinct(8, 0)
24L
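Each JVM-side element here is one batch of pickled Python objects, so the Scala countApproxDistinct is estimating the number of distinct batches, not distinct rows. A minimal plain-Python sketch of that effect, assuming the default batch size of 1024 (the batch_size value is an assumption, not something stated in this thread):

import pickle

size = 100
d = [i % size for i in range(1, 100000)]            # 99999 rows, 100 distinct values
batch_size = 1024                                   # assumed PySpark default batch size
batches = [d[i:i + batch_size] for i in range(0, len(d), batch_size)]
jvm_elements = [pickle.dumps(b) for b in batches]   # roughly what the JVM-side RDD holds

print(len(d))              # 99999 Python rows
print(len(jvm_elements))   # 98 pickled batches, matching rdd._jrdd.rdd().count()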
In order to call countApproxDistinct() in Scala, you need to disable batch mode serialization:

>>> from pyspark.serializers import PickleSerializer
>>> sc.serializer = PickleSerializer()
>>> rdd = rdd.map(lambda x: x)  # change serializer
>>> rdd._jrdd.rdd().count()
99999L
>>> rdd._jrdd.rdd().countApproxDistinct(4, 0)
98L
>>> rdd._jrdd.rdd().countApproxDistinct(8, 0)
103L

Davies

On Tue, Jul 29, 2014 at 11:45 AM, Diederik <dvanli...@gmail.com> wrote:
> Heya,
>
> I would like to use countApproxDistinct in pyspark; I know that it's an
> experimental method and that it is not yet available in pyspark. I started
> by porting the countApproxDistinct unit test to Python, see
> https://gist.github.com/drdee/d68eaf0208184d72cbff. Surprisingly, the
> results are way off.
>
> Using Scala, I get the following two counts (using
> https://github.com/apache/spark/blob/4c7243e109c713bdfb87891748800109ffbaae07/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L78-87):
>
> scala> simpleRdd.countApproxDistinct(4, 0)
> res2: Long = 73
>
> scala> simpleRdd.countApproxDistinct(8, 0)
> res3: Long = 99
>
> In Python, with the same RDD as you can see in the gist, I get the following
> results:
>
> In [7]: rdd._jrdd.rdd().countApproxDistinct(4, 0)
> Out[7]: 29L
>
> In [8]: rdd._jrdd.rdd().countApproxDistinct(8, 0)
> Out[8]: 26L
>
> Clearly, I am doing something wrong here :) What is also weird is that when
> I set p to 8 I should get a more accurate number, but it's actually smaller.
> Any tips or pointers are much appreciated!
>
> Best,
> Diederik
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-countApproxDistinct-in-pyspark-tp10878.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
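The workaround above can be wrapped in a small helper. Note that count_approx_distinct is just an illustrative name (not an existing PySpark method), it relies on the private _jrdd attribute, and it sets sc.serializer globally, so treat it as a sketch of the snippet above rather than a supported API:

from pyspark.serializers import PickleSerializer

def count_approx_distinct(sc, rdd, p, sp=0):
    """Hypothetical helper: approximate distinct count via Scala's countApproxDistinct."""
    sc.serializer = PickleSerializer()   # disable batched pickling
    unbatched = rdd.map(lambda x: x)     # force re-serialization with the new serializer
    return unbatched._jrdd.rdd().countApproxDistinct(p, sp)

# e.g. count_approx_distinct(sc, sc.parallelize(d), 8) should land near the true 100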