Hey Diederik,

The data in rdd._jrdd.rdd() is serialized by pickle in batch mode by default,
so the number of rows in it is much smaller than in rdd. For example:

>>> size = 100
>>> d = [i%size for i in range(1, 100000)]
>>> rdd = sc.parallelize(d)
>>> rdd.count()
99999
>>> rdd._jrdd.rdd().count()
98L
>>> rdd._jrdd.rdd().countApproxDistinct(4,0)
29L
>>> rdd._jrdd.rdd().countApproxDistinct(8,0)
24L

In order to call countApproxDistinct() on the underlying Scala RDD, you need
to disable batched serialization:

>>> from pyspark.serializers import PickleSerializer
>>> sc.serializer = PickleSerializer()
>>> rdd = rdd.map(lambda x: x)  # force re-serialization with the new serializer
>>> rdd._jrdd.rdd().count()
99999L
>>> rdd._jrdd.rdd().countApproxDistinct(4, 0)
98L
>>> rdd._jrdd.rdd().countApproxDistinct(8, 0)
103L

Davies


On Tue, Jul 29, 2014 at 11:45 AM, Diederik <dvanli...@gmail.com> wrote:
> Heya,
>
> I would like to use countApproxDistinct in pyspark, I know that it's an
> experimental method and that it is not yet available in pyspark. I started
> with porting the countApproxDistinct unit-test to Python, see
> https://gist.github.com/drdee/d68eaf0208184d72cbff. Surprisingly, the
> results are way off.
>
> Using Scala, I get the following two counts (using
> https://github.com/apache/spark/blob/4c7243e109c713bdfb87891748800109ffbaae07/core/src/test/scala/org/apache/spark/rdd/RDDSuite.scala#L78-87):
>
> scala> simpleRdd.countApproxDistinct(4, 0)
> res2: Long = 73
>
> scala> simpleRdd.countApproxDistinct(8, 0)
> res3: Long = 99
>
> In Python, with the same RDD as you can see in the gist, I get the following
> results:
>
> In [7]: rdd._jrdd.rdd().countApproxDistinct(4, 0)
> Out[7]: 29L
>
> In [8]: rdd._jrdd.rdd().countApproxDistinct(8, 0)
> Out[8]: 26L
>
>
> Clearly, I am doing something wrong here :) What is also weird is that when
> I set p to 8, I should get a more accurate number, but it's actually
> smaller. Any tips or pointers are much appreciated!
> Best,
> Diederik
>
>
>
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-countApproxDistinct-in-pyspark-tp10878.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
