I am experimenting with countApprox. I created an RDD of 10^8 numbers and ran 
countApprox with different parameters, but I could not get an approximate 
result: every run returns the exact number of elements. What effect is the 
approximation in countApprox supposed to have, and for which inputs and 
parameters does it show up?

>>> import random
>>> rdd = sc.parallelize([random.choice(range(1000)) for i in range(10**8)], 50)
>>> rdd.countApprox(1, 0.8)
[Stage 12:>                                                        (0 + 0) / 50]
16/09/15 15:45:28 WARN TaskSetManager: Stage 12 contains a task of very large size (5402 KB). The maximum recommended task size is 100 KB.
[Stage 12:======================================================> (49 + 1) / 50]
100000000
>>> rdd.countApprox(1, 0.01)
16/09/15 15:45:45 WARN TaskSetManager: Stage 13 contains a task of very large size (5402 KB). The maximum recommended task size is 100 KB.
[Stage 13:====================================================>   (47 + 3) / 50]
100000000
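
To make the question concrete, here is a minimal sketch of the kind of effect I would expect from an approximate count. The PySpark documentation describes countApprox as returning a potentially incomplete result within a timeout, even if not all tasks have finished. The helper extrapolated_count below is hypothetical and is not Spark's implementation: it assumes the estimate is just the partial sum scaled by the fraction of partitions that reported in time.

```python
def extrapolated_count(finished_counts, total_partitions):
    """Estimate a total count from the partitions that finished before a timeout.

    Hypothetical helper, not part of Spark: it scales the partial sum by the
    fraction of partitions that reported (an assumed, simplified estimator).
    """
    if total_partitions <= 0:
        raise ValueError("total_partitions must be positive")
    finished = len(finished_counts)
    if finished == 0:
        return None  # nothing finished in time, so there is no estimate
    partial = sum(finished_counts)
    if finished == total_partitions:
        return partial  # every task finished: the result is exact
    # Only some tasks finished: extrapolate to the full set of partitions.
    return partial * total_partitions / finished


# 50 partitions of 2,000,000 elements each, mirroring the RDD above.
per_partition = [2_000_000] * 50
print(extrapolated_count(per_partition, 50))       # all 50 tasks reported
print(extrapolated_count(per_partition[:10], 50))  # only 10 tasks reported
```

Under this assumed scheme, a run in which every partition reports returns the exact count, and an estimate only appears when the timeout expires while tasks are still running. Whether that is what countApprox actually does is the point of the question.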
