I am experimenting with countApprox in PySpark. I created an RDD of 10^8 numbers and called countApprox with different timeout and confidence values, but I could not get an approximate result: every run returns the exact number of elements. What is the approximation in countApprox supposed to do, and for which inputs and parameters does it take effect?
>>> import random
>>> rdd = sc.parallelize([random.choice(range(1000)) for i in range(10**8)], 50)
>>> rdd.countApprox(1, 0.8)
[Stage 12:> (0 + 0) / 50]
16/09/15 15:45:28 WARN TaskSetManager: Stage 12 contains a task of very large size (5402 KB). The maximum recommended task size is 100 KB.
[Stage 12:======================================================> (49 + 1) / 50]
100000000
>>> rdd.countApprox(1, 0.01)
16/09/15 15:45:45 WARN TaskSetManager: Stage 13 contains a task of very large size (5402 KB). The maximum recommended task size is 100 KB.
[Stage 13:====================================================> (47 + 3) / 50]
100000000
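For comparison, here is a minimal pure-Python sketch of the kind of behaviour one might expect from an approximate count under a time limit: count only the partitions that finish in time, then scale up by the fraction completed. This is an illustration under stated assumptions, not Spark's implementation. The helper `approx_count` and its `budget` parameter are hypothetical; `budget` (a number of partitions) stands in for the timeout so that the example stays deterministic.

```python
def approx_count(partitions, budget):
    """Count elements, stopping once `budget` partitions have been processed.

    Hypothetical helper for illustration only; `budget` is a deterministic
    stand-in for a timeout. Returns (count_or_estimate, is_exact).
    """
    total = len(partitions)
    done = 0   # partitions fully counted so far
    seen = 0   # elements counted so far
    for part in partitions:
        if done >= budget:
            break  # "timeout" reached: stop counting
        seen += len(part)
        done += 1
    if done == 0:
        return None, False   # nothing finished: no estimate available
    if done == total:
        return seen, True    # every partition finished: the count is exact
    # Extrapolate: scale the partial count by the fraction of partitions done.
    return seen * total / done, False


partitions = [list(range(n)) for n in (100, 120, 80, 100)]
exact = approx_count(partitions, budget=4)     # all four partitions counted
estimate = approx_count(partitions, budget=2)  # only the first two counted
```

With all partitions counted the result is exact; with only some of them counted the result is an extrapolation that can differ from the true total. The transcript above shows only the first kind of result, even with a timeout of 1.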