how to use rdd.countApprox

Du Li Wed, 06 May 2015 07:55:54 -0700

I have to count RDD's in a spark streaming app. When data goes large, count() 
becomes expensive. Did anybody have experience using countApprox()? How 
accurate/reliable is it? 
The documentation is pretty modest. Suppose the timeout parameter is in 
milliseconds. Can I retrieve the count value by calling getFinalValue()? Does 
it block and return only after the timeout? Or do I need to define 
onComplete/onFail handlers to extract count value from the partial results?
Thanks,Du

how to use rdd.countApprox

Reply via email to