I have to count RDDs in a Spark Streaming app. When the data grows large, count() becomes expensive. Does anybody have experience using countApprox()? How accurate/reliable is it? The documentation is pretty sparse. I assume the timeout parameter is in milliseconds. Can I retrieve the count value by calling getFinalValue()? Does it block and return only after the timeout? Or do I need to define onComplete/onFail handlers to extract the count value from the partial result?

Thanks, Du
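For reference, here is a minimal sketch of how I imagine countApprox would be used, assuming the timeout is in milliseconds and that PartialResult exposes both an initialValue (the estimate available when the timeout fires) and a blocking getFinalValue(). This is a local-mode illustration, not code from my streaming app:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.partial.{BoundedDouble, PartialResult}

object CountApproxSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("countApprox-sketch").setMaster("local[*]"))
    val rdd = sc.parallelize(1 to 1000000, numSlices = 8)

    // countApprox(timeout, confidence) launches the count job and returns
    // a PartialResult[BoundedDouble] after at most `timeout` milliseconds.
    val partial: PartialResult[BoundedDouble] =
      rdd.countApprox(timeout = 200, confidence = 0.95)

    // initialValue is whatever estimate was available at the timeout;
    // it does not block any further.
    val estimate: BoundedDouble = partial.initialValue
    println(s"estimate=${estimate.mean} in [${estimate.low}, ${estimate.high}]")

    // getFinalValue() blocks until the whole job finishes, so by then the
    // bounds have collapsed onto the exact count.
    val exact: BoundedDouble = partial.getFinalValue()
    println(s"exact=${exact.mean}")

    sc.stop()
  }
}
```

Alternatively, a non-blocking variant would register partial.onComplete { r => ... } instead of calling getFinalValue(), which is the part I am unsure about.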
- how to use rdd.countApprox Du Li