exact count using rdd.count()?

2014-10-27 Thread Josh J
Hi,

Is the following guaranteed to always provide an exact count?

foreachRDD(foreachFunc = rdd => {
   rdd.count()
})

In the documentation it mentions: "However, output operations (like foreachRDD)
have *at-least once* semantics, that is, the transformed data may get
written to an external entity more than once in the event of a worker
failure."

http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node

Thanks,
Josh


Re: exact count using rdd.count()?

2014-10-27 Thread Holden Karau
Hi Josh,

The count() call will return the correct number for each RDD; however,
foreachRDD doesn't return the result of its computation anywhere (it's
intended for operations with side effects, like updating an accumulator
or making a web request). You might want to look at transform, or the
count function itself on the DStream.
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.DStream
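
As a sketch of that suggestion (the app name, socket source, and batch interval here are illustrative, not from the thread), calling count() or transform() directly on the DStream keeps the per-batch count inside the streaming pipeline instead of discarding it in a foreachRDD side effect:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CountSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("CountSketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Hypothetical input source for illustration
    val lines = ssc.socketTextStream("localhost", 9999)

    // DStream.count(): yields a new DStream with one Long per batch,
    // so the count stays in the pipeline and can be acted on downstream
    val counts = lines.count()
    counts.print()

    // Alternatively, transform() exposes each batch's RDD and builds a
    // new DStream from whatever the function returns
    val sizes = lines.transform { rdd =>
      rdd.context.parallelize(Seq(rdd.count()))
    }
    sizes.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Both approaches return the count as streaming data rather than dropping it, which is what distinguishes them from the foreachRDD version in the question.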

Cheers,

Holden :)

On Mon, Oct 27, 2014 at 1:29 PM, Josh J joshjd...@gmail.com wrote:

 Hi,

 Is the following guaranteed to always provide an exact count?

 foreachRDD(foreachFunc = rdd => {
    rdd.count()
 })

 In the documentation it mentions: "However, output operations (like foreachRDD)
 have *at-least once* semantics, that is, the transformed data may get
 written to an external entity more than once in the event of a worker
 failure."


 http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node

 Thanks,
 Josh



