Hi Josh,

The count() call will return the correct number for each RDD; however, foreachRDD doesn't return the result of its computation anywhere (it's intended for operations with side effects, like updating an accumulator or making a web request). You might want to look at transform, or the count function itself on the DStream: http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.streaming.dstream.DStream
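A rough sketch of both options, assuming you already have a StreamingContext `ssc` and an input DStream called `lines` (names are just for illustration):

```scala
import org.apache.spark.streaming.dstream.DStream

// Option 1: DStream.count() returns a new DStream whose single-element RDDs
// hold the count of each source RDD -- the result stays in the pipeline.
val counts: DStream[Long] = lines.count()
counts.print()

// Option 2: accumulate counts as a side effect inside foreachRDD, since
// foreachRDD discards the return value of its closure.
val totalRecords = ssc.sparkContext.accumulator(0L, "totalRecords")
lines.foreachRDD { rdd =>
  totalRecords += rdd.count()
}
```

Keep in mind the at-least-once caveat you quoted still applies to side effects in foreachRDD: on a worker failure the accumulator update may run more than once.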
Cheers,

Holden :)

On Mon, Oct 27, 2014 at 1:29 PM, Josh J <joshjd...@gmail.com> wrote:
> Hi,
>
> Is the following guaranteed to always provide an exact count?
>
> foreachRDD(foreachFunc = rdd => {
>   rdd.count()
>
> In the literature it mentions "However, output operations (like foreachRDD)
> have *at-least once* semantics, that is, the transformed data may get
> written to an external entity more than once in the event of a worker
> failure."
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#failure-of-a-worker-node
>
> Thanks,
> Josh
> --
> Cell : 425-233-8271