Good question, Soumitra. It's a bit confusing. To break TD's code down a bit:
dstream.count() is a transformation (it returns a new DStream). It executes lazily, runs in the cluster on the underlying RDDs that come through in each batch, and produces a new DStream whose single element per batch is the number of records in that batch's RDD.

dstream.foreachRDD() is an output/action operation (it returns something other than a DStream - nothing, in this case). It triggers the lazy execution above, brings the per-batch count back to the driver via rdd.first(), and increments globalCount locally in the driver.

Per your specific question, RDD.count() is different in that it's an action: it materializes the RDD and returns the count of its elements to the driver. Confusing, indeed.

Accumulators, by contrast, are updated in parallel on the worker nodes across the cluster and are read locally in the driver (rough sketch of both approaches at the bottom of this mail, below the quoted thread).

On Fri, Aug 8, 2014 at 7:36 AM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:

> I want to keep track of the events processed in a batch.
>
> How come 'globalCount' works for DStream? I think a similar construct won't
> work for RDD, which is why there are accumulators.
>
> On Fri, Aug 8, 2014 at 12:52 AM, Tathagata Das <tathagata.das1...@gmail.com> wrote:
>
>> Do you mean that you want a continuously updated count as more
>> events/records are received in the DStream (remember, DStream is a
>> continuous stream of data)? Assuming that is what you want, you can use a
>> global counter
>>
>> var globalCount = 0L
>>
>> dstream.count().foreachRDD(rdd => { globalCount += rdd.first() })
>>
>> This globalCount variable will reside in the driver and will keep being
>> updated after every batch.
>>
>> TD
>>
>> On Thu, Aug 7, 2014 at 10:16 PM, Soumitra Kumar <kumar.soumi...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> I want to count the number of elements in the DStream, like RDD.count().
>>> Since there is no such method in DStream, I thought of using
>>> DStream.count and an accumulator.
>>>
>>> How do I use DStream.count() to count the number of elements in a DStream?
>>>
>>> How do I create a shared variable in Spark Streaming?
>>>
>>> -Soumitra.
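For completeness, here's a rough, untested sketch of both approaches (it assumes an existing StreamingContext 'ssc' and an input DStream 'dstream'; the names are just illustrative):

// Approach 1: TD's driver-side counter, updated once per batch.
var globalCount = 0L
dstream.count().foreachRDD { rdd =>
  // rdd holds exactly one element: the record count for this batch
  globalCount += rdd.first()
}

// Approach 2: an accumulator, incremented on the workers, read in the driver.
val eventCount = ssc.sparkContext.accumulator(0L)
dstream.foreachRDD { rdd =>
  rdd.foreach(_ => eventCount += 1L)             // runs on the worker nodes
  println("events so far: " + eventCount.value)  // .value is read in the driver
}

Both keep a running total across batches; the difference is just where the per-record counting happens (driver vs. workers).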