How about using accumulators <http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators>?
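A minimal sketch of what I mean, assuming a StreamingContext `ssc` and the same `inputStream` as in your code below (the names `totalItems` and `globallyIndexed` are just placeholders):

    // driver-side accumulator holding the running total of items seen so far
    val totalItems = ssc.sparkContext.accumulator(0L, "totalNumberOfItems")

    // transform is evaluated on the driver for every batch, so the current
    // accumulator value is read fresh each interval before shipping the closure
    val globallyIndexed = inputStream.transform { rdd =>
      val offset = totalItems.value
      rdd.map { case (localIndex, value) => (localIndex + offset, value) }
    }

    // bump the total on the driver once the batch size is known
    globallyIndexed.foreachRDD { rdd =>
      totalItems += rdd.count()
    }

Because the offset is captured inside the per-batch transform function rather than in a closure serialized once at startup, you avoid relying on how/when the var gets re-serialized.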
Thanks
Best Regards

On Wed, Jan 21, 2015 at 12:53 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:
> Hi,
>
> I am developing a Spark Streaming application where I want every item in
> my stream to be assigned a unique, strictly increasing Long. My input data
> already has RDD-local integers (from 0 to N-1) assigned, so I am doing the
> following:
>
>   var totalNumberOfItems = 0L
>   // update the keys of the stream data
>   val globallyIndexedItems = inputStream.map(keyVal =>
>     (keyVal._1 + totalNumberOfItems, keyVal._2))
>   // increase the number of total seen items
>   inputStream.foreachRDD(rdd => {
>     totalNumberOfItems += rdd.count
>   })
>
> Now this works on my local[*] Spark instance, but I was wondering if this
> is actually an ok thing to do. I don't want this to break when going to a
> YARN cluster...
>
> The function increasing totalNumberOfItems is closing over a var and
> running in the driver, so I think this is ok. Here is my concern: what
> about the function in the inputStream.map(...) block? This one is closing
> over a var that has a different value in every interval. Will the closure
> be serialized with that new value in every interval? Or only once with the
> initial value, so that it stays 0 for the runtime of the program?
>
> As I said, it works locally, but I was wondering if I can really assume
> that the closure is serialized with a new value in every interval.
>
> Thanks,
> Tobias