Hi,

I am developing a Spark Streaming application where I want every item in my stream to be assigned a unique, strictly increasing Long. My input data already has RDD-local integers (from 0 to N-1) assigned, so I am doing the following:
var totalNumberOfItems = 0L

// update the keys of the stream data
val globallyIndexedItems = inputStream.map(keyVal =>
  (keyVal._1 + totalNumberOfItems, keyVal._2))

// increase the number of total seen items
inputStream.foreachRDD(rdd => {
  totalNumberOfItems += rdd.count
})

Now this works on my local[*] Spark instance, but I was wondering whether it is actually a safe thing to do. I don't want this to break when going to a YARN cluster...

The function increasing totalNumberOfItems closes over a var and runs in the driver, so I think this part is ok.

Here is my concern: what about the function in the inputStream.map(...) block? It closes over a var that has a different value in every interval. Will the closure be serialized with the new value in every interval? Or only once with the initial value, so that the offset stays 0 for the whole runtime of the program?

As I said, it works locally, but I was wondering if I can really assume that the closure is serialized with a new value in every interval.

Thanks,
Tobias
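P.S.: In case the closure is only serialized once, one workaround I can think of would be to read the offset inside transform() instead of map(), since as far as I understand the function passed to transform() is invoked on the driver once per batch interval, so each batch would pick up the then-current value of the counter before its closure is shipped. A minimal sketch (the helper name, the DStream element type, and the extra count() job per batch are just my assumptions for illustration, not part of my actual code):

import org.apache.spark.streaming.dstream.DStream

// driver-side counter of all items seen so far
var totalNumberOfItems = 0L

// inputStream carries (rddLocalIndex, value) pairs as above;
// String values are just a placeholder type for this sketch
def addGlobalIndex(inputStream: DStream[(Long, String)]): DStream[(Long, String)] =
  inputStream.transform { rdd =>
    val offset = totalNumberOfItems    // read on the driver, once per batch
    totalNumberOfItems += rdd.count()  // count() runs an extra job over this batch
    rdd.map { case (localIndex, value) => (localIndex + offset, value) }
  }

This pays for an extra pass over each batch (the count()), but the offset is fixed on the driver before the closure is serialized, so the result would not depend on when or how often Spark serializes the closure.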