How about using accumulators
<http://spark.apache.org/docs/1.2.0/programming-guide.html#accumulators>?
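For instance, something roughly like this (a quick untested sketch against the
1.2 API, assuming a (Long, String) stream just for illustration; it keeps your
overall structure but holds the running count in an accumulator and reads its
value on the driver inside transform(), so the per-batch offset is captured as
a fresh local val and there is no question about when the closure gets
re-serialized):

  import org.apache.spark.streaming.StreamingContext
  import org.apache.spark.streaming.dstream.DStream

  def assignGlobalIndexes(ssc: StreamingContext,
                          inputStream: DStream[(Long, String)]): DStream[(Long, String)] = {
    // driver-side running count of all items seen so far
    val totalNumberOfItems = ssc.sparkContext.accumulator(0L)

    // transform() runs this block on the driver once per batch, so
    // .value reflects the count accumulated up to the previous interval
    val globallyIndexed = inputStream.transform { rdd =>
      val offset = totalNumberOfItems.value
      rdd.map { case (localIdx, value) => (localIdx + offset, value) }
    }

    // foreachRDD is an action, so the accumulator update runs once per batch
    inputStream.foreachRDD { rdd =>
      totalNumberOfItems += rdd.count()
    }

    globallyIndexed
  }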

Thanks
Best Regards

On Wed, Jan 21, 2015 at 12:53 PM, Tobias Pfeiffer <t...@preferred.jp> wrote:

> Hi,
>
> I am developing a Spark Streaming application where I want every item in
> my stream to be assigned a unique, strictly increasing Long. My input data
> already has RDD-local integers (from 0 to N-1) assigned, so I am doing the
> following:
>
>   var totalNumberOfItems = 0L
>   // update the keys of the stream data
>   val globallyIndexedItems = inputStream.map(keyVal =>
>       (keyVal._1 + totalNumberOfItems, keyVal._2))
>   // increase the number of total seen items
>   inputStream.foreachRDD(rdd => {
>     totalNumberOfItems += rdd.count
>   })
>
> Now this works on my local[*] Spark instance, but I was wondering if this
> is actually an ok thing to do. I don't want this to break when going to a
> YARN cluster...
>
> The function increasing totalNumberOfItems is closing over a var and
> running in the driver, so I think this is ok. Here is my concern: What
> about the function in the inputStream.map(...) block? This one is closing
> over a var that has a different value in every interval. Will the closure
> be serialized with that new value in every interval? Or only once with the
> initial value and this will always be 0 during the runtime of the program?
>
> As I said, it works locally, but I was wondering if I can really assume
> that the closure is serialized with a new value in every interval.
>
> Thanks,
> Tobias
>
>
