Hi TD, I've also been fighting this issue only to find the exact same solution you are suggesting. Too bad I didn't find either the post or the issue sooner.
I'm using a 1-second batch with N Kafka events per batch (1-to-1 with the state objects) and only calling the updateStateByKey function. This is my interpretation, please correct me if needed: because of Spark's lazy computation, the RDDs weren't being updated as expected on each batch-interval execution. My assumption was that as long as a streaming batch runs (with or without new messages), I should get updated RDDs, which was not happening. We only get updateStateByKey calls for objects that received events, or that are forced to compute through an output operation. I did not run further tests to confirm this, but that's the impression I got.

This doesn't fit our requirements, since we want to do duration updates based on the batch-interval execution, so I had to force computation of all the objects through a foreachRDD call.

I would also appreciate it if the priority of the issue could be increased. I assume the foreachRDD adds unnecessary resource allocation (although I'm not sure how much), as opposed to having the updates happen by default on each batch-interval execution.

Thanks,
Rod

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/streaming-window-not-behaving-as-advertised-v1-0-1-tp10453p11168.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
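P.S. To make the desired semantics concrete, here is a minimal plain-Python sketch (not actual Spark code; function and key names are hypothetical) of the per-batch state update I was expecting: the update function runs for every tracked key on every batch interval, with an empty event list for keys that received nothing, so a duration counter can still advance on empty batches. In Spark this only happens once an output operation such as foreachRDD forces the state DStream to compute.

```python
def update_state(new_events, state):
    """Per-key update (analogous to the updateStateByKey update function):
    accumulate the event count and advance the duration by one batch
    interval, even when no new events arrived for this key."""
    count, duration = state if state is not None else (0, 0)
    return (count + len(new_events), duration + 1)

def run_batch(states, batch_events):
    """One batch-interval pass. Crucially, it visits ALL previously known
    keys plus any new ones -- mimicking a batch whose computation has been
    forced by an output operation."""
    keys = set(states) | set(batch_events)
    return {k: update_state(batch_events.get(k, []), states.get(k))
            for k in keys}

states = {}
states = run_batch(states, {"sensor-a": ["e1", "e2"]})  # batch 1: two events
states = run_batch(states, {})                          # batch 2: no events
# sensor-a's duration still advances on the empty batch:
# states == {"sensor-a": (2, 2)}
```

Without the forced full pass (i.e., if only keys with new events were visited), sensor-a's duration would stay frozen across empty batches, which is the behavior I was seeing before adding the foreachRDD.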