Context: Process data coming from Kafka and send the results back to Kafka.

Issue: Each event can take several seconds to process (work is in progress to improve that). During that time, events (and RDDs) accumulate. Intermediate events (by key) do not have to be processed, only the last ones. So when processing of one event finishes, it would be ideal if Spark Streaming skipped all events that are not the current last ones (by key).
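To make the setup concrete, here is a simplified sketch of the kind of stream I have (field names and types are made up; events are assumed to be (key, (timestamp, payload)) pairs). Within a single batch I can already keep only the newest event per key with reduceByKey, but that does not help with the batches that have already been queued:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.dstream.DStream

    val conf = new SparkConf().setAppName("dedup-per-batch")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Hypothetical DStream coming from Kafka, already parsed into
    // (key, (timestamp, payload)) pairs.
    val events: DStream[(String, (Long, String))] = ???

    // Within one batch, keep only the most recent event per key.
    val lastPerBatch = events.reduceByKey { (a, b) =>
      if (a._1 >= b._1) a else b
    }

    lastPerBatch.foreachRDD { rdd =>
      rdd.foreach { case (key, (ts, payload)) =>
        // heavy processing (several seconds per event),
        // then write the result back to Kafka
      }
    }

    ssc.start()
    ssc.awaitTermination()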
I'm not sure this can be done with the Spark Streaming API alone. As I understand Spark Streaming, DStream RDDs accumulate and are processed one by one, without considering whether newer ones have arrived afterwards.

Possible solutions:

1) Use only the Spark Streaming API, but I'm not sure how. updateStateByKey seems like a candidate, but I'm not sure it behaves properly when DStream RDDs accumulate and you only want to process the last event per key (a rough sketch of what I mean is in the PS below).

2) Have two Spark Streaming pipelines. The first keeps track of the last updated event per key and stores it in a map or a database. The second processes an event only if it is the last one, as indicated by the first pipeline.

Sub-questions for the second solution:
- Can two pipelines share the same StreamingContext and process the same DStream at different speeds (one slow due to heavy processing, one fast)?
- Is it easy to share values (a map, for example) between pipelines without using an external database? I think an accumulator/broadcast variable could work, but I'm not sure between two pipelines.

Regards,
Julien Naour
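PS: Here is a rough, untested sketch of what I mean by the updateStateByKey idea, reusing the hypothetical events DStream from the sketch above (the checkpoint path is just an example). My worry is that every queued batch still gets processed one by one, so I'm not sure this actually skips the intermediate events:

    // Checkpointing is required for updateStateByKey (example path only).
    ssc.checkpoint("hdfs:///tmp/spark-checkpoint")

    // Keep only the newest (timestamp, payload) per key as the state.
    def keepLatest(newEvents: Seq[(Long, String)],
                   state: Option[(Long, String)]): Option[(Long, String)] = {
      val candidates = state.toSeq ++ newEvents
      if (candidates.isEmpty) None else Some(candidates.maxBy(_._1))
    }

    val latestByKey: DStream[(String, (Long, String))] =
      events.updateStateByKey(keepLatest)

    latestByKey.foreachRDD { rdd =>
      rdd.foreach { case (key, (ts, payload)) =>
        // heavy processing + write result back to Kafka
      }
    }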