Context: Process data coming from Kafka and send the results back to Kafka.

Issue: Each event can take several seconds to process (work is in progress
to improve that). During that time, events (and RDDs) accumulate.
Intermediate events for a given key do not have to be processed, only the
last ones. So when one processing run finishes, ideally Spark Streaming
would skip all events that are no longer the latest ones for their key.

I'm not sure this can be done with the Spark Streaming API alone. As I
understand Spark Streaming, DStream RDDs accumulate and are processed one
by one, without considering whether newer ones have arrived afterwards.

Possible solutions:

Use only the Spark Streaming API, but I'm not sure how. updateStateByKey
seems like a candidate, but I'm not sure it will behave properly when
DStream RDDs accumulate and only the last events per key should be processed.
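
Here is roughly what I have in mind with updateStateByKey (untested sketch
on my side; Event, the String key and the Kafka wiring are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Placeholder event type for illustration; the real payload comes from Kafka.
case class Event(timestamp: Long, payload: String)

object LatestPerKeyWithState {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("latest-per-key"), Seconds(10))
    ssc.checkpoint("/tmp/spark-checkpoint") // updateStateByKey requires a checkpoint directory

    // Keyed DStream built from the Kafka source (Kafka wiring omitted here).
    val events: DStream[(String, Event)] = ???

    // State = the most recent event seen so far for each key.
    val latest: DStream[(String, Event)] = events.updateStateByKey[Event] {
      (batchEvents: Seq[Event], state: Option[Event]) =>
        (batchEvents ++ state.toSeq).reduceOption((a, b) => if (a.timestamp >= b.timestamp) a else b)
    }

    // Heavy processing then sees at most one event per key per batch.
    latest.foreachRDD { rdd =>
      rdd.foreach { case (_, event) => /* process and write the result back to Kafka */ }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

One thing that worries me with this approach is that updateStateByKey emits
the state for every key at every batch, so I would still need something to
avoid reprocessing keys that received no new events.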

Have two Spark Streaming pipelines. The first gets the last updated event
for each key and stores it in a map or a database. The second pipeline
processes an event only if it is still the last one, as indicated by the
first pipeline.
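
Roughly, I picture the second solution like this (untested sketch; Event is
the same placeholder type as above, and LatestStore is just a placeholder
for whatever map or database ends up holding the latest timestamp per key):

// Placeholder: map or database recording the latest timestamp per key.
object LatestStore {
  def put(key: String, ts: Long): Unit = ???
  def isLatest(key: String, ts: Long): Boolean = ???
}

// Inside the same StreamingContext setup as the sketch above.
val events: DStream[(String, Event)] = ??? // the Kafka-backed DStream

// Pipeline 1 (fast): record the most recent timestamp seen for each key.
events
  .reduceByKey((a, b) => if (a.timestamp >= b.timestamp) a else b)
  .foreachRDD { rdd =>
    rdd.collect().foreach { case (key, event) => LatestStore.put(key, event.timestamp) }
  }

// Pipeline 2 (slow): heavy processing, only for events that are still the latest.
events.foreachRDD { rdd =>
  rdd.collect().foreach { case (key, event) =>
    if (LatestStore.isLatest(key, event.timestamp)) {
      // expensive processing + write the result back to Kafka
    }
  }
}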


Sub-questions for the second solution:

Could two pipelines share the same StreamingContext and process the same
DStream at different speeds (light processing vs. heavy)?
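
From what I have read, output operations registered on one StreamingContext
run sequentially within each batch by default; I believe there is an
undocumented spark.streaming.concurrentJobs setting that would let them run
in parallel, but I'm not sure it is the right tool here:

// Assumption on my side: this undocumented setting allows several streaming
// jobs (e.g. the two output operations above) to run concurrently.
val conf = new SparkConf()
  .setAppName("two-pipelines")
  .set("spark.streaming.concurrentJobs", "2")
val ssc = new StreamingContext(conf, Seconds(10))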

Is it easily possible to share values (a map, for example) between
pipelines without using an external database? I think accumulators or
broadcast variables could work, but between two pipelines I'm not sure.
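
As far as I understand, broadcast variables are read-only once created and
accumulators are only readable on the driver, so maybe a plain driver-side
singleton would be enough here, since both foreachRDD bodies above collect()
and then run on the driver (untested assumption on my part). Something like
an actual implementation of the LatestStore placeholder:

import scala.collection.concurrent.TrieMap

// Assumption: both pipelines only touch this map from driver-side code
// (inside foreachRDD, after collect()), so no external database is needed.
object LatestStore {
  private val latest = TrieMap.empty[String, Long]
  def put(key: String, ts: Long): Unit = latest.put(key, ts)
  def isLatest(key: String, ts: Long): Boolean = latest.get(key).forall(_ <= ts)
}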

Regards,

Julien Naour
