Hi all,

My company is using Spark Streaming and the Kafka APIs to process an event stream. We've got most of our application written, but we're stuck on "at least once" processing.
I created a demo to show roughly what we're doing here: https://github.com/bitborn/resilient-kafka-streaming-in-spark

The problem we're having is that when the application hits an exception (network issue, out of memory, etc.) it drops the batch it's processing. The ideal behavior is that every event is processed "at least once", even if that means processing some events more than once. Whether this happens via checkpointing, the write ahead log, or Kafka offsets is irrelevant, as long as we don't drop data. :)

A couple of things we've tried:
- Using the Kafka direct stream API, following Cody Koeninger's example: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/IdempotentExample.scala
- Using checkpointing with both the low-level and high-level APIs
- Enabling the write ahead log

I've included a log here: https://github.com/bitborn/resilient-kafka-streaming-in-spark/blob/master/spark.log, but I'm afraid it doesn't reveal much.

The fact that others seem to get this working suggests we're missing some magic configuration, or are running it in a way that can't support the desired behavior. I'd really appreciate some pointers!

Thanks much,
Andrew Clarkson
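
P.S. For context, here's a minimal sketch of the shape we're attempting with the direct stream plus checkpointing (the broker address, topic, checkpoint path, and the processing step are placeholders, not the actual demo code):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object AtLeastOnceSketch {
  // Placeholder path; in practice this should live on a reliable filesystem (HDFS/S3).
  val checkpointDir = "/tmp/checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("at-least-once-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)  // offsets and batch metadata get checkpointed here

    val kafkaParams = Map(
      "metadata.broker.list" -> "localhost:9092",
      "auto.offset.reset" -> "smallest")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      // Idempotent output goes here; if this throws, we expect the batch to be
      // retried from the checkpointed offsets rather than silently dropped.
      rdd.foreach { case (_, value) => println(value) }
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // getOrCreate should recover pending batches/offsets from the checkpoint after a failure.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

My understanding is that with the direct stream there are no receivers, so the write ahead log shouldn't be needed and recovery comes from the checkpoint alone — please correct me if that's wrong.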