Hi all,

My company is using Spark Streaming and the Kafka APIs to process an event stream. We've got most of our application written, but we're stuck on "at least once" processing.
I created a demo to show roughly what we're doing here: https://github.com/bitborn/resilient-kafka-streaming-in-spark

The problem we're having is that when the application hits an exception (network issue, out of memory, etc.) it drops the batch it's processing. The ideal behavior is that every event is processed "at least once", even if that means processing some events more than once. Whether this happens via checkpointing, the write ahead log, or Kafka offsets is irrelevant, as long as we don't drop data. :)

A couple of things we've tried:
- Using the Kafka direct stream API, following Cody Koeninger's example: https://github.com/koeninger/kafka-exactly-once/blob/master/src/main/scala/example/IdempotentExample.scala
- Using checkpointing with both the low-level and high-level APIs
- Enabling the write ahead log

I've included a log here: https://github.com/bitborn/resilient-kafka-streaming-in-spark/blob/master/spark.log, but I'm afraid it doesn't reveal much.

The fact that others seem to get this working suggests we're missing some magic configuration, or are running it in a way that can't support the desired behavior. I'd really appreciate some pointers!

Thanks much,
Andrew Clarkson
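
P.S. For context, here's a minimal sketch of the shape we're attempting with the direct stream plus checkpointing (the broker address, topic, checkpoint path, and the processing step are placeholders, not the actual demo code):

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object AtLeastOnceSketch {
  // Placeholder path; in practice this should live on a reliable filesystem (HDFS/S3).
  val checkpointDir = "/tmp/checkpoint"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("at-least-once-sketch")
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint(checkpointDir)  // offsets and batch metadata get checkpointed here

    val kafkaParams = Map(
      "metadata.broker.list" -> "localhost:9092",
      "auto.offset.reset" -> "smallest")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("events"))

    stream.foreachRDD { rdd =>
      // Idempotent output goes here; if this throws, we expect the batch to be
      // retried from the checkpointed offsets rather than silently dropped.
      rdd.foreach { case (_, value) => println(value) }
    }
    ssc
  }

  def main(args: Array[String]): Unit = {
    // getOrCreate should recover pending batches/offsets from the checkpoint after a failure.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

My understanding is that with the direct stream there are no receivers, so the write ahead log shouldn't be needed and recovery comes from the checkpoint alone — please correct me if that's wrong.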