Error recovery strategies using the DirectKafkaInputDStream

2015-05-14 Thread badgerpants
We've been using the new DirectKafkaInputDStream to implement an exactly-once processing solution that tracks the provided offset ranges within the same transaction that persists our data results. When an exception is thrown within the processing loop and the configured number of retries are …
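
A minimal sketch of the pattern described above, assuming the Spark 1.3 direct-stream API and ScalikeJDBC for the transaction; the metrics/offsets table names and a connection pool configured on each executor are assumptions, not details from the thread:

    import org.apache.spark.TaskContext
    import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}
    import scalikejdbc._

    stream.foreachRDD { rdd =>
      // the direct stream exposes the exact Kafka offsets backing this RDD
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { events =>
        // valid only while RDD partitions are still 1:1 with Kafka
        // partitions, i.e. before any shuffle
        val osr = offsetRanges(TaskContext.get.partitionId)

        // results and offsets commit or roll back together, so a failed
        // batch is simply reprocessed from the stored offsets on retry
        DB.localTx { implicit session =>
          events.foreach { case (_, value) =>
            sql"insert into metrics (payload) values ($value)".update.apply()
          }
          sql"""update offsets set offset = ${osr.untilOffset}
                where topic = ${osr.topic} and partition = ${osr.partition}"""
            .update.apply()
        }
      }
    }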

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
Cody Koeninger-2 wrote: What's your schema for the offset table, and what's the definition of writeOffset? The schema is the same as the one in your post: topic | partition | offset. The writeOffset is nearly identical: def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = { …
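
For reference, a plausible completion of the truncated definition, again using ScalikeJDBC; the offsets table name is an assumption, while the columns follow the topic | partition | offset schema quoted above:

    import org.apache.spark.streaming.kafka.OffsetRange
    import scalikejdbc._

    // hypothetical completion: advance the stored offset for this
    // (topic, partition) pair to the end of the processed range
    def writeOffset(osr: OffsetRange)(implicit session: DBSession): Unit = {
      sql"""update offsets set offset = ${osr.untilOffset}
            where topic = ${osr.topic} and partition = ${osr.partition}"""
        .update.apply()
    }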

practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
We're a group of experienced backend developers who are fairly new to Spark Streaming (and Scala) and very interested in using the new (in 1.3) DirectKafkaInputDStream impl as part of the metrics reporting service we're building. Our flow involves reading in metric events, lightly modifying some …
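
A minimal sketch of setting up that flow with the Spark 1.3 direct API; the broker list, topic name, and batch interval are placeholders:

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val conf = new SparkConf().setAppName("metrics-reporting")
    val ssc = new StreamingContext(conf, Seconds(10))

    // no receivers and no ZooKeeper-committed offsets: the stream itself
    // computes the offset ranges for each batch, which is what makes
    // exactly-once bookkeeping possible downstream
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc,
      Map("metadata.broker.list" -> "broker1:9092,broker2:9092"),
      Set("metric-events"))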

Re: practical usage of the new exactly-once supporting DirectKafkaInputDStream

2015-04-30 Thread badgerpants
Cody Koeninger-2 wrote: In fact, you're using the 2-arg form of reduceByKey to shrink it down to 1 partition: reduceByKey(sumFunc, 1). But you started with 4 Kafka partitions? So they're definitely no longer 1:1. True. I added the second arg because we were seeing multiple threads …
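
To illustrate the point under discussion (the pairs DStream and sumFunc are hypothetical stand-ins): the 2-arg form introduces a shuffle, so the resulting partitions no longer line up 1:1 with the Kafka partitions and their offset ranges:

    // assume `pairs` is a DStream[(String, Long)] derived from a direct
    // stream over 4 Kafka partitions
    val sumFunc: (Long, Long) => Long = _ + _

    val perKey = pairs.reduceByKey(sumFunc)     // 4 partitions in, 4 out
    val merged = pairs.reduceByKey(sumFunc, 1)  // shuffled into 1 partition;
                                                // offsetRanges can no longer
                                                // be indexed by partition id

This is why the offset ranges have to be captured from the stream's RDDs before any shuffling transformation, as in the foreachRDD sketch earlier in this digest.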