Take a look at https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md if you haven't already.
If you're fine with saving offsets yourself, I'd stick with KafkaRDD, as Koert said. I haven't tried 2-hour stream batch durations, so I can't vouch for using createDirectStream in that case. But if you really don't want to manage saving offsets yourself, you can try it (along with enabling checkpointing). Let us know if it works out for you.
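For what it's worth, here is a rough sketch of the KafkaRDD batch approach on Spark 1.3 (Scala). The topic name, broker list, offsets, and output path below are placeholders I made up, not anything from your setup, and you still have to persist the offsets yourself between runs:

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-extract"))

// Broker list for the simple consumer; no zookeeper consumer group is involved.
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// One OffsetRange per topic-partition: (topic, partition, fromOffset, untilOffset).
// fromOffset comes from wherever you persisted it after the previous run;
// untilOffset is how far you want this run to read.
val offsetRanges = Array(
  OffsetRange("events", 0, 100L, 250L),
  OffsetRange("events", 1, 90L, 240L)
)

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

// Do whatever processing you need; here just dump the message values.
rdd.map(_._2).saveAsTextFile("hdfs:///extracts/2015-04-18-14")

// After the job succeeds, save each untilOffset somewhere durable so the
// next 2-hour run can use it as its fromOffset.

The point is that createRDD takes explicit OffsetRanges, so the job is deterministic and re-runnable for the same ranges; whatever store you write the untilOffsets to becomes the source of truth for the next run.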
On Sat, Apr 18, 2015 at 2:17 PM, Koert Kuipers <ko...@tresata.com> wrote:

> I mean to say it is simpler in case of failures, restarts, upgrades, etc.
> Not just failures.
>
> But they did do a lot of work on streaming from Kafka in Spark 1.3.x to
> make it simpler (the stream simply calls KafkaRDD for every batch if you
> use KafkaUtils.createDirectStream), so maybe I am wrong and streaming is
> just as good an approach. Not sure...
>
> On Sat, Apr 18, 2015 at 3:13 PM, Koert Kuipers <ko...@tresata.com> wrote:
>
>> Yeah, I think I would pick the second approach because it is simpler
>> operationally in case of any failures. But of course the smaller the
>> window gets, the more attractive the streaming solution gets.
>>
>> We do daily extracts, not every 2 hours.
>>
>> On Sat, Apr 18, 2015 at 2:57 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>
>>> Thanks Koert.
>>>
>>> So in short, for the high-level API I'll have to go with Spark Streaming
>>> only, and there the issue is handling cluster restarts. Is that why you
>>> opted for the second approach of a batch job, or was it due to the batch
>>> interval (2 hours is large for a stream job), or some other reason?
>>>
>>> On Sun, Apr 19, 2015 at 12:20 AM, Koert Kuipers <ko...@tresata.com> wrote:
>>>
>>>> KafkaRDD uses the simple consumer API, and I think you need to handle
>>>> offsets yourself, unless things have changed since I last looked.
>>>>
>>>> I would do the second approach.
>>>>
>>>> On Sat, Apr 18, 2015 at 2:42 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>>>
>>>>> Thanks!
>>>>>
>>>>> I have a few more doubts:
>>>>>
>>>>> Does KafkaRDD use the simple API for the Kafka consumer or the
>>>>> high-level API? I mean, do I need to handle the offsets of partitions
>>>>> myself, or will that be taken care of by KafkaRDD? Also, which one is
>>>>> better for batch programming? I have a requirement to read Kafka
>>>>> messages with a Spark job at every 2-hour interval.
>>>>>
>>>>> 1. One approach is to use Spark Streaming (with a stream duration of
>>>>> 2 hours) + Kafka. My doubt is: is Spark Streaming stable enough to
>>>>> handle a cluster outage? If the Spark cluster gets restarted, will the
>>>>> stream application be able to handle it, or do I need to restart the
>>>>> stream application and pass the last offsets, or how is it going to
>>>>> work? Also, will the executor nodes be different in each run of the
>>>>> stream interval, or once decided will the same nodes be used throughout
>>>>> the application's life? Does Spark Streaming use the high-level API for
>>>>> Kafka integration?
>>>>>
>>>>> 2. The second approach is to use a Spark batch job and fire a new job
>>>>> at every 2-hour interval, using KafkaRDD to read from Kafka. Now the
>>>>> doubt is: who will maintain the offsets of the last read messages? Does
>>>>> my application need to maintain them, or can I use the high-level API
>>>>> here somehow?
>>>>>
>>>>> Thanks
>>>>> Shushant
>>>>>
>>>>> On Sat, Apr 18, 2015 at 9:09 PM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>>>>>
>>>>>> That's a much better idea :)
>>>>>>
>>>>>> On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers <ko...@tresata.com> wrote:
>>>>>>
>>>>>>> Use KafkaRDD directly. It is in the spark-streaming-kafka package.
>>>>>>>
>>>>>>> On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora <shushantaror...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi
>>>>>>>>
>>>>>>>> I want to consume messages from a Kafka queue using a Spark batch
>>>>>>>> program, not Spark Streaming. Is there any way to achieve this,
>>>>>>>> other than using the low-level (simple) API of the Kafka consumer?
>>>>>>>>
>>>>>>>> Thanks
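PS: for completeness, a rough sketch of the streaming alternative discussed above (createDirectStream with a 2-hour batch duration and checkpointing enabled). Again, the topic, brokers, and paths are placeholders, and as I said I haven't tried batch durations this long, so treat it as untested. Also keep in mind that checkpoint-based recovery generally does not survive an upgrade of the application code:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val checkpointDir = "hdfs:///checkpoints/kafka-extract"

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("kafka-2h-stream")
  val ssc = new StreamingContext(conf, Seconds(2 * 60 * 60))  // 2-hour batches
  ssc.checkpoint(checkpointDir)

  val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
  val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
    ssc, kafkaParams, Set("events"))

  // Offsets are tracked in the checkpoint, so a restart via getOrCreate
  // resumes from where the previous run left off.
  stream.map(_._2).saveAsTextFiles("hdfs:///extracts/stream")
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()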