KafkaRDD uses the simple consumer API, and I think you need to handle the offsets yourself, unless things have changed since I last looked. A rough sketch of what that looks like is below.
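Something along these lines, a minimal sketch using KafkaUtils.createRDD from the spark-streaming-kafka package (Spark 1.3+); the broker list, topic name, partitions, and offsets are placeholders you would fill in yourself:

import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaBatchRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))

    // Broker addresses for the simple consumer; hosts/ports are placeholders.
    val kafkaParams = Map[String, String](
      "metadata.broker.list" -> "broker1:9092,broker2:9092")

    // One OffsetRange per topic partition. You supply the inclusive start
    // offset and exclusive end offset yourself -- KafkaRDD does not track them.
    val offsetRanges = Array(
      OffsetRange("mytopic", partition = 0, fromOffset = 0L, untilOffset = 1000L),
      OffsetRange("mytopic", partition = 1, fromOffset = 0L, untilOffset = 1000L)
    )

    // Returns an RDD[(key, value)] covering exactly the given ranges.
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    rdd.map(_._2).take(10).foreach(println)
    sc.stop()
  }
}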
I would go with the second approach.

On Sat, Apr 18, 2015 at 2:42 PM, Shushant Arora <shushantaror...@gmail.com> wrote:
> Thanks!!
> I have a few more doubts:
>
> Does KafkaRDD use the simple API or the high-level API for the Kafka
> consumer? I mean, do I need to handle partition offsets myself, or will
> KafkaRDD take care of them? Also, which one is better for batch
> programming? I have a requirement to read Kafka messages in a Spark job
> at a 2-hour interval.
>
> 1. One approach is to use Spark Streaming (with a stream duration of 2
> hours) + Kafka. My doubt is: is Spark Streaming stable enough to handle
> a cluster outage? If the Spark cluster gets restarted, will the streaming
> application be able to handle it, or do I need to restart the streaming
> application and pass the last offsets, or how is it going to work? Also,
> will the executor nodes be different in each run of the stream interval,
> or, once decided, will the same nodes be used throughout the
> application's life? Does Spark Streaming use the high-level API for
> Kafka integration?
>
> 2. The second approach is to use a Spark batch job and fire a new job
> at every 2-hour interval, using KafkaRDD to read from Kafka. Now the
> doubt is: who will maintain the offset of the last-read messages? Does
> my application need to maintain it, or can I use the high-level API
> here somehow?
>
> Thanks
> Shushant
>
>
> On Sat, Apr 18, 2015 at 9:09 PM, Ilya Ganelin <ilgan...@gmail.com> wrote:
>
>> That's a much better idea :)
>>
>> On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> Use KafkaRDD directly. It is in the spark-streaming-kafka package.
>>>
>>> On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora <
>>> shushantaror...@gmail.com> wrote:
>>>
>>>> Hi
>>>>
>>>> I want to consume messages from a Kafka queue using a Spark batch
>>>> program, not Spark Streaming. Is there any way to achieve this,
>>>> other than using the low-level (simple API) of the Kafka consumer?
>>>>
>>>> Thanks
>>>>
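On the "who maintains the offsets" question for the second approach: the mechanics are up to your application. One common pattern is to store each partition's untilOffset after a successful run and use it as the next run's fromOffset. A hypothetical sketch follows; the OffsetStore object, the file path, and the CSV format are all made up for illustration (people often use ZooKeeper or a database instead), and you would still query the brokers for the current latest offsets to build the next run's ranges:

import java.nio.file.{Files, Paths, StandardOpenOption}
import org.apache.spark.streaming.kafka.OffsetRange

// Hypothetical helper: persist the untilOffset of each partition after a
// successful run, and read it back as the fromOffset for the next run.
object OffsetStore {
  private val path = Paths.get("/var/lib/myjob/offsets.csv") // placeholder

  // Returns partition -> last committed offset; empty on the first run.
  def load(): Map[Int, Long] =
    if (Files.exists(path))
      scala.io.Source.fromFile(path.toFile).getLines().map { line =>
        val Array(p, o) = line.split(",")
        p.toInt -> o.toLong
      }.toMap
    else Map.empty

  // Call this only after the batch has been processed successfully,
  // so a failed run is simply retried from the same offsets.
  def save(ranges: Array[OffsetRange]): Unit = {
    val lines = ranges.map(r => s"${r.partition},${r.untilOffset}").mkString("\n")
    Files.write(path, lines.getBytes("UTF-8"),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
  }
}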