Thanks!! I have a few more doubts: does KafkaRDD use the simple (low-level) Kafka consumer API or the high-level API? In other words, do I need to manage partition offsets myself, or will KafkaRDD take care of that? Also, which approach is better for batch programming? My requirement is to read Kafka messages with a Spark job every 2 hours.
1. One approach is Spark Streaming (with a stream duration of 2 hours) + Kafka. My doubt is: is Spark Streaming stable enough to handle a cluster outage? If the Spark cluster gets restarted, will the streaming application recover on its own, or do I need to restart it and pass in the last offsets? How is that going to work? Also, will the executor nodes differ in each run of the streaming interval, or, once decided, will the same nodes be used throughout the application's life? Does Spark Streaming use the high-level API for Kafka integration?

2. The second approach is a Spark batch job, firing a new job every 2 hours and using KafkaRDD to read from Kafka. The doubt here is who maintains the offsets of the last-read messages: does my application need to maintain them, or can I somehow use the high-level API here?

Thanks
Shushant

On Sat, Apr 18, 2015 at 9:09 PM, Ilya Ganelin <ilgan...@gmail.com> wrote:

> That's a much better idea :)
>
> On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers <ko...@tresata.com> wrote:
>
>> Use KafkaRDD directly. It is in the spark-streaming-kafka package.
>>
>> On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I want to consume messages from a kafka queue using a spark batch
>>> program, not spark streaming. Is there any way to achieve this, other
>>> than using the low-level (simple API) kafka consumer?
>>>
>>> Thanks
>>
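For reference, the batch approach Koert suggests (KafkaUtils.createRDD from the spark-streaming-kafka package) can be sketched roughly as below. This is a hedged sketch against the Spark 1.x API: the topic name, broker addresses, and offset values are placeholders, and in a real job the fromOffset/untilOffset values would be loaded from wherever the application persists them between runs (KafkaRDD uses the simple consumer API underneath, so the application owns offset tracking).

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

object KafkaBatchRead {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-read"))

    // Placeholder offset ranges: topic, partition, fromOffset, untilOffset.
    // A real 2-hourly job would load the last committed offsets from its own
    // store (ZooKeeper, HDFS, a database, ...) and query Kafka for the
    // latest available offsets to build these ranges.
    val offsetRanges = Array(
      OffsetRange("mytopic", 0, 1000L, 2000L),
      OffsetRange("mytopic", 1, 1500L, 2500L)
    )

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

    // createRDD reads exactly the requested ranges, using Kafka's simple
    // (low-level) consumer API internally; the result is an RDD[(key, value)].
    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    rdd.map(_._2).foreach(msg => println(msg)) // process the message payloads

    // After a successful run, persist each range's untilOffset so the next
    // scheduled job knows where to resume.
    sc.stop()
  }
}
```

Because the ranges are explicit, a failed run can simply be re-executed over the same offsets, which makes this style a natural fit for a periodic batch job.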