Thanks!!
I have a few more doubts:

Does KafkaRDD use the simple (low-level) Kafka consumer API or the
high-level API? That is, do I need to handle partition offsets myself, or
will KafkaRDD take care of them? Also, which approach is better for batch
processing? My requirement is to read Kafka messages from a Spark job
every 2 hours.
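
From reading the docs, I think the batch read would look roughly like
this, with the caller supplying the offset ranges (untested sketch;
the broker address, topic, partitions and offset values are all
placeholders, and sc is an existing SparkContext):

    import kafka.serializer.StringDecoder
    import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

    // Direct-connection params for the simple consumer (placeholder broker).
    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")

    // KafkaRDD reads a fixed, caller-specified range per partition:
    // (topic, partition, fromOffset, untilOffset). Nothing is committed
    // anywhere, so offset tracking is up to the caller.
    val offsetRanges = Array(
      OffsetRange("mytopic", 0, 100L, 200L),
      OffsetRange("mytopic", 1, 100L, 200L)
    )

    val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
      sc, kafkaParams, offsetRanges)

    rdd.map(_._2).take(5).foreach(println)  // message values only

Is that the right picture, i.e. the offsets are entirely caller-supplied?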

1. One approach is to use Spark Streaming (with a batch duration of 2
hours) + Kafka. My doubt is: is Spark Streaming stable enough to handle a
cluster outage? If the Spark cluster gets restarted, will the streaming
application recover on its own, or do I need to restart it and pass in the
last offsets? How does that work? Also, will the executor nodes be
different in each batch interval, or, once decided, will the same nodes be
used throughout the application's life? And does Spark Streaming use the
high-level API for Kafka integration? A sketch of what I mean is below.
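
What I have in mind for approach 1 is roughly this (untested; the
checkpoint directory, broker and topic are placeholders):

    import kafka.serializer.StringDecoder
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka.KafkaUtils

    val checkpointDir = "hdfs:///tmp/kafka-2h-checkpoint"  // placeholder path

    def createContext(): StreamingContext = {
      val conf = new SparkConf().setAppName("kafka-every-2h")
      val ssc = new StreamingContext(conf, Seconds(2 * 60 * 60))  // 2-hour batches
      val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
      val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
        ssc, kafkaParams, Set("mytopic"))
      stream.map(_._2).foreachRDD { rdd =>
        // process one 2-hour batch of message values here
      }
      ssc.checkpoint(checkpointDir)
      ssc
    }

    // After a restart, getOrCreate rebuilds the context (including the
    // Kafka offsets) from the checkpoint instead of calling createContext.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()

My understanding is that getOrCreate recovers the offsets from the
checkpoint after a restart, but I would like to confirm that.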

2. The second approach is to use a Spark batch job and fire a new job
every 2 hours, using KafkaRDD to read from Kafka. The doubt here is: who
maintains the offsets of the last-read messages? Does my application need
to maintain them itself, or can I somehow use the high-level API here? A
rough sketch of what I imagine is below.
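
If my application has to maintain the offsets itself, I imagine doing
something like this between runs (rough sketch; a local file is just a
placeholder for the offset store, and fetching the current latest offsets
would presumably still need the simple consumer API):

    import java.nio.file.{Files, Paths}
    import org.apache.spark.streaming.kafka.OffsetRange

    // Placeholder store: one "topic,partition,offset" line per partition.
    val offsetFile = Paths.get("/var/lib/myjob/last-offsets.txt")

    def loadLastOffsets(): Map[(String, Int), Long] =
      if (Files.exists(offsetFile))
        scala.io.Source.fromFile(offsetFile.toFile).getLines().map { line =>
          val Array(t, p, o) = line.split(",")
          (t, p.toInt) -> o.toLong
        }.toMap
      else Map.empty  // first run: start from wherever we decide

    def saveOffsets(ranges: Array[OffsetRange]): Unit = {
      val lines = ranges.map(r => s"${r.topic},${r.partition},${r.untilOffset}")
      Files.write(offsetFile, lines.mkString("\n").getBytes)
    }

    // Each 2-hourly run: build OffsetRanges from (stored offset -> latest),
    // create the KafkaRDD, process it, then saveOffsets so the next run
    // resumes where this one stopped.

Is that what people do in practice, or is there a simpler option?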

Thanks
Shushant



On Sat, Apr 18, 2015 at 9:09 PM, Ilya Ganelin <ilgan...@gmail.com> wrote:

> That's a much better idea :)
>
> On Sat, Apr 18, 2015 at 11:22 AM Koert Kuipers <ko...@tresata.com> wrote:
>
>> Use KafkaRDD directly. It is in the spark-streaming-kafka package.
>>
>> On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora <
>> shushantaror...@gmail.com> wrote:
>>
>>> Hi
>>>
>>> I want to consume messages from a Kafka queue using a Spark batch
>>> program, not Spark Streaming. Is there any way to achieve this other
>>> than using the low-level (simple) Kafka consumer API?
>>>
>>> Thanks
>>>
>>
>>
