Re: spark with kafka

2015-04-19 Thread Cody Koeninger
Take a look at https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md if you haven't already. If you're fine with saving offsets yourself, I'd stick with KafkaRDD, as Koert said. I haven't tried 2-hour stream batch durations, so I can't vouch for using createDirectStream in that case.
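
For readers who want the concrete shape of this, a minimal sketch of the KafkaRDD batch approach with self-managed offsets, assuming Spark 1.3's KafkaUtils.createRDD; the broker list, topic name, and offset values are placeholders:

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.kafka.{KafkaUtils, OffsetRange}

val sc = new SparkContext(new SparkConf().setAppName("kafka-batch-extract"))

// Brokers are contacted directly (simple consumer API); no ZooKeeper offset tracking
val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")

// You choose exactly which offsets to read; values here are illustrative.
// In a real job these would come from wherever you saved them after the last run.
val offsetRanges = Array(
  OffsetRange("mytopic", 0, 0L, 1000L), // topic, partition, fromOffset, untilOffset
  OffsetRange("mytopic", 1, 0L, 1000L)
)

val rdd = KafkaUtils.createRDD[String, String, StringDecoder, StringDecoder](
  sc, kafkaParams, offsetRanges)

rdd.map(_._2).saveAsTextFile("hdfs:///output/kafka-extract")
```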

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
I mean to say it is simpler in case of failures, restarts, upgrades, etc., not just failures. But they did do a lot of work on streaming from Kafka in Spark 1.3.x to make it simpler (streaming simply calls KafkaRDD for every batch if you use KafkaUtils.createDirectStream), so maybe I am wrong and…
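
For context, a minimal sketch of that direct stream (Spark 1.3 API; the broker, topic, and 2-hour batch duration are illustrative):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(
  new SparkConf().setAppName("kafka-direct"), Seconds(2 * 60 * 60))

// No receivers: each batch interval the driver computes the new offset
// ranges and builds one KafkaRDD covering them
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("mytopic"))

stream.map(_._2).print()
ssc.start()
ssc.awaitTermination()
```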

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
Yeah, I think I would pick the second approach because it is simpler operationally in case of any failures. But of course the smaller the window gets, the more attractive the streaming solution gets. We do daily extracts, not every 2 hours.

Re: spark with kafka

2015-04-18 Thread Shushant Arora
Thanks Koert. So in short, for the high-level API I'll have to go with Spark Streaming only, and there the issue is handling cluster restarts. Is that why you opted for the second approach of a batch job, or was it due to the batch interval (2 hours is large for a streaming job), or some other reason?

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
KafkaRDD uses the simple consumer API, and I think you need to handle offsets yourself, unless things have changed since I last looked. I would go with the second approach.
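
To make "handle offsets yourself" concrete, one hedged sketch of the bookkeeping. loadOffset, saveOffset, and latestOffset are hypothetical helpers you would back with a database, ZooKeeper, or an HDFS file; they are not part of the Spark API:

```scala
import org.apache.spark.streaming.kafka.OffsetRange

// Hypothetical storage helpers -- supply your own implementation
def loadOffset(topic: String, partition: Int): Long = ???
def saveOffset(topic: String, partition: Int, offset: Long): Unit = ???
// Hypothetical lookup of the current end of the log for a partition
def latestOffset(topic: String, partition: Int): Long = ???

// Build this run's ranges starting where the previous run stopped
def rangesForRun(topic: String, partitions: Seq[Int]): Array[OffsetRange] =
  partitions.map { p =>
    OffsetRange(topic, p, loadOffset(topic, p), latestOffset(topic, p))
  }.toArray

// Persist the end of each range only after the batch's output is safely written
def commitOffsets(ranges: Array[OffsetRange]): Unit =
  ranges.foreach(r => saveOffset(r.topic, r.partition, r.untilOffset))
```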

Re: spark with kafka

2015-04-18 Thread Shushant Arora
Thanks!! I have a few more doubts: does KafkaRDD use the simple API or the high-level API for the Kafka consumer? That is, do I need to handle partition offsets myself, or will that be taken care of by KafkaRDD? Also, which one is better for batch programming? I have a requirement to read Kafka messages with a Spark…

Re: spark with kafka

2015-04-18 Thread Ilya Ganelin
That's a much better idea :)

Re: spark with kafka

2015-04-18 Thread Koert Kuipers
Use KafkaRDD directly. It is in the spark-streaming-kafka package.

On Sat, Apr 18, 2015 at 6:43 AM, Shushant Arora wrote:
> Hi
> I want to consume messages from a Kafka queue using a Spark batch program, not Spark Streaming. Is there any way to achieve this, other than using the low-level (simple API) of…
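
For anyone wiring this up, the package Koert mentions is pulled in as a normal dependency (the version shown is the contemporary Spark 1.3.x release; adjust it to match your cluster):

```scala
// build.sbt
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka" % "1.3.1"
```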

RE: spark with kafka

2015-04-18 Thread Ganelin, Ilya
Write the Kafka stream to HDFS via Spark Streaming, then ingest the files via Spark from HDFS.
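
A minimal sketch of the staging approach Ilya describes, assuming the Spark 1.3 direct stream (the paths, topic, broker, and one-minute interval are placeholders): a small streaming job continuously lands raw messages on HDFS, and the 2-hourly batch job then reads plain files.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Streaming side: land raw Kafka messages on HDFS, one directory per batch
val ssc = new StreamingContext(
  new SparkConf().setAppName("kafka-to-hdfs"), Seconds(60))
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, Map("metadata.broker.list" -> "broker1:9092"), Set("mytopic"))

// Writes directories named hdfs:///staging/events-<batchTimeMillis>
stream.map(_._2).saveAsTextFiles("hdfs:///staging/events")
ssc.start()
ssc.awaitTermination()
```

The batch side is then an ordinary sc.textFile("hdfs:///staging/events-*") over whichever batch directories fall inside the extract window.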