Re: Use KafkaRDD to Batch Process Messages from Kafka

2016-01-22 Thread Charles Chao
Thanks a lot for the help! I'll definitely check out KafkaCluster.scala. I'll probably first try to use that API from Java, and later try to build the subproject. Thanks, Charles On Fri, Jan 22, 2016 at 12:26 PM, Cody Koeninger wrote: > Yes, you should query Kafka if you want to know the latest

Re: Use KafkaRDD to Batch Process Messages from Kafka

2016-01-22 Thread Cody Koeninger
Yes, you should query Kafka if you want to know the latest available offsets. There's code to make this straightforward in KafkaCluster.scala, but the interface isn't public. There's an outstanding pull request to expose the API at https://issues.apache.org/jira/browse/SPARK-10963 but frankly it

Use KafkaRDD to Batch Process Messages from Kafka

2016-01-22 Thread Charles Chao
Hi, I have been using DirectKafkaInputDStream in Spark Streaming to consume Kafka messages and it's been working very well. Now I need to batch-process messages from Kafka: for example, retrieve all messages every hour, process them, and output to destinations like Hive or HDFS. I woul