Have a look at KafkaRDD
https://spark.apache.org/docs/latest/api/java/org/apache/spark/streaming/kafka/KafkaRDD.html
Thanks
Best Regards
On Wed, Apr 29, 2015 at 10:04 AM, dgoldenberg dgoldenberg...@gmail.com wrote:
Hi,
I'm wondering about the use-case where you're not doing continuous, incremental streaming of data out of Kafka but rather want to publish data once with your Producer(s) and consume it once, in your Consumer, then terminate the consumer Spark job.
Yes, and Kafka topics are basically queues. So perhaps what's needed is just a KafkaRDD with a starting offset of 0 and an ending offset of some very large number...
Sent from my iPhone
On Apr 29, 2015, at 1:52 AM, ayan guha guha.a...@gmail.com wrote:
I guess what you mean is not streaming. If you create a streaming context at time t, you will receive data arriving from time t onward, not data published before t.
Looks like you want a queue. Let Kafka write to a queue, consume messages from the queue, and stop when the queue is empty.
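That drain-until-empty pattern can be sketched with a plain in-JVM queue (the Kafka wiring is omitted here; this only illustrates the stop condition): poll with a timeout, and stop on the first empty result.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

public class QueueDrain {
    // Consume messages until the queue is empty, then stop.
    public static int drain(BlockingQueue<String> queue) throws InterruptedException {
        int processed = 0;
        while (true) {
            // poll() removes the message; null after the timeout means the queue is empty
            String msg = queue.poll(100, TimeUnit.MILLISECONDS);
            if (msg == null) break;   // stop condition: queue drained
            processed++;              // process the message here
        }
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        queue.add("a");
        queue.add("b");
        queue.add("c");
        System.out.println(drain(queue)); // prints 3
    }
}
```

Note this only works because `poll` removes the message; as discussed below, reading from a Kafka topic doesn't remove anything, which is exactly why the stop condition is the sticking point.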
On 29 Apr 2015 14:35,
Part of the issue is that when you read messages from a topic, the messages are peeked, not polled, so there's no "when the queue is empty" condition, as I understand it.
So it would seem I'd want to do KafkaUtils.createRDD, which takes an array of OffsetRange's. Each OffsetRange is characterized by a topic, a partition, a starting offset (fromOffset), and an ending offset (untilOffset).
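A hedged sketch of that createRDD call, against the spark-streaming-kafka 0.8 Java API; the broker address, topic name, partition count, and offsets below are placeholders, and in practice the offsets would be looked up per partition first:

```java
import java.util.HashMap;
import java.util.Map;

import kafka.serializer.StringDecoder;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.streaming.kafka.KafkaUtils;
import org.apache.spark.streaming.kafka.OffsetRange;

public class ReadTopicOnce {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
            new SparkConf().setAppName("read-topic-once"));

        Map<String, String> kafkaParams = new HashMap<>();
        // "localhost:9092" is a placeholder broker address
        kafkaParams.put("metadata.broker.list", "localhost:9092");

        // One OffsetRange per topic partition: topic, partition, fromOffset, untilOffset.
        // The offsets here are illustrative only.
        OffsetRange[] ranges = {
            OffsetRange.create("my-topic", 0, 0L, 1000L)
        };

        JavaPairRDD<String, String> rdd = KafkaUtils.createRDD(
            sc, String.class, String.class,
            StringDecoder.class, StringDecoder.class,
            kafkaParams, ranges);

        System.out.println("messages read: " + rdd.count());
        sc.stop();  // batch job terminates once the RDD has been processed
    }
}
```

Because this is a plain batch job rather than a streaming context, it reads exactly the specified offset ranges and then exits, which matches the publish-once/consume-once use-case.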
The idea of peek vs poll doesn't apply to kafka, because kafka is not a
queue.
There are two ways of doing what you want, either using KafkaRDD or a
direct stream
The KafkaRDD approach would require you to find the beginning and ending offsets for each partition. For an example of this, see
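As a rough sketch of that offset lookup (not the example referenced above), Kafka 0.8's SimpleConsumer API can return the earliest and latest offset for a partition; the host, port, timeouts, and client id below are placeholders:

```java
import java.util.HashMap;
import java.util.Map;

import kafka.api.PartitionOffsetRequestInfo;
import kafka.common.TopicAndPartition;
import kafka.javaapi.OffsetResponse;
import kafka.javaapi.consumer.SimpleConsumer;

public class PartitionOffsets {
    // Fetch the earliest or latest offset for one partition.
    // whichTime is kafka.api.OffsetRequest.EarliestTime()
    // or kafka.api.OffsetRequest.LatestTime().
    static long fetchOffset(String host, int port, String topic,
                            int partition, long whichTime) {
        SimpleConsumer consumer =
            new SimpleConsumer(host, port, 10000, 64 * 1024, "offset-lookup");
        try {
            TopicAndPartition tp = new TopicAndPartition(topic, partition);
            Map<TopicAndPartition, PartitionOffsetRequestInfo> request =
                new HashMap<>();
            // maxNumOffsets = 1: we only want the single boundary offset
            request.put(tp, new PartitionOffsetRequestInfo(whichTime, 1));
            OffsetResponse response = consumer.getOffsetsBefore(
                new kafka.javaapi.OffsetRequest(
                    request, kafka.api.OffsetRequest.CurrentVersion(),
                    "offset-lookup"));
            return response.offsets(topic, partition)[0];
        } finally {
            consumer.close();
        }
    }
}
```

The earliest value would then become fromOffset and the latest untilOffset when building the OffsetRange for that partition.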
Thanks for the comments, Cody.
Granted, Kafka topics aren't queues. I was merely wishing that Kafka's
topics had some queue behaviors supported because often that is exactly
what one wants. The ability to poll messages off a topic seems like something lots of use-cases would want.
I'll explore both approaches.
Hi,
I'm wondering about the use-case where you're not doing continuous,
incremental streaming of data out of Kafka but rather want to publish data
once with your Producer(s) and consume it once, in your Consumer, then
terminate the consumer Spark job.
JavaStreamingContext jssc = new JavaStreamingContext(sparkConf, Durations.seconds(1)); // batch interval is illustrative