Recovery for Spark Streaming Kafka Direct with OffsetOutOfRangeException

2016-01-21 Thread Dan Dutrow
Hey Cody, I would have responded to the mailing list but it looks like this thread got aged off. I have the problem where one of my topics dumps more data than my spark job can keep up with. We limit the input rate with maxRatePerPartition. Eventually, when the data is aged off, I get the OffsetOutO
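The recovery idea behind this thread can be sketched in plain Scala: when Kafka's retention ages data off, any saved offset older than the broker's earliest retained offset must be clamped forward before the direct stream restarts, or the consumer requests data that no longer exists. The `clampOffsets` helper and `TopicPartition` case class below are hypothetical stand-ins, not part of the Spark or Kafka APIs.

```scala
case class TopicPartition(topic: String, partition: Int)

// Clamp each saved offset up to the broker's earliest retained offset,
// so a restarted direct stream never requests aged-off data
// (the condition that triggers OffsetOutOfRangeException).
def clampOffsets(saved: Map[TopicPartition, Long],
                 earliest: Map[TopicPartition, Long]): Map[TopicPartition, Long] =
  saved.map { case (tp, off) => tp -> math.max(off, earliest.getOrElse(tp, off)) }
```

In practice the earliest offsets would come from a leader-offset query against the brokers before building the stream; clamping silently skips the aged-off records, so it trades data loss for liveness.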

Re: Spark Stream Monitoring with Kafka Direct API

2015-12-09 Thread Dan Dutrow
ranges. See the definition of count() in recent versions of KafkaRDD for an example. On Wed, Dec 9, 2015 at 9:39 AM, Dan Dutrow wrote: Is there documentation for how to update the metrics (#messages per batch) in the Spark Streaming tab when using the Direct
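The count() approach referenced above can be illustrated without Spark: each batch of the direct stream carries its offset ranges, and the message count is just the summed span of those ranges. `OffsetRange` here is a simplified stand-in for Spark's `org.apache.spark.streaming.kafka.OffsetRange`.

```scala
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

// Messages in a batch = sum over partitions of (untilOffset - fromOffset),
// the same arithmetic KafkaRDD.count() performs on its offset ranges.
def batchCount(ranges: Seq[OffsetRange]): Long =
  ranges.map(r => r.untilOffset - r.fromOffset).sum
```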

Spark Stream Monitoring with Kafka Direct API

2015-12-09 Thread Dan Dutrow
Is there documentation for how to update the metrics (#messages per batch) in the Spark Streaming tab when using the Direct API? Does the Streaming tab get its information from Zookeeper or something else internally? -- Dan

Re: Kafka - streaming from multiple topics

2015-12-03 Thread Dan Dutrow
a particular partition, will that partition be ignored? On Wed, Dec 2, 2015 at 3:17 PM Cody Koeninger wrote: Use the direct stream. You can put multiple topics in a single stream, and differentiate them on a per-partition basis using the offset range. On Wed, Dec 2,
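A minimal sketch of the per-partition differentiation described above: in the direct stream, RDD partition i corresponds 1:1 to offsetRanges(i), so the topic name can be attached to records from inside something like `mapPartitionsWithIndex`. The `tagWithTopic` helper below is hypothetical; only the 1:1 partition/offset-range correspondence comes from the direct API.

```scala
case class OffsetRange(topic: String, partition: Int, fromOffset: Long, untilOffset: Long)

// RDD partition i maps 1:1 to offsetRanges(i) in the direct stream,
// so the topic for a partition's records is a simple index lookup.
def tagWithTopic[A](offsetRanges: Array[OffsetRange],
                    partitionIndex: Int,
                    records: Iterator[A]): Iterator[(String, A)] = {
  val topic = offsetRanges(partitionIndex).topic
  records.map(r => (topic, r))
}
```

In real Spark code the offset ranges would come from `rdd.asInstanceOf[HasOffsetRanges].offsetRanges` inside a `foreachRDD` or `transform`, before any repartitioning breaks the 1:1 mapping.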

Re: Kafka - streaming from multiple topics

2015-12-02 Thread dutrow
After digging into the Kafka code some more (specifically kafka.consumer.KafkaStream, kafka.consumer.ConsumerIterator and kafka.message.MessageAndMetadata), it appears that the Left value of the tuple is not the topic name but rather a key that Kafka puts on each message. See http://kafka.apache.or
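Since the tuple's left value is the Kafka message key rather than the topic, the direct stream's messageHandler can instead map each `MessageAndMetadata` to a (topic, payload) pair. The case class below is a simplified stand-in for `kafka.message.MessageAndMetadata`, which carries the topic, partition, offset, key, and message.

```scala
// Simplified stand-in for kafka.message.MessageAndMetadata.
case class MessageAndMetadata(topic: String, partition: Int, offset: Long,
                              key: String, message: String)

// A messageHandler that keeps the topic name instead of the message key,
// so downstream code can tell which topic each record came from.
def topicHandler(mmd: MessageAndMetadata): (String, String) = (mmd.topic, mmd.message)
```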

Re: Kafka - streaming from multiple topics

2015-12-02 Thread Dan Dutrow
Use the direct stream. You can put multiple topics in a single stream, and differentiate them on a per-partition basis using the offset range. On Wed, Dec 2, 2015 at 2:13 PM, dutrow wrote: I found the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2388

Re: Kafka - streaming from multiple topics

2015-12-02 Thread dutrow
I found the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-2388 It was marked as invalid. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Kafka-streaming-from-multiple-topics-tp8678p25550.html Sent from the Apache Spark User List mailing list arch

Re: Kafka - streaming from multiple topics

2015-12-02 Thread dutrow
My need is similar; I have 10+ topics and don't want to dedicate 10 cores to processing all of them. Like yourself and others, the (String, String) pair that comes out of the DStream has (null, StringData...) values instead of (topic name, StringData...) Did anyone ever find a way around this issu

Re: Different Kafka createDirectStream implementations

2015-09-08 Thread Dan Dutrow
for determining whether to start the stream at the beginning or the end of the log... if someone's provided explicit offsets, it's pretty clear where to start the stream. On Tue, Sep 8, 2015 at 1:19 PM, Dan Dutrow wrote: Yes, but one implementation chec

Re: Different Kafka createDirectStream implementations

2015-09-08 Thread Dan Dutrow
On Tue, Sep 8, 2015 at 11:44 AM, Dan Dutrow wrote: The two methods of createDirectStream appear to have different implementations, the second checks the offset.reset flags and does some error handling while the first does not. Besides the use of a messageHandler

Different Kafka createDirectStream implementations

2015-09-08 Thread Dan Dutrow
The two methods of createDirectStream appear to have different implementations, the second checks the offset.reset flags and does some error handling while the first does not. Besides the use of a messageHandler, are they intended to be used in different situations? def createDirectStream[ K: Clas
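The behavioral difference between the two overloads comes down to how the starting position is chosen: explicit fromOffsets win, and only in their absence does auto.offset.reset decide between earliest ("smallest") and latest ("largest"). A plain-Scala sketch of that decision logic (the helper name is hypothetical):

```scala
// Start-position choice implied by the two createDirectStream overloads:
// explicit offsets win; otherwise auto.offset.reset selects the earliest
// ("smallest") or latest ("largest") offset on the broker.
def startOffset(explicit: Option[Long], autoOffsetReset: String,
                earliest: Long, latest: Long): Long =
  explicit.getOrElse(if (autoOffsetReset == "smallest") earliest else latest)
```

This is consistent with Cody's point upthread: when the caller supplies explicit offsets there is nothing to decide, which is why only the offset-less overload needs to consult the reset flag.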

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread Dan Dutrow
save offsets to ZK, you can. The right way to make that easier is to expose the (currently private) methods that already exist in KafkaCluster.scala for committing offsets through Kafka's api. I don't think adding another "do the wrong thing" option is beneficial

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread dutrow
In summary, it appears that the use of the DirectAPI was intended specifically to enable exactly-once semantics. This can be achieved for idempotent transformations and with transactional processing using the database to guarantee an "onto" mapping of results based on inputs. For the latter, you ne
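The transactional path for exactly-once can be sketched with a toy in-memory store (hypothetical, standing in for a real database): results and the batch's ending offset are committed together, and a batch whose offset was already committed is skipped, which is what makes reprocessing after a failure safe.

```scala
// Toy transactional store: results and the batch's ending offset commit
// atomically, and replaying an already-committed batch is a no-op.
class TxStore {
  private var results = Vector.empty[String]
  private var committedOffset = 0L

  def commit(batch: Seq[String], upToOffset: Long): Unit = synchronized {
    if (upToOffset > committedOffset) { // already-committed batches are skipped
      results = results ++ batch
      committedOffset = upToOffset
    }
  }

  def offset: Long = committedOffset
  def size: Int = results.size
}
```

With a real database the same idea uses one SQL transaction per batch: insert results, update the stored offset, commit; on restart, read the stored offset back and pass it as fromOffsets.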

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread dutrow
For those who find this post and may be interested, the most thorough documentation on the subject may be found here: https://github.com/koeninger/kafka-exactly-once

Re: Maintaining Kafka Direct API Offsets

2015-08-14 Thread dutrow
How do I get beyond the "This post has NOT been accepted by the mailing list yet" message? This message was posted through the nabble interface; one would think that would be enough to get the message accepted.