Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-20 Thread Cody Koeninger
You want spark.streaming.kafka.maxRatePerPartition for the direct stream.
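
For the Spark 1.6.1 direct-stream setup described below, a minimal sketch of where that
setting goes (the rate, partition count, and batch interval are illustrative, not taken
from this thread):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical numbers: 10,000 records/sec/partition across 20 partitions with a
// 30s batch interval caps the first batch at ~6M records instead of the whole backlog.
val conf = new SparkConf()
  .setAppName("KafkaDirectReader")
  .set("spark.streaming.kafka.maxRatePerPartition", "10000")
  .set("spark.streaming.backpressure.enabled", "true") // lets later batches adapt on their own

val ssc = new StreamingContext(conf, Seconds(30))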

On Sat, Mar 18, 2017 at 3:37 PM, Mal Edwin  wrote:
>
> Hi,
> You can enable backpressure to handle this.
>
> spark.streaming.backpressure.enabled
> spark.streaming.receiver.maxRate
>
> Thanks,
> Edwin
>
> On Mar 18, 2017, 12:53 AM -0400, sagarcasual . ,
> wrote:
>
> Hi, we have Spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct
> approach. The streaming part works fine, but when we initially start the job we
> have to deal with a huge Kafka message backlog of millions of messages, and that
> first batch runs for over 40 hours; after 12 hours or so it becomes very slow,
> still crunching messages but at a very low rate. Once the job is caught up,
> subsequent batches are quick since the load is tiny. Any idea how to avoid this
> problem?
>
>
>

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark Streaming from Kafka, deal with initial heavy load.

2017-03-18 Thread Mal Edwin

Hi,
You can enable backpressure to handle this.

spark.streaming.backpressure.enabled
spark.streaming.receiver.maxRate

Thanks,
Edwin

On Mar 18, 2017, 12:53 AM -0400, sagarcasual . , wrote:
> Hi, we have Spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct
> approach. The streaming part works fine, but when we initially start the job we
> have to deal with a huge Kafka message backlog of millions of messages, and that
> first batch runs for over 40 hours; after 12 hours or so it becomes very slow,
> still crunching messages but at a very low rate. Once the job is caught up,
> subsequent batches are quick since the load is tiny. Any idea how to avoid this
> problem?




Spark Streaming from Kafka, deal with initial heavy load.

2017-03-17 Thread sagarcasual .
Hi, we have Spark 1.6.1 streaming from a Kafka (0.10.1) topic using the direct
approach. The streaming part works fine, but when we initially start the job we
have to deal with a huge Kafka message backlog of millions of messages, and that
first batch runs for over 40 hours; after 12 hours or so it becomes very slow,
still crunching messages but at a very low rate. Once the job is caught up,
subsequent batches are quick since the load is tiny. Any idea how to avoid this
problem?


Re: Spark streaming from Kafka best fit

2016-03-07 Thread pratik khadloya
Would using mapPartitions instead of map help here?

~Pratik
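
For context, a rough sketch of the difference being asked about (LogParser is a
hypothetical stand-in for whatever per-record setup is expensive; ParseStringToLogRecord
is from the code quoted further down):

// map: any setup done inside ParseStringToLogRecord is paid once per record
stream.map(line => ParseStringToLogRecord(line))

// mapPartitions: per-partition setup is paid once, then the iterator is consumed lazily
stream.mapPartitions { iter =>
  val parser = new LogParser()   // hypothetical reusable parser or connection
  iter.map(line => parser.parse(line))
}

Whether this helps depends on how large that setup cost is relative to the per-record work.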

On Tue, Mar 1, 2016 at 10:07 AM Cody Koeninger  wrote:

> You don't need an equal number of executor cores to partitions.  An
> executor can and will work on multiple partitions within a batch, one after
> the other.  The real issue is whether you are able to keep your processing
> time under your batch time, so that delay doesn't increase.
>
> On Tue, Mar 1, 2016 at 11:59 AM, Jatin Kumar 
> wrote:
>
>> Thanks Cody!
>>
>> I understand what you said and if I am correct it will be using 224
>> executor cores just for fetching + stage-1 processing of 224 partitions. I
>> will obviously need more cores for processing further stages and fetching
>> next batch.
>>
>> I will start with higher number of executor cores and see how it goes.
>>
>> --
>> Thanks
>> Jatin Kumar | Rocket Scientist
>> +91-7696741743 m
>>
>> On Tue, Mar 1, 2016 at 9:07 PM, Cody Koeninger 
>> wrote:
>>
>>> > "How do I keep a balance of executors which receive data from Kafka
>>> and which process data"
>>>
>>> I think you're misunderstanding how the direct stream works.  The
>>> executor which receives data is also the executor which processes data,
>>> there aren't separate receivers.  If it's a single stage worth of work
>>> (e.g. straight map / filter), the processing of a given partition is going
>>> to be done by the executor that read it from kafka.  If you do something
>>> involving a shuffle (e.g. reduceByKey), other executors will do additional
>>> processing.  The question of which executor works on which tasks is up to
>>> the scheduler (and getPreferredLocations, which only matters if you're
>>> running spark on the same nodes as kafka)
>>>
>>> On Tue, Mar 1, 2016 at 2:36 AM, Jatin Kumar <
>>> jku...@rocketfuelinc.com.invalid> wrote:
>>>
 Hello all,

 I see that there are as of today 3 ways one can read from Kafka in
 spark streaming:
 1. KafkaUtils.createStream()
 2. KafkaUtils.createDirectStream()
 3. Kafka-spark-consumer

 My spark streaming application has to read from 1 kafka topic with
 around 224 partitions, consuming data at around 150MB/s (~90,000
 messages/sec) which reduces to around 3MB/s (~1400 messages/sec) after
 filtering. After filtering I need to maintain top 1 URL counts. I don't
 really care about exactly once semantics as I am interested in rough
 estimate.

 Code:

 sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "false")
 sparkConf.setAppName("KafkaReader")
 val ssc = StreamingContext.getOrCreate(kCheckPointDir, 
 createStreamingContext)

 val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
 val kafkaParams = Map[String, String](
   "metadata.broker.list" -> "kafka.server.ip:9092",
   "group.id" -> consumer_group
 )

 val lineStreams = (1 to N).map{ _ =>
   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
 ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK).map(_._2)
 }

 ssc.union(
   lineStreams.map(stream => {
   stream.map(ParseStringToLogRecord)
 .filter(record => isGoodRecord(record))
 .map(record => record.url)
   })
 ).window(Seconds(120), Seconds(120))  // 2 Minute window
   .countByValueAndWindow(Seconds(1800), Seconds(120), 28) // 30 Minute moving window, 28 will probably help in parallelism
   .filter(urlCountTuple => urlCountTuple._2 > MIN_COUNT_THRESHHOLD)
   .mapPartitions(iter => {
 iter.toArray.sortWith((r1, r2) => r1._2.compare(r2._2) < 0).slice(0, 
 1000).iterator
   }, true)
   .foreachRDD((latestRDD, rddTime) => {
   printTopFromRDD(rddTime, latestRDD.map(record => (record._2, 
 record._1)).sortByKey(false).take(1000))
   })

 ssc.start()
 ssc.awaitTermination()

 Questions:

 a) I used #2 but I found that I couldn't control how many executors
 will be actually fetching from Kafka. How do I keep a balance of executors
 which receive data from Kafka and which process data? Do they keep changing
 for every batch?

 b) Now I am trying to use #1 creating multiple DStreams, filtering them
 and then doing a union. I don't understand why would the number of events
 processed per 120 seconds batch will change drastically. PFA the events/sec
 graph while running with 1 receiver. How to debug this?

 c) What will be the most suitable method to integrate with Kafka from
 above 3? Any recommendations for getting maximum performance, running the
 streaming application reliably in production environment?

 --
 Thanks
 Jatin Kumar


>>

Re: Spark streaming from Kafka best fit

2016-03-01 Thread Cody Koeninger
You don't need an equal number of executor cores to partitions.  An
executor can and will work on multiple partitions within a batch, one after
the other.  The real issue is whether you are able to keep your processing
time under your batch time, so that delay doesn't increase.
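
One way to watch that relationship is a StreamingListener; a small sketch, assuming an
already-created StreamingContext named ssc:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// If processing time or scheduling delay grows batch over batch, the job is
// not keeping up with its batch interval.
ssc.addStreamingListener(new StreamingListener {
  override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
    val info = batch.batchInfo
    println(s"batch ${info.batchTime}: processing=${info.processingDelay.getOrElse(-1L)} ms, " +
      s"scheduling delay=${info.schedulingDelay.getOrElse(-1L)} ms")
  }
})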

On Tue, Mar 1, 2016 at 11:59 AM, Jatin Kumar 
wrote:

> Thanks Cody!
>
> I understand what you said and if I am correct it will be using 224
> executor cores just for fetching + stage-1 processing of 224 partitions. I
> will obviously need more cores for processing further stages and fetching
> next batch.
>
> I will start with higher number of executor cores and see how it goes.
>
> --
> Thanks
> Jatin Kumar | Rocket Scientist
> +91-7696741743 m
>
> On Tue, Mar 1, 2016 at 9:07 PM, Cody Koeninger  wrote:
>
>> > "How do I keep a balance of executors which receive data from Kafka
>> and which process data"
>>
>> I think you're misunderstanding how the direct stream works.  The
>> executor which receives data is also the executor which processes data,
>> there aren't separate receivers.  If it's a single stage worth of work
>> (e.g. straight map / filter), the processing of a given partition is going
>> to be done by the executor that read it from kafka.  If you do something
>> involving a shuffle (e.g. reduceByKey), other executors will do additional
>> processing.  The question of which executor works on which tasks is up to
>> the scheduler (and getPreferredLocations, which only matters if you're
>> running spark on the same nodes as kafka)
>>
>> On Tue, Mar 1, 2016 at 2:36 AM, Jatin Kumar <
>> jku...@rocketfuelinc.com.invalid> wrote:
>>
>>> Hello all,
>>>
>>> I see that there are as of today 3 ways one can read from Kafka in spark
>>> streaming:
>>> 1. KafkaUtils.createStream()
>>> 2. KafkaUtils.createDirectStream()
>>> 3. Kafka-spark-consumer
>>>
>>> My spark streaming application has to read from 1 kafka topic with
>>> around 224 partitions, consuming data at around 150MB/s (~90,000
>>> messages/sec) which reduces to around 3MB/s (~1400 messages/sec) after
>>> filtering. After filtering I need to maintain top 1 URL counts. I don't
>>> really care about exactly once semantics as I am interested in rough
>>> estimate.
>>>
>>> Code:
>>>
>>> sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "false")
>>> sparkConf.setAppName("KafkaReader")
>>> val ssc = StreamingContext.getOrCreate(kCheckPointDir, 
>>> createStreamingContext)
>>>
>>> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>>> val kafkaParams = Map[String, String](
>>>   "metadata.broker.list" -> "kafka.server.ip:9092",
>>>   "group.id" -> consumer_group
>>> )
>>>
>>> val lineStreams = (1 to N).map{ _ =>
>>>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>>> ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK).map(_._2)
>>> }
>>>
>>> ssc.union(
>>>   lineStreams.map(stream => {
>>>   stream.map(ParseStringToLogRecord)
>>> .filter(record => isGoodRecord(record))
>>> .map(record => record.url)
>>>   })
>>> ).window(Seconds(120), Seconds(120))  // 2 Minute window
>>>   .countByValueAndWindow(Seconds(1800), Seconds(120), 28) // 30 Minute moving window, 28 will probably help in parallelism
>>>   .filter(urlCountTuple => urlCountTuple._2 > MIN_COUNT_THRESHHOLD)
>>>   .mapPartitions(iter => {
>>> iter.toArray.sortWith((r1, r2) => r1._2.compare(r2._2) < 0).slice(0, 
>>> 1000).iterator
>>>   }, true)
>>>   .foreachRDD((latestRDD, rddTime) => {
>>>   printTopFromRDD(rddTime, latestRDD.map(record => (record._2, 
>>> record._1)).sortByKey(false).take(1000))
>>>   })
>>>
>>> ssc.start()
>>> ssc.awaitTermination()
>>>
>>> Questions:
>>>
>>> a) I used #2 but I found that I couldn't control how many executors will
>>> be actually fetching from Kafka. How do I keep a balance of executors which
>>> receive data from Kafka and which process data? Do they keep changing for
>>> every batch?
>>>
>>> b) Now I am trying to use #1 creating multiple DStreams, filtering them
>>> and then doing a union. I don't understand why would the number of events
>>> processed per 120 seconds batch will change drastically. PFA the events/sec
>>> graph while running with 1 receiver. How to debug this?
>>>
>>> c) What will be the most suitable method to integrate with Kafka from
>>> above 3? Any recommendations for getting maximum performance, running the
>>> streaming application reliably in production environment?
>>>
>>> --
>>> Thanks
>>> Jatin Kumar
>>>
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>
>>
>


Re: Spark streaming from Kafka best fit

2016-03-01 Thread Jatin Kumar
Thanks Cody!

I understand what you said and if I am correct it will be using 224
executor cores just for fetching + stage-1 processing of 224 partitions. I
will obviously need more cores for processing further stages and fetching
next batch.

I will start with higher number of executor cores and see how it goes.

--
Thanks
Jatin Kumar | Rocket Scientist
+91-7696741743 m

On Tue, Mar 1, 2016 at 9:07 PM, Cody Koeninger  wrote:

> > "How do I keep a balance of executors which receive data from Kafka and
> which process data"
>
> I think you're misunderstanding how the direct stream works.  The executor
> which receives data is also the executor which processes data, there aren't
> separate receivers.  If it's a single stage worth of work (e.g. straight
> map / filter), the processing of a given partition is going to be done by
> the executor that read it from kafka.  If you do something involving a
> shuffle (e.g. reduceByKey), other executors will do additional processing.
> The question of which executor works on which tasks is up to the scheduler
> (and getPreferredLocations, which only matters if you're running spark on
> the same nodes as kafka)
>
> On Tue, Mar 1, 2016 at 2:36 AM, Jatin Kumar <
> jku...@rocketfuelinc.com.invalid> wrote:
>
>> Hello all,
>>
>> I see that there are as of today 3 ways one can read from Kafka in spark
>> streaming:
>> 1. KafkaUtils.createStream()
>> 2. KafkaUtils.createDirectStream()
>> 3. Kafka-spark-consumer
>>
>> My spark streaming application has to read from 1 kafka topic with around
>> 224 partitions, consuming data at around 150MB/s (~90,000 messages/sec)
>> which reduces to around 3MB/s (~1400 messages/sec) after filtering. After
>> filtering I need to maintain top 1 URL counts. I don't really care
>> about exactly once semantics as I am interested in rough estimate.
>>
>> Code:
>>
>> sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "false")
>> sparkConf.setAppName("KafkaReader")
>> val ssc = StreamingContext.getOrCreate(kCheckPointDir, 
>> createStreamingContext)
>>
>> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
>> val kafkaParams = Map[String, String](
>>   "metadata.broker.list" -> "kafka.server.ip:9092",
>>   "group.id" -> consumer_group
>> )
>>
>> val lineStreams = (1 to N).map{ _ =>
>>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
>> ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK).map(_._2)
>> }
>>
>> ssc.union(
>>   lineStreams.map(stream => {
>>   stream.map(ParseStringToLogRecord)
>> .filter(record => isGoodRecord(record))
>> .map(record => record.url)
>>   })
>> ).window(Seconds(120), Seconds(120))  // 2 Minute window
>>   .countByValueAndWindow(Seconds(1800), Seconds(120), 28) // 30 Minute moving window, 28 will probably help in parallelism
>>   .filter(urlCountTuple => urlCountTuple._2 > MIN_COUNT_THRESHHOLD)
>>   .mapPartitions(iter => {
>> iter.toArray.sortWith((r1, r2) => r1._2.compare(r2._2) < 0).slice(0, 
>> 1000).iterator
>>   }, true)
>>   .foreachRDD((latestRDD, rddTime) => {
>>   printTopFromRDD(rddTime, latestRDD.map(record => (record._2, 
>> record._1)).sortByKey(false).take(1000))
>>   })
>>
>> ssc.start()
>> ssc.awaitTermination()
>>
>> Questions:
>>
>> a) I used #2 but I found that I couldn't control how many executors will
>> be actually fetching from Kafka. How do I keep a balance of executors which
>> receive data from Kafka and which process data? Do they keep changing for
>> every batch?
>>
>> b) Now I am trying to use #1 creating multiple DStreams, filtering them
>> and then doing a union. I don't understand why would the number of events
>> processed per 120 seconds batch will change drastically. PFA the events/sec
>> graph while running with 1 receiver. How to debug this?
>>
>> c) What will be the most suitable method to integrate with Kafka from
>> above 3? Any recommendations for getting maximum performance, running the
>> streaming application reliably in production environment?
>>
>> --
>> Thanks
>> Jatin Kumar
>>
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>
>


Re: Spark streaming from Kafka best fit

2016-03-01 Thread Cody Koeninger
> "How do I keep a balance of executors which receive data from Kafka and
which process data"

I think you're misunderstanding how the direct stream works.  The executor
which receives data is also the executor which processes data, there aren't
separate receivers.  If it's a single stage worth of work (e.g. straight
map / filter), the processing of a given partition is going to be done by
the executor that read it from kafka.  If you do something involving a
shuffle (e.g. reduceByKey), other executors will do additional processing.
The question of which executor works on which tasks is up to the scheduler
(and getPreferredLocations, which only matters if you're running spark on
the same nodes as kafka)
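
A short sketch of that distinction using the direct stream (the topic and broker values
are placeholders; ParseStringToLogRecord and isGoodRecord are the functions from the code
quoted below):

import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.KafkaUtils

val kafkaParams = Map("metadata.broker.list" -> "kafka.server.ip:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, Set("logs"))   // "logs" is a placeholder topic name

// Narrow, single-stage work: each Kafka partition is processed by whichever
// executor read it; there is no separate receiver.
val urls = stream.map(_._2)
  .map(ParseStringToLogRecord)
  .filter(isGoodRecord)
  .map(_.url)

// A shuffle (e.g. reduceByKey) redistributes data, so other executors
// take part from this stage onward.
val counts = urls.map((_, 1L)).reduceByKey(_ + _)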

On Tue, Mar 1, 2016 at 2:36 AM, Jatin Kumar <
jku...@rocketfuelinc.com.invalid> wrote:

> Hello all,
>
> I see that there are as of today 3 ways one can read from Kafka in spark
> streaming:
> 1. KafkaUtils.createStream()
> 2. KafkaUtils.createDirectStream()
> 3. Kafka-spark-consumer
>
> My spark streaming application has to read from 1 kafka topic with around
> 224 partitions, consuming data at around 150MB/s (~90,000 messages/sec)
> which reduces to around 3MB/s (~1400 messages/sec) after filtering. After
> filtering I need to maintain top 1 URL counts. I don't really care
> about exactly once semantics as I am interested in rough estimate.
>
> Code:
>
> sparkConf.set("spark.streaming.receiver.writeAheadLog.enable", "false")
> sparkConf.setAppName("KafkaReader")
> val ssc = StreamingContext.getOrCreate(kCheckPointDir, createStreamingContext)
>
> val topicMap = topics.split(",").map((_, numThreads.toInt)).toMap
> val kafkaParams = Map[String, String](
>   "metadata.broker.list" -> "kafka.server.ip:9092",
>   "group.id" -> consumer_group
> )
>
> val lineStreams = (1 to N).map{ _ =>
>   KafkaUtils.createStream[String, String, StringDecoder, StringDecoder](
> ssc, kafkaParams, topicMap, StorageLevel.MEMORY_AND_DISK).map(_._2)
> }
>
> ssc.union(
>   lineStreams.map(stream => {
>   stream.map(ParseStringToLogRecord)
> .filter(record => isGoodRecord(record))
> .map(record => record.url)
>   })
> ).window(Seconds(120), Seconds(120))  // 2 Minute window
>   .countByValueAndWindow(Seconds(1800), Seconds(120), 28) // 30 Minute moving window, 28 will probably help in parallelism
>   .filter(urlCountTuple => urlCountTuple._2 > MIN_COUNT_THRESHHOLD)
>   .mapPartitions(iter => {
> iter.toArray.sortWith((r1, r2) => r1._2.compare(r2._2) < 0).slice(0, 
> 1000).iterator
>   }, true)
>   .foreachRDD((latestRDD, rddTime) => {
>   printTopFromRDD(rddTime, latestRDD.map(record => (record._2, 
> record._1)).sortByKey(false).take(1000))
>   })
>
> ssc.start()
> ssc.awaitTermination()
>
> Questions:
>
> a) I used #2 but I found that I couldn't control how many executors will
> be actually fetching from Kafka. How do I keep a balance of executors which
> receive data from Kafka and which process data? Do they keep changing for
> every batch?
>
> b) Now I am trying to use #1 creating multiple DStreams, filtering them
> and then doing a union. I don't understand why would the number of events
> processed per 120 seconds batch will change drastically. PFA the events/sec
> graph while running with 1 receiver. How to debug this?
>
> c) What will be the most suitable method to integrate with Kafka from
> above 3? Any recommendations for getting maximum performance, running the
> streaming application reliably in production environment?
>
> --
> Thanks
> Jatin Kumar
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>


Re: How to monitor Spark Streaming from Kafka?

2015-06-02 Thread Ruslan Dautkhanov
Nobody mentioned CM yet? Kafka is now supported by CM/CDH 5.4

http://www.cloudera.com/content/cloudera/en/documentation/cloudera-kafka/latest/PDF/cloudera-kafka.pdf




-- 
Ruslan Dautkhanov

On Mon, Jun 1, 2015 at 5:19 PM, Dmitry Goldenberg 
wrote:

> Thank you, Tathagata, Cody, Otis.
>
> - Dmitry
>
>
> On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic <
> otis.gospodne...@gmail.com> wrote:
>
>> I think you can use SPM - http://sematext.com/spm - it will give you all
>> Spark and all Kafka metrics, including offsets broken down by topic, etc.
>> out of the box.  I see more and more people using it to monitor various
>> components in data processing pipelines, a la
>> http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/
>>
>> Otis
>>
>> On Mon, Jun 1, 2015 at 5:23 PM, dgoldenberg 
>> wrote:
>>
>>> Hi,
>>>
>>> What are some of the good/adopted approaches to monitoring Spark
>>> Streaming
>>> from Kafka?  I see that there are things like
>>> http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they
>>> all
>>> assume that Receiver-based streaming is used?
>>>
>>> Then "Note that one disadvantage of this approach (Receiverless Approach,
>>> #2) is that it does not update offsets in Zookeeper, hence
>>> Zookeeper-based
>>> Kafka monitoring tools will not show progress. However, you can access
>>> the
>>> offsets processed by this approach in each batch and update Zookeeper
>>> yourself".
>>>
>>> The code sample, however, seems sparse. What do you need to do here? -
>>>  directKafkaStream.foreachRDD(
>>>    new Function<JavaPairRDD<String, String>, Void>() {
>>>      @Override
>>>      public Void call(JavaPairRDD<String, String> rdd) throws IOException {
>>>        OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd).offsetRanges
>>>        // offsetRanges.length = # of Kafka partitions being consumed
>>>        ...
>>>        return null;
>>>      }
>>>    }
>>>  );
>>>
>>> and if these are updated, will KafkaOffsetMonitor work?
>>>
>>> Monitoring seems to center around the notion of a consumer group.  But in
>>> the receiverless approach, code on the Spark consumer side doesn't seem
>>> to
>>> expose a consumer group parameter.  Where does it go?  Can I/should I
>>> just
>>> pass in group.id as part of the kafkaParams HashMap?
>>>
>>> Thanks
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Dmitry Goldenberg
Thank you, Tathagata, Cody, Otis.

- Dmitry


On Mon, Jun 1, 2015 at 6:57 PM, Otis Gospodnetic  wrote:

> I think you can use SPM - http://sematext.com/spm - it will give you all
> Spark and all Kafka metrics, including offsets broken down by topic, etc.
> out of the box.  I see more and more people using it to monitor various
> components in data processing pipelines, a la
> http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/
>
> Otis
>
> On Mon, Jun 1, 2015 at 5:23 PM, dgoldenberg 
> wrote:
>
>> Hi,
>>
>> What are some of the good/adopted approaches to monitoring Spark Streaming
>> from Kafka?  I see that there are things like
>> http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
>> assume that Receiver-based streaming is used?
>>
>> Then "Note that one disadvantage of this approach (Receiverless Approach,
>> #2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
>> Kafka monitoring tools will not show progress. However, you can access the
>> offsets processed by this approach in each batch and update Zookeeper
>> yourself".
>>
>> The code sample, however, seems sparse. What do you need to do here? -
>>  directKafkaStream.foreachRDD(
>>    new Function<JavaPairRDD<String, String>, Void>() {
>>      @Override
>>      public Void call(JavaPairRDD<String, String> rdd) throws IOException {
>>        OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd).offsetRanges
>>        // offsetRanges.length = # of Kafka partitions being consumed
>>        ...
>>        return null;
>>      }
>>    }
>>  );
>>
>> and if these are updated, will KafkaOffsetMonitor work?
>>
>> Monitoring seems to center around the notion of a consumer group.  But in
>> the receiverless approach, code on the Spark consumer side doesn't seem to
>> expose a consumer group parameter.  Where does it go?  Can I/should I just
>> pass in group.id as part of the kafkaParams HashMap?
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Otis Gospodnetic
I think you can use SPM - http://sematext.com/spm - it will give you all
Spark and all Kafka metrics, including offsets broken down by topic, etc.
out of the box.  I see more and more people using it to monitor various
components in data processing pipelines, a la
http://blog.sematext.com/2015/04/22/monitoring-stream-processing-tools-cassandra-kafka-and-spark/

Otis

On Mon, Jun 1, 2015 at 5:23 PM, dgoldenberg 
wrote:

> Hi,
>
> What are some of the good/adopted approaches to monitoring Spark Streaming
> from Kafka?  I see that there are things like
> http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
> assume that Receiver-based streaming is used?
>
> Then "Note that one disadvantage of this approach (Receiverless Approach,
> #2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
> Kafka monitoring tools will not show progress. However, you can access the
> offsets processed by this approach in each batch and update Zookeeper
> yourself".
>
> The code sample, however, seems sparse. What do you need to do here? -
>  directKafkaStream.foreachRDD(
>    new Function<JavaPairRDD<String, String>, Void>() {
>      @Override
>      public Void call(JavaPairRDD<String, String> rdd) throws IOException {
>        OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd).offsetRanges
>        // offsetRanges.length = # of Kafka partitions being consumed
>        ...
>        return null;
>      }
>    }
>  );
>
> and if these are updated, will KafkaOffsetMonitor work?
>
> Monitoring seems to center around the notion of a consumer group.  But in
> the receiverless approach, code on the Spark consumer side doesn't seem to
> expose a consumer group parameter.  Where does it go?  Can I/should I just
> pass in group.id as part of the kafkaParams HashMap?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Cody Koeninger
KafkaCluster.scala in the spark/external/kafka project has a bunch of API
code, including code for updating Kafka-managed ZK offsets.  Look at
setConsumerOffsets.

Unfortunately all of that code is private, but you can either write your
own, copy it, or do what I do (sed out private[spark] and rebuild spark...)
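
Roughly what that looks like once KafkaCluster is accessible (method names are taken from
the Spark 1.x external/kafka source and worth double-checking against your version; the
group id is a placeholder):

import kafka.common.TopicAndPartition
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaCluster}

// Assumes the private[spark] restriction has been removed or the class copied.
val kc = new KafkaCluster(kafkaParams)

directKafkaStream.foreachRDD { rdd =>
  val offsets = rdd.asInstanceOf[HasOffsetRanges].offsetRanges.map { o =>
    TopicAndPartition(o.topic, o.partition) -> o.untilOffset
  }.toMap
  // Record this batch's end offsets in Kafka-managed ZK under a chosen group
  kc.setConsumerOffsets("my-monitoring-group", offsets)
}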

On Mon, Jun 1, 2015 at 4:51 PM, Tathagata Das  wrote:

> In the receiver-less "direct" approach, there is no concept of consumer
> group as we don't use the Kafka High Level consumer (that uses ZK). Instead
> Spark Streaming manages offsets on its own, giving tighter guarantees. If
> you want to monitor the progress of the processing of offsets, you will
> have to update ZK yourself. With the code snippet you posted, you can get
> the range of offsets that were processed in each batch, and accordingly
> update Zookeeper using some consumer group name of your choice.
>
> TD
>
> On Mon, Jun 1, 2015 at 2:23 PM, dgoldenberg 
> wrote:
>
>> Hi,
>>
>> What are some of the good/adopted approaches to monitoring Spark Streaming
>> from Kafka?  I see that there are things like
>> http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
>> assume that Receiver-based streaming is used?
>>
>> Then "Note that one disadvantage of this approach (Receiverless Approach,
>> #2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
>> Kafka monitoring tools will not show progress. However, you can access the
>> offsets processed by this approach in each batch and update Zookeeper
>> yourself".
>>
>> The code sample, however, seems sparse. What do you need to do here? -
>>  directKafkaStream.foreachRDD(
>>    new Function<JavaPairRDD<String, String>, Void>() {
>>      @Override
>>      public Void call(JavaPairRDD<String, String> rdd) throws IOException {
>>        OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd).offsetRanges
>>        // offsetRanges.length = # of Kafka partitions being consumed
>>        ...
>>        return null;
>>      }
>>    }
>>  );
>>
>> and if these are updated, will KafkaOffsetMonitor work?
>>
>> Monitoring seems to center around the notion of a consumer group.  But in
>> the receiverless approach, code on the Spark consumer side doesn't seem to
>> expose a consumer group parameter.  Where does it go?  Can I/should I just
>> pass in group.id as part of the kafkaParams HashMap?
>>
>> Thanks
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: How to monitor Spark Streaming from Kafka?

2015-06-01 Thread Tathagata Das
In the receiver-less "direct" approach, there is no concept of consumer
group as we don't use the Kafka High Level consumer (that uses ZK). Instead
Spark Streaming manages offsets on its own, giving tighter guarantees. If
you want to monitor the progress of the processing of offsets, you will
have to update ZK yourself. With the code snippet you posted, you can get
the range of offsets that were processed in each batch, and accordingly
update Zookeeper using some consumer group name of your choice.
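
In Scala terms, that snippet boils down to something like this (directKafkaStream is the
stream from the original question; logging the ranges or pushing them to ZK is up to you):

import org.apache.spark.streaming.kafka.{HasOffsetRanges, OffsetRange}

directKafkaStream.foreachRDD { rdd =>
  // One OffsetRange per Kafka partition consumed in this batch
  val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  ranges.foreach { r =>
    println(s"${r.topic}-${r.partition}: ${r.fromOffset} -> ${r.untilOffset}")
  }
}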

TD

On Mon, Jun 1, 2015 at 2:23 PM, dgoldenberg 
wrote:

> Hi,
>
> What are some of the good/adopted approaches to monitoring Spark Streaming
> from Kafka?  I see that there are things like
> http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
> assume that Receiver-based streaming is used?
>
> Then "Note that one disadvantage of this approach (Receiverless Approach,
> #2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
> Kafka monitoring tools will not show progress. However, you can access the
> offsets processed by this approach in each batch and update Zookeeper
> yourself".
>
> The code sample, however, seems sparse. What do you need to do here? -
>  directKafkaStream.foreachRDD(
>    new Function<JavaPairRDD<String, String>, Void>() {
>      @Override
>      public Void call(JavaPairRDD<String, String> rdd) throws IOException {
>        OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd).offsetRanges
>        // offsetRanges.length = # of Kafka partitions being consumed
>        ...
>        return null;
>      }
>    }
>  );
>
> and if these are updated, will KafkaOffsetMonitor work?
>
> Monitoring seems to center around the notion of a consumer group.  But in
> the receiverless approach, code on the Spark consumer side doesn't seem to
> expose a consumer group parameter.  Where does it go?  Can I/should I just
> pass in group.id as part of the kafkaParams HashMap?
>
> Thanks
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to monitor Spark Streaming from Kafka?

2015-06-01 Thread dgoldenberg
Hi,

What are some of the good/adopted approaches to monitoring Spark Streaming
from Kafka?  I see that there are things like
http://quantifind.github.io/KafkaOffsetMonitor, for example.  Do they all
assume that Receiver-based streaming is used?

Then "Note that one disadvantage of this approach (Receiverless Approach,
#2) is that it does not update offsets in Zookeeper, hence Zookeeper-based
Kafka monitoring tools will not show progress. However, you can access the
offsets processed by this approach in each batch and update Zookeeper
yourself".

The code sample, however, seems sparse. What do you need to do here? -
 directKafkaStream.foreachRDD(
   new Function<JavaPairRDD<String, String>, Void>() {
     @Override
     public Void call(JavaPairRDD<String, String> rdd) throws IOException {
       OffsetRange[] offsetRanges = ((HasOffsetRanges)rdd).offsetRanges
       // offsetRanges.length = # of Kafka partitions being consumed
       ...
       return null;
     }
   }
 );

and if these are updated, will KafkaOffsetMonitor work?

Monitoring seems to center around the notion of a consumer group.  But in
the receiverless approach, code on the Spark consumer side doesn't seem to
expose a consumer group parameter.  Where does it go?  Can I/should I just
pass in group.id as part of the kafkaParams HashMap?

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-monitor-Spark-Streaming-from-Kafka-tp23103.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Dmitry Goldenberg
Got it, thank you, Tathagata and Ted.

Could you comment on my other question
<http://apache-spark-user-list.1001560.n3.nabble.com/Autoscaling-Spark-cluster-based-on-topic-sizes-rate-of-growth-in-Kafka-or-Spark-s-metrics-tt23062.html>
as well?  Basically, I'm trying to get a handle on a good approach to
throttling, on the one hand, and autoscaling the cluster, on the
other.  Are there any recommended approaches or design patterns for
autoscaling that you have implemented or could point me at? Thanks!

On Wed, May 27, 2015 at 8:08 PM, Tathagata Das  wrote:

> You can throttle the no receiver direct Kafka stream using
> spark.streaming.kafka.maxRatePerPartition
> <http://spark.apache.org/docs/latest/configuration.html#spark-streaming>
>
>
> On Wed, May 27, 2015 at 4:34 PM, Ted Yu  wrote:
>
>> Have you seen
>> http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application
>> ?
>>
>> Cheers
>>
>> On Wed, May 27, 2015 at 4:11 PM, dgoldenberg 
>> wrote:
>>
>>> Hi,
>>>
>>> With the no receivers approach to streaming from Kafka, is there a way to
>>> set something like spark.streaming.receiver.maxRate so as not to
>>> overwhelm
>>> the Spark consumers?
>>>
>>> What would be some of the ways to throttle the streamed messages so that
>>> the
>>> consumers don't run out of memory?
>>>
>>>
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-from-Kafka-no-receivers-and-spark-streaming-receiver-maxRate-tp23061.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Tathagata Das
You can throttle the no receiver direct Kafka stream using
spark.streaming.kafka.maxRatePerPartition
<http://spark.apache.org/docs/latest/configuration.html#spark-streaming>
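
A quick way to reason about the resulting cap (all numbers below are illustrative):

import org.apache.spark.SparkConf

// max records per batch = maxRatePerPartition * number of partitions * batch interval (s)
// e.g. 2,000 rec/s/partition * 32 partitions * 10 s = 640,000 records per batch
val conf = new SparkConf()
  .set("spark.streaming.kafka.maxRatePerPartition", "2000")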


On Wed, May 27, 2015 at 4:34 PM, Ted Yu  wrote:

> Have you seen
> http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application
> ?
>
> Cheers
>
> On Wed, May 27, 2015 at 4:11 PM, dgoldenberg 
> wrote:
>
>> Hi,
>>
>> With the no receivers approach to streaming from Kafka, is there a way to
>> set something like spark.streaming.receiver.maxRate so as not to overwhelm
>> the Spark consumers?
>>
>> What would be some of the ways to throttle the streamed messages so that
>> the
>> consumers don't run out of memory?
>>
>>
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-from-Kafka-no-receivers-and-spark-streaming-receiver-maxRate-tp23061.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread Ted Yu
Have you seen
http://stackoverflow.com/questions/29051579/pausing-throttling-spark-spark-streaming-application
?

Cheers

On Wed, May 27, 2015 at 4:11 PM, dgoldenberg 
wrote:

> Hi,
>
> With the no receivers approach to streaming from Kafka, is there a way to
> set something like spark.streaming.receiver.maxRate so as not to overwhelm
> the Spark consumers?
>
> What would be some of the ways to throttle the streamed messages so that
> the
> consumers don't run out of memory?
>
>
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-from-Kafka-no-receivers-and-spark-streaming-receiver-maxRate-tp23061.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Spark Streaming from Kafka - no receivers and spark.streaming.receiver.maxRate?

2015-05-27 Thread dgoldenberg
Hi,

With the no receivers approach to streaming from Kafka, is there a way to
set something like spark.streaming.receiver.maxRate so as not to overwhelm
the Spark consumers?

What would be some of the ways to throttle the streamed messages so that the
consumers don't run out of memory?





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-from-Kafka-no-receivers-and-spark-streaming-receiver-maxRate-tp23061.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Andrew Psaltis
Mohit,

>I want to process the data in real-time as well as store the data in hdfs
in year/month/day/hour/ format.
Are you wanting to process it and then put it into HDFS or just put the raw
data into HDFS? If the later then why not just use Camus (
https://github.com/linkedin/camus), it will easily put the data into the
directory structure you are after.

On Fri, Feb 6, 2015 at 12:19 AM, Mohit Durgapal 
wrote:

> I want to write a spark streaming consumer for kafka in java. I want to
> process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate kafka consumers, one for writing data to HDFS and one for
> spark streaming?
>
> Also I would like to ask what do people generally do with the result of
> spark streams after aggregating over it? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval or is it generally stored in hdfs?
>
> Is it possible to store the mini batch data from spark streaming to HDFS
> in a way that the data is aggregated  hourly and put into HDFS in its
> "hour" folder. I would not want a lot of small files equal to the mini
> batches of spark per hour, that would be inefficient for running hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
>
> Regards
> Mohit
>


Re: spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Charles Feduke
Good questions, some of which I'd like to know the answer to.

>>  Is it okay to update a NoSQL DB with aggregated counts per batch
interval or is it generally stored in hdfs?

This depends on how you are going to use the aggregate data.

1. Is there a lot of data? If so, and you are going to use the data as
inputs to another job, it might benefit from being distributed across the
cluster on HDFS (for data locality).
2. Usually when speaking about aggregates there is substantially less
data, in which case storing that data in another datastore is okay. If
you're talking about a few thousand rows, and having them in something like
Mongo or Postgres makes your life easier (reporting software, for example)
- even if you use them as inputs to another job - it's okay to just store
the results in another data store. If the data will grow unbounded over
time this might not be a good solution (in which case refer to #1); see the
sketch below for the usual per-batch write pattern.
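
A minimal sketch of that per-batch write, assuming an aggregated DStream of (key, count)
pairs named counts and a hypothetical store client (createDbClient stands in for whatever
Mongo/Postgres/etc. driver you use):

counts.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    val client = createDbClient()   // one connection per partition, not per record
    partition.foreach { case (key, n) => client.upsert(key, n) }
    client.close()
  }
}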



On Fri Feb 06 2015 at 6:16:39 AM Mohit Durgapal 
wrote:

> I want to write a spark streaming consumer for kafka in java. I want to
> process the data in real-time as well as store the data in hdfs in
> year/month/day/hour/ format. I am not sure how to achieve this. Should I
> write separate kafka consumers, one for writing data to HDFS and one for
> spark streaming?
>
> Also I would like to ask what do people generally do with the result of
> spark streams after aggregating over it? Is it okay to update a NoSQL DB
> with aggregated counts per batch interval or is it generally stored in hdfs?
>
> Is it possible to store the mini batch data from spark streaming to HDFS
> in a way that the data is aggregated  hourly and put into HDFS in its
> "hour" folder. I would not want a lot of small files equal to the mini
> batches of spark per hour, that would be inefficient for running hadoop
> jobs later.
>
> Is anyone working on the same problem?
>
> Any help and comments would be great.
>
>
> Regards
>
> Mohit
>


spark streaming from kafka real time + batch processing in java

2015-02-06 Thread Mohit Durgapal
I want to write a spark streaming consumer for kafka in java. I want to
process the data in real-time as well as store the data in hdfs in
year/month/day/hour/ format. I am not sure how to achieve this. Should I
write separate kafka consumers, one for writing data to HDFS and one for
spark streaming?

Also, I would like to ask what people generally do with the results of
Spark streams after aggregating over them. Is it okay to update a NoSQL DB
with aggregated counts per batch interval, or are they generally stored in HDFS?

Is it possible to store the mini-batch data from Spark Streaming to HDFS in
a way that the data is aggregated hourly and put into HDFS in its "hour"
folder? I would not want a lot of small files equal to the mini batches of
Spark per hour, as that would be inefficient for running Hadoop jobs later.

Is anyone working on the same problem?

Any help and comments would be great.


Regards

Mohit


spark streaming from kafka real time + batch processing in java

2015-02-05 Thread Mohit Durgapal
I want to write a spark streaming consumer for kafka in java. I want to
process the data in real-time as well as store the data in hdfs in
year/month/day/hour/ format. I am not sure how to achieve this. Should I
write separate kafka consumers, one for writing data to HDFS and one for
spark streaming?

Also, I would like to ask what people generally do with the results of
Spark streams after aggregating over them. Is it okay to update a NoSQL DB
with aggregated counts per batch interval, or are they generally stored in HDFS?

Is it possible to store the mini-batch data from Spark Streaming to HDFS in
a way that the data is aggregated hourly and put into HDFS in its "hour"
folder? I would not want a lot of small files equal to the mini batches of
Spark per hour, as that would be inefficient for running Hadoop jobs later.

Is anyone working on the same problem?

Any help and comments would be great.


Regards
Mohit
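
On the hourly-folder part of the question, a rough sketch of the usual foreachRDD approach
(the path, date format, and the DStream name "lines" are placeholders; compacting the
resulting per-batch files into larger hourly files still needs a separate batch job, as the
question anticipates):

import java.text.SimpleDateFormat
import java.util.Date

val hourFormat = new SimpleDateFormat("yyyy/MM/dd/HH")

lines.foreachRDD { (rdd, time) =>
  // One output directory per batch, grouped under its wall-clock hour
  val hourDir = hourFormat.format(new Date(time.milliseconds))
  rdd.saveAsTextFile(s"hdfs:///data/events/$hourDir/batch-${time.milliseconds}")
}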


Re: Spark Streaming from Kafka

2014-10-29 Thread harold
I'm using kafka_2.10-1.1.0.jar on Spark 1.1.0

—
Sent from Mailbox

On Wed, Oct 29, 2014 at 12:31 AM, null  wrote:

> Thanks! How do I find out which Kafka jar to use for scala 2.10.4?
> —
> Sent from Mailbox
> On Wed, Oct 29, 2014 at 12:26 AM, Akhil Das 
> wrote:
>> Looks like the kafka jar that you are using isn't compatible with your
>> scala version.
>> Thanks
>> Best Regards
>> On Wed, Oct 29, 2014 at 11:50 AM, Harold Nguyen  wrote:
>>> Hi,
>>>
>>> Just wondering if you've seen the following error when reading from Kafka:
>>>
>>> ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting
>>> receiver 0 - java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
>>> at kafka.utils.Log4jController$.<init>(Log4jController.scala:29)
>>> at kafka.utils.Log4jController$.<clinit>(Log4jController.scala)
>>> at kafka.utils.Logging$class.$init$(Logging.scala:29)
>>> at kafka.utils.VerifiableProperties.<init>(VerifiableProperties.scala:24)
>>> at kafka.consumer.ConsumerConfig.<init>(ConsumerConfig.scala:78)
>>> at
>>> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:96)
>>> at
>>> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>>> at
>>> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>>> at
>>> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
>>> at
>>> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
>>> at
>>> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
>>> at
>>> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>>> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>> at
>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>> at java.lang.Thread.run(Thread.java:744)
>>> Caused by: java.lang.ClassNotFoundException: scala.reflect.ClassManifest
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>>> at java.security.AccessController.doPrivileged(Native Method)
>>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>>> ... 18 more
>>>
>>> Thanks,
>>>
>>> Harold
>>>

Re: Spark Streaming from Kafka

2014-10-29 Thread harold
Thanks! How do I find out which Kafka jar to use for scala 2.10.4?

—
Sent from Mailbox

On Wed, Oct 29, 2014 at 12:26 AM, Akhil Das 
wrote:

> Looks like the kafka jar that you are using isn't compatible with your
> scala version.
> Thanks
> Best Regards
> On Wed, Oct 29, 2014 at 11:50 AM, Harold Nguyen  wrote:
>> Hi,
>>
>> Just wondering if you've seen the following error when reading from Kafka:
>>
>> ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting
>> receiver 0 - java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
>> at kafka.utils.Log4jController$.<init>(Log4jController.scala:29)
>> at kafka.utils.Log4jController$.<clinit>(Log4jController.scala)
>> at kafka.utils.Logging$class.$init$(Logging.scala:29)
>> at kafka.utils.VerifiableProperties.<init>(VerifiableProperties.scala:24)
>> at kafka.consumer.ConsumerConfig.<init>(ConsumerConfig.scala:78)
>> at
>> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:96)
>> at
>> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
>> at
>> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
>> at
>> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
>> at
>> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
>> at
>> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
>> at org.apache.spark.scheduler.Task.run(Task.scala:54)
>> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>> at
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>> at
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> at java.lang.Thread.run(Thread.java:744)
>> Caused by: java.lang.ClassNotFoundException: scala.reflect.ClassManifest
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
>> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
>> at java.security.AccessController.doPrivileged(Native Method)
>> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
>> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
>> ... 18 more
>>
>> Thanks,
>>
>> Harold
>>

Re: Spark Streaming from Kafka

2014-10-29 Thread Akhil Das
Looks like the kafka jar that you are using isn't compatible with your
scala version.
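
One common way to avoid the mismatch is to depend on the Spark Kafka integration artifact
for your Scala version and let it pull in a compatible Kafka client, rather than adding a
Kafka jar by hand. A build.sbt sketch (versions are illustrative for a Scala 2.10 /
Spark 1.1 setup):

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-streaming"       % "1.1.0" % "provided",
  "org.apache.spark" %% "spark-streaming-kafka" % "1.1.0"
)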

Thanks
Best Regards

On Wed, Oct 29, 2014 at 11:50 AM, Harold Nguyen  wrote:

> Hi,
>
> Just wondering if you've seen the following error when reading from Kafka:
>
> ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting
> receiver 0 - java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
> at kafka.utils.Log4jController$.<init>(Log4jController.scala:29)
> at kafka.utils.Log4jController$.<clinit>(Log4jController.scala)
> at kafka.utils.Logging$class.$init$(Logging.scala:29)
> at kafka.utils.VerifiableProperties.<init>(VerifiableProperties.scala:24)
> at kafka.consumer.ConsumerConfig.<init>(ConsumerConfig.scala:78)
> at
> org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:96)
> at
> org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
> at
> org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
> at
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
> at
> org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
> at
> org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
> at org.apache.spark.scheduler.Task.run(Task.scala:54)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> at java.lang.Thread.run(Thread.java:744)
> Caused by: java.lang.ClassNotFoundException: scala.reflect.ClassManifest
> at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> ... 18 more
>
> Thanks,
>
> Harold
>


Spark Streaming from Kafka

2014-10-28 Thread Harold Nguyen
Hi,

Just wondering if you've seen the following error when reading from Kafka:

ERROR ReceiverTracker: Deregistered receiver for stream 0: Error starting
receiver 0 - java.lang.NoClassDefFoundError: scala/reflect/ClassManifest
at kafka.utils.Log4jController$.<init>(Log4jController.scala:29)
at kafka.utils.Log4jController$.<clinit>(Log4jController.scala)
at kafka.utils.Logging$class.$init$(Logging.scala:29)
at kafka.utils.VerifiableProperties.<init>(VerifiableProperties.scala:24)
at kafka.consumer.ConsumerConfig.<init>(ConsumerConfig.scala:78)
at
org.apache.spark.streaming.kafka.KafkaReceiver.onStart(KafkaInputDStream.scala:96)
at
org.apache.spark.streaming.receiver.ReceiverSupervisor.startReceiver(ReceiverSupervisor.scala:121)
at
org.apache.spark.streaming.receiver.ReceiverSupervisor.start(ReceiverSupervisor.scala:106)
at
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:264)
at
org.apache.spark.streaming.scheduler.ReceiverTracker$ReceiverLauncher$$anonfun$9.apply(ReceiverTracker.scala:257)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at
org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1121)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.lang.ClassNotFoundException: scala.reflect.ClassManifest
at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
... 18 more

Thanks,

Harold