Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Dibyendu Bhattacharya
Hi,

Just to let you know, I have made some enhancements to the Low Level Reliable
Receiver-based Kafka Consumer
(http://spark-packages.org/package/dibbhatt/kafka-spark-consumer).

The earlier version used as many Receiver tasks as there are partitions in
your Kafka topic. Now you can configure the desired number of Receiver
tasks, and every Receiver can handle a subset of the topic partitions.

There were some use cases where the consumer needed to handle gigantic
topics (having 100+ partitions), and my receiver created that many Receiver
tasks, so that many CPU cores were needed just for the Receivers. That was
an issue.


In the latest code, I have changed that behavior. The maximum number of
Receivers is still your number of partitions, but if you specify fewer
Receiver tasks, every receiver will handle a subset of partitions and
consume them using the Kafka Low Level consumer API.
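The idea of spreading a topic's partitions over a smaller, fixed number of receiver tasks can be sketched like this (a toy illustration in Python, not the package's actual code, which is on the JVM):

```python
# Toy sketch: distribute M topic partitions round-robin over N receiver
# tasks, so each receiver consumes a subset instead of one receiver per
# partition. Function and names here are illustrative assumptions.
def assign_partitions(num_partitions, num_receivers):
    """Return {receiver_id: [partition, ...]} with a round-robin split."""
    assignment = {r: [] for r in range(num_receivers)}
    for p in range(num_partitions):
        assignment[p % num_receivers].append(p)
    return assignment

# A 100-partition topic served by only 10 receiver tasks: each receiver
# handles 10 partitions, so only 10 cores are tied up instead of 100.
print(assign_partitions(100, 10)[0])  # [0, 10, 20, ..., 90]
```

With the old one-receiver-per-partition behavior this table would have 100 entries of one partition each; capping the receiver count is what frees the extra cores.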

Every receiver will manage its partitions' offsets in ZK in the usual way.
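For reference, the conventional ZooKeeper layout for Kafka consumer offsets places one znode per (group, topic, partition); a small sketch (the package may use its own path scheme, so treat this as the standard convention rather than this consumer's exact layout):

```python
# Standard Kafka consumer offset znode path in ZooKeeper:
# /consumers/<group>/offsets/<topic>/<partition>
# The znode's value holds the last committed offset for that partition.
def zk_offset_path(group, topic, partition):
    return "/consumers/%s/offsets/%s/%d" % (group, topic, partition)

print(zk_offset_path("my-group", "clickstream", 3))
# /consumers/my-group/offsets/clickstream/3
```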


You can see the latest consumer here:
http://spark-packages.org/package/dibbhatt/kafka-spark-consumer



Regards,
Dibyendu


Re: Latest enhancement in Low Level Receiver based Kafka Consumer

2015-04-01 Thread Neelesh
Hi Dibyendu,
   Thanks for your work on this project. Spark 1.3 now has direct Kafka
streams, but it still does not provide enough control over partitions and
topics. For example, the streams are fairly statically configured -
RDD.getPartitions() is computed only once, which makes it difficult to use
in a SaaS environment where topics are created and deactivated on the fly
(one topic per customer, for example). But it's easy to build a wrapper
around your receivers.
Maybe there is a play where one can combine direct streams with your
receivers, but I don't yet fully understand how the 1.3 direct streams
work.
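The "computed only once" limitation can be made concrete with a toy model (plain Python, an assumption-laden illustration rather than Spark's actual DStream code):

```python
# Toy model of a statically configured direct stream: the set of
# topic-partitions is snapshotted once at construction time, analogous
# to RDD.getPartitions() being computed only once. Topics created on
# the broker afterwards never show up in the stream.
class StaticDirectStream:
    def __init__(self, broker_topics):
        # One-time snapshot; later broker-side changes are invisible.
        self.topics = list(broker_topics)

broker_topics = ["customer-a", "customer-b"]
stream = StaticDirectStream(broker_topics)
broker_topics.append("customer-c")  # new tenant topic created on the fly
print(stream.topics)  # ['customer-a', 'customer-b'] -- misses customer-c
```

A per-customer-topic SaaS setup would need the snapshot refreshed (e.g. by recreating the stream or wrapping the receivers), which is exactly the gap described above.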

On another thread - Kafka 0.8.2 supports non-ZK offset management, which I
think is more scalable than bombarding ZK. I'm working on supporting the
new offset management strategy for Kafka in kafka-spark-consumer.
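Conceptually, Kafka 0.8.2's broker-side offset management keeps committed offsets keyed by (group, topic, partition) in the internal __consumer_offsets topic instead of ZooKeeper znodes. A toy in-memory model of that contract (an illustration, not Kafka's implementation):

```python
# Toy model of broker-side offset management: commits and fetches are
# keyed by (group, topic, partition), like entries in Kafka's internal
# __consumer_offsets topic. Class and method names are assumptions made
# for illustration.
class OffsetManager:
    def __init__(self):
        self._offsets = {}

    def commit(self, group, topic, partition, offset):
        self._offsets[(group, topic, partition)] = offset

    def fetch(self, group, topic, partition):
        # -1 mirrors Kafka's "no offset committed yet" response.
        return self._offsets.get((group, topic, partition), -1)

mgr = OffsetManager()
mgr.commit("my-group", "clickstream", 0, 42)
print(mgr.fetch("my-group", "clickstream", 0))  # 42
print(mgr.fetch("my-group", "clickstream", 1))  # -1 (never committed)
```

Because commits go to the brokers rather than to ZooKeeper, high-frequency offset commits no longer hammer ZK, which is the scalability point made above.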

Thanks!
-neelesh

On Wed, Apr 1, 2015 at 9:49 AM, Dibyendu Bhattacharya 
dibyendu.bhattach...@gmail.com wrote:
