Thanks Chris and Bharat for your inputs. I agree, running multiple
receivers/DStreams is desirable for scalability and fault tolerance, and
this is easily doable. In the present KafkaReceiver I create one thread
per Kafka topic partition, but I can definitely create a separate
KafkaReceiver for every partition. As Chris mentioned, in that case I then
need to take the union of the DStreams from all these receivers. I will
try this out and let you know.
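
Something along these lines, as a rough sketch (shown with the stock
KafkaUtils receiver just to illustrate the union pattern; my receiver
would be created the same way, one instance per partition, and the
quorum, group, topic and partition count below are placeholders):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val conf = new SparkConf().setAppName("LowLevelKafkaConsumer")
val ssc = new StreamingContext(conf, Seconds(10))

val zkQuorum = "zkhost:2181"   // assumed ZooKeeper quorum
val numPartitions = 8          // assumed partition count for the topic

// One receiver (and hence one DStream) per partition; Spark schedules
// each receiver on an executor, so reads are spread across the cluster.
val streams = (1 to numPartitions).map { _ =>
  KafkaUtils.createStream(ssc, zkQuorum, "my-consumer-group", Map("my-topic" -> 1))
}

// Union into a single DStream for downstream processing.
val unified = ssc.union(streams)
unified.count().print()

ssc.start()
ssc.awaitTermination()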

Dib


On Wed, Aug 27, 2014 at 9:10 AM, Chris Fregly <ch...@fregly.com> wrote:

> great work, Dibyendu.  looks like this would be a popular contribution.
>
> expanding on bharat's question a bit:
>
> what happens if you submit multiple receivers to the cluster by creating
> and unioning multiple DStreams as in the kinesis example here:
>
>
> https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L123
>
> for more context, the kinesis implementation above uses the Kinesis Client
> Library (KCL) to automatically assign - and load balance - stream shards
> among all KCL threads from all receivers (potentially coming and going as
> nodes die) on all executors/nodes using DynamoDB as the association data
> store.
>
> ZooKeeper would be used for your Kafka consumers, of course, with
> ZooKeeper watches to handle the ephemeral nodes.  and I see you're using
> Curator, which makes things easier.
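>
> for instance, a Curator PathChildrenCache could watch the znode path
> where live consumers register ephemeral nodes - just a sketch of what i
> mean, the path below is made up:
>
> import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
> import org.apache.curator.framework.recipes.cache.{PathChildrenCache, PathChildrenCacheEvent, PathChildrenCacheListener}
> import org.apache.curator.retry.ExponentialBackoffRetry
>
> val client = CuratorFrameworkFactory.newClient("zkhost:2181", new ExponentialBackoffRetry(1000, 3))
> client.start()
>
> // watch the (made-up) path where each live consumer holds an ephemeral node
> val cache = new PathChildrenCache(client, "/consumers/my-group/ids", true)
> cache.getListenable.addListener(new PathChildrenCacheListener {
>   override def childEvent(c: CuratorFramework, event: PathChildrenCacheEvent): Unit =
>     event.getType match {
>       case PathChildrenCacheEvent.Type.CHILD_ADDED |
>            PathChildrenCacheEvent.Type.CHILD_REMOVED =>
>         // a consumer joined or died -> rebalance partition assignments here
>         println(s"membership changed: ${event.getType}")
>       case _ => // ignore connection/state events
>     }
> })
> cache.start()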
>
> as bharat suggested, running multiple receivers/dstreams may be desirable
> from a scalability and fault tolerance standpoint.  is this type of load
> balancing possible among your different Kafka consumers running in
> different ephemeral JVMs?
>
> and isn't it fun proposing a popular piece of code?  the question
> floodgates have opened!  haha. :)
>
> -chris
>
>
>
> On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Hi Bharat,
>>
>> Thanks for your email. If the "Kafka Reader" worker process dies, it will
>> be replaced by a different machine, and it will resume consuming from the
>> offset where it left off (for each partition). The same would happen even
>> if I had an individual Receiver for every partition.
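>>
>> Roughly, the consumer commits the last processed offset for each
>> partition to ZooKeeper through Curator, and a replacement receiver reads
>> it back on startup. A simplified sketch (the path layout here is
>> illustrative, not the exact one the consumer uses):
>>
>> import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
>> import org.apache.curator.retry.ExponentialBackoffRetry
>>
>> val client: CuratorFramework =
>>   CuratorFrameworkFactory.newClient("zkhost:2181", new ExponentialBackoffRetry(1000, 3))
>> client.start()
>>
>> // illustrative layout: /consumers/<group>/offsets/<topic>/<partition>
>> def offsetPath(group: String, topic: String, partition: Int) =
>>   s"/consumers/$group/offsets/$topic/$partition"
>>
>> // commit the last processed offset once a block of messages is handled
>> def commitOffset(group: String, topic: String, partition: Int, offset: Long): Unit = {
>>   val path = offsetPath(group, topic, partition)
>>   val bytes = offset.toString.getBytes("UTF-8")
>>   if (client.checkExists().forPath(path) == null)
>>     client.create().creatingParentsIfNeeded().forPath(path, bytes)
>>   else
>>     client.setData().forPath(path, bytes)
>> }
>>
>> // on (re)start the receiver resumes from the committed offset, if any
>> def readOffset(group: String, topic: String, partition: Int): Long = {
>>   val path = offsetPath(group, topic, partition)
>>   if (client.checkExists().forPath(path) == null) 0L // no commit yet: start at 0
>>   else new String(client.getData().forPath(path), "UTF-8").toLong
>> }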
>>
>> Regards,
>> Dibyendu
>>
>>
>> On Tue, Aug 26, 2014 at 5:43 AM, bharatvenkat <bvenkat.sp...@gmail.com>
>> wrote:
>>
>>> I like this consumer for what it promises - better control over offsets
>>> and recovery from failures.  If I understand this right, it still uses a
>>> single worker process to read from Kafka (one thread per partition) - is
>>> there a way to specify multiple worker processes (on different machines)
>>> to read from Kafka?  Maybe one worker process for each partition?
>>>
>>> If there is no such option, what happens when the single machine hosting
>>> the "Kafka Reader" worker process dies and is replaced by a different
>>> machine (as can happen in the cloud)?
>>>
>>> Thanks,
>>> Bharat
>>>
>>>
>>>
>>
>
