Thanks Chris and Bharat for your inputs. I agree that running multiple receivers/DStreams is desirable for scalability and fault tolerance, and this is easily doable. In the present KafkaReceiver I create one thread per Kafka topic partition, but I can certainly create a separate KafkaReceiver for every partition. As Chris mentioned, in that case I would then need to take the union of the DStreams from all these receivers. I will try this out and let you know.
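
A minimal sketch of what that could look like, modeled on the Kinesis example Chris linked below. Note that `KafkaReceiver` and its parameters here are hypothetical stand-ins for this consumer's actual API; only `receiverStream` and `StreamingContext.union` are standard Spark Streaming calls:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.dstream.DStream

// Sketch only: start one receiver per Kafka partition and union the
// resulting DStreams into a single stream, as in the Kinesis example.
object MultiReceiverSketch {
  def main(args: Array[String]): Unit = {
    val numPartitions = 4  // hypothetical: partition count of the topic
    val ssc = new StreamingContext("local[8]", "multi-receiver-kafka", Seconds(2))

    // One DStream per partition. KafkaReceiver is a placeholder for the
    // custom receiver class, parameterized by the partition it reads.
    val streams: Seq[DStream[Array[Byte]]] = (0 until numPartitions).map { p =>
      ssc.receiverStream(new KafkaReceiver("events", p,
        StorageLevel.MEMORY_AND_DISK_SER))
    }

    // Union into a single DStream before further processing.
    val unified: DStream[Array[Byte]] = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

With one receiver per partition, the loss of a single node takes down only that partition's receiver rather than the whole consumer.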
Dib

On Wed, Aug 27, 2014 at 9:10 AM, Chris Fregly <ch...@fregly.com> wrote:
> great work, Dibyendu. looks like this would be a popular contribution.
>
> expanding on bharat's question a bit:
>
> what happens if you submit multiple receivers to the cluster by creating
> and unioning multiple DStreams as in the kinesis example here:
>
> https://github.com/apache/spark/blob/ae58aea2d1435b5bb011e68127e1bcddc2edf5b2/extras/kinesis-asl/src/main/scala/org/apache/spark/examples/streaming/KinesisWordCountASL.scala#L123
>
> for more context, the kinesis implementation above uses the Kinesis Client
> Library (KCL) to automatically assign - and load balance - stream shards
> among all KCL threads from all receivers (potentially coming and going as
> nodes die) on all executors/nodes, using DynamoDB as the association data
> store.
>
> ZooKeeper would be used for your Kafka consumers, of course. and
> ZooKeeper watches to handle the ephemeral nodes. and I see you're using
> Curator, which makes things easier.
>
> as bharat suggested, running multiple receivers/dstreams may be desirable
> from a scalability and fault tolerance standpoint. is this type of load
> balancing possible among your different Kafka consumers running in
> different ephemeral JVMs?
>
> and isn't it fun proposing a popular piece of code? the question
> floodgates have opened! haha. :)
>
> -chris
>
> On Tue, Aug 26, 2014 at 7:29 AM, Dibyendu Bhattacharya <
> dibyendu.bhattach...@gmail.com> wrote:
>
>> Hi Bharat,
>>
>> Thanks for your email. If the "Kafka Reader" worker process dies, it will
>> be replaced by a different machine, and it will start consuming from the
>> offset where it left off (for each partition). The same would happen even
>> if I had an individual receiver for every partition.
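
The recovery behavior described above depends on per-partition offsets being committed to ZooKeeper, which the replacement process reads back on startup. A rough sketch of how such an offset store might look with Curator; the znode layout, class, and method names here are illustrative, not this consumer's actual code:

```scala
import org.apache.curator.framework.{CuratorFramework, CuratorFrameworkFactory}
import org.apache.curator.retry.ExponentialBackoffRetry

// Illustrative offset store: one znode per topic partition. On restart
// (on the same or a replacement machine) a receiver calls lastCommitted
// and resumes consuming from that offset.
class ZkOffsetStore(zkConnect: String, basePath: String) {
  private val client: CuratorFramework =
    CuratorFrameworkFactory.newClient(zkConnect, new ExponentialBackoffRetry(1000, 3))
  client.start()

  private def path(topic: String, partition: Int): String =
    s"$basePath/$topic/$partition"

  // Record the last processed offset for a partition.
  def commit(topic: String, partition: Int, offset: Long): Unit = {
    val p = path(topic, partition)
    val data = offset.toString.getBytes("UTF-8")
    if (client.checkExists().forPath(p) == null)
      client.create().creatingParentsIfNeeded().forPath(p, data)
    else
      client.setData().forPath(p, data)
  }

  // Return the committed offset, or the given default for a fresh partition.
  def lastCommitted(topic: String, partition: Int, default: Long = 0L): Long = {
    val p = path(topic, partition)
    if (client.checkExists().forPath(p) == null) default
    else new String(client.getData().forPath(p), "UTF-8").toLong
  }
}
```

Because the offsets live in ZooKeeper rather than on the failed machine, any replacement host can pick up exactly where the dead one stopped.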
>>
>> Regards,
>> Dibyendu
>>
>>
>> On Tue, Aug 26, 2014 at 5:43 AM, bharatvenkat <bvenkat.sp...@gmail.com>
>> wrote:
>>
>>> I like this consumer for what it promises - better control over offsets
>>> and recovery from failures. If I understand this right, it still uses a
>>> single worker process to read from Kafka (one thread per partition) - is
>>> there a way to specify multiple worker processes (on different machines)
>>> to read from Kafka? Maybe one worker process for each partition?
>>>
>>> If there is no such option, what happens when the single machine hosting
>>> the "Kafka Reader" worker process dies and is replaced by a different
>>> machine (like in the cloud)?
>>>
>>> Thanks,
>>> Bharat
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Low-Level-Kafka-Consumer-for-Spark-tp11258p12788.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.