This was what I was thinking but wanted to verify. Thanks Sean!

On Fri, Feb 27, 2015 at 9:56 PM, Sean Owen <so...@cloudera.com> wrote:
> The coarsest level at which you can parallelize is the topic. Topics are
> all but unrelated to each other, so they can be consumed independently. But
> you can also parallelize within a single topic.
>
> A Kafka group ID defines a consumer group. Exactly one consumer in a group
> receives each message published to the topic that group is listening to.
> Topics can have partitions too. You can thus make N consumers in a group
> listening to a topic with N partitions, and each will effectively be
> listening to one partition.
>
> Yes, my understanding is that multiple receivers in one group are the
> way to consume a topic's partitions in parallel.
>
> On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet <cjno...@gmail.com> wrote:
> > Looking @ [1], it seems to recommend pulling from multiple Kafka topics
> > in order to parallelize the data received from Kafka over multiple
> > nodes. I notice in [2], however, that one of the createStream()
> > functions takes a groupId. So am I understanding correctly that creating
> > multiple DStreams with the same groupId allows data from a single topic
> > to be partitioned across many nodes?
> >
> > [1]
> > http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
> > [2]
> > https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$
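For anyone finding this thread later, the pattern Sean describes maps onto the Spark 1.2 receiver-based API roughly as below. This is a minimal sketch, not a tested job: the topic name, group ID, ZooKeeper address, and receiver count are placeholder assumptions, and it assumes the topic has (at least) as many partitions as receivers.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ParallelKafkaReceive {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ParallelKafkaReceive")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder values -- substitute your own topic, group, and quorum.
    val zkQuorum = "zk-host:2181"
    val groupId = "my-group"
    val topicMap = Map("my-topic" -> 1) // topic -> consumer threads per receiver

    // Create N receiver DStreams in the SAME consumer group. Kafka assigns
    // each receiver a disjoint subset of the topic's partitions, so the
    // topic is consumed in parallel across the cluster.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    }

    // Union the per-receiver streams back into one DStream for processing.
    val unified = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that `numReceivers` beyond the topic's partition count buys nothing: the extra receivers in the group simply sit idle, since each partition is consumed by at most one member of a consumer group.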