This was what I was thinking but wanted to verify. Thanks Sean!

On Fri, Feb 27, 2015 at 9:56 PM, Sean Owen <so...@cloudera.com> wrote:
> The coarsest level at which you can parallelize is the topic. Topics are
> all but unrelated to each other, so they can be consumed independently. But
> you can also parallelize within a single topic.
>
> A Kafka group ID defines a consumer group. Exactly one consumer in a group
> receives each message published to the topic that group is listening to.
> Topics can have partitions too. You can thus make N consumers in a group
> listening to a topic with N partitions, and each will effectively be
> listening to one partition.
>
> Yes, my understanding is that multiple receivers in one group are the
> way to consume a topic's partitions in parallel.
>
> On Sat, Feb 28, 2015 at 12:56 AM, Corey Nolet <cjno...@gmail.com> wrote:
> > Looking @ [1], it seems to recommend pulling from multiple Kafka topics
> > in order to parallelize the data received from Kafka over multiple
> > nodes. I notice in [2], however, that one of the createStream()
> > functions takes a groupId. So am I understanding correctly that creating
> > multiple DStreams with the same groupId allows data from a single topic
> > to be partitioned across many nodes?
> >
> > [1]
> > http://spark.apache.org/docs/1.2.0/streaming-programming-guide.html#level-of-parallelism-in-data-receiving
> > [2]
> > https://spark.apache.org/docs/1.2.0/api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$
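For anyone finding this thread later, the pattern Sean describes maps onto the Spark 1.2 receiver-based API roughly as below. This is a minimal sketch, not a tested job: the topic name, group ID, ZooKeeper address, and receiver count are placeholder assumptions, and it assumes the topic has (at least) as many partitions as receivers.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object ParallelKafkaReceive {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ParallelKafkaReceive")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder values -- substitute your own topic, group, and quorum.
    val zkQuorum = "zk-host:2181"
    val groupId = "my-group"
    val topicMap = Map("my-topic" -> 1) // topic -> consumer threads per receiver

    // Create N receiver DStreams in the SAME consumer group. Kafka assigns
    // each receiver a disjoint subset of the topic's partitions, so the
    // topic is consumed in parallel across the cluster.
    val numReceivers = 4
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, zkQuorum, groupId, topicMap)
    }

    // Union the per-receiver streams back into one DStream for processing.
    val unified = ssc.union(streams)
    unified.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Note that `numReceivers` beyond the topic's partition count buys nothing: the extra receivers in the group simply sit idle, since each partition is consumed by at most one member of a consumer group.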