I'm no expert, but as I understand it, yes: you create multiple streams to
consume multiple partitions in parallel. If they're all in the same
Kafka consumer group, you'll get exactly one copy of each message, so
with 10 consumers and 3 Kafka partitions I believe only 3 of the
consumers will actually receive messages.

The parallelism of Spark's processing of the RDDs built from those
messages is a separate matter. There could be, say, 4 partitions in
your RDDs doing the work. That is the kind of thing you can influence
with repartition(): I believe you can get more tasks processing the
messages even if you can only consume from the queue with 3-way
parallelism, since the topic has 3 partitions.
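A minimal sketch of that idea, based on the code quoted below (the ZooKeeper address, group id, and topic name come from that code; normalizeLog and configMap are the poster's own helpers, and ssc is assumed to be an existing StreamingContext):

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// One receiver per Kafka partition: a 3-partition topic supports
// at most 3 active receiver streams in one consumer group.
val kInStreams = (1 to 3).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp",
    Map("rawunstruct" -> 1))
}
val kInMsg = ssc.union(kInStreams)

// Consumption is limited to 3-way parallelism, but the downstream
// processing can fan out to more tasks by repartitioning each batch.
val widened = kInMsg.repartition(12)
val outdata = widened.map { case (_, msg) => normalizeLog(msg, configMap) }
```

The repartition(12) here is only illustrative; the shuffle it introduces has a cost, so whether it helps depends on how expensive the per-message processing is.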

On Aug 30, 2014 12:56 AM, "Tim Smith" <secs...@gmail.com> wrote:
>
> Ok, so I did this:
> val kInStreams = (1 to 10).map { _ =>
>   KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp",
>     Map("rawunstruct" -> 1)) }
> val kInMsg = ssc.union(kInStreams)
> val outdata = kInMsg.map(x=>normalizeLog(x._2,configMap))
>
> This has improved parallelism. Earlier I would only get a "Stream 0". Now I 
> have "Streams [0-9]". Of course, since the kafka topic has only three 
> partitions, only three of those streams are active, but I am seeing more 
> blocks being pulled across the three streams in total than what one was 
> doing earlier. Also, four nodes are actively processing tasks (vs only two 
> earlier) now, which actually has me confused. If "Streams" are active on 
> only 3 nodes, then how/why did a 4th node get work? And if a 4th got work, 
> why aren't more nodes getting work?
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
