I'm no expert, but as I understand it: yes, you create multiple streams to consume multiple partitions in parallel. If the consumers are all in the same Kafka consumer group, each message is delivered to exactly one of them, and each partition is assigned to at most one consumer. So with 10 consumers and 3 Kafka partitions, I believe only 3 of the consumers will actually receive messages.
The parallelism of Spark's processing of the RDDs built from those messages is a separate matter. There could be, say, 4 partitions in your RDDs doing the work, and this is the kind of thing you can influence with repartition. That is, I believe you can get more tasks processing the messages even though you can only consume from the queue with 3-way parallelism, since the topic has 3 partitions.

On Aug 30, 2014 12:56 AM, "Tim Smith" <secs...@gmail.com> wrote:
>
> Ok, so I did this:
>
> val kInStreams = (1 to 10).map { _ =>
>   KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp", Map("rawunstruct" -> 1)) }
> val kInMsg = ssc.union(kInStreams)
> val outdata = kInMsg.map(x => normalizeLog(x._2, configMap))
>
> This has improved parallelism. Earlier I would only get a "Stream 0". Now I
> have "Streams [0-9]". Of course, since the Kafka topic has only three
> partitions, only three of those streams are active, but I am seeing more
> blocks being pulled across the three streams total than what one was doing
> earlier. Also, four nodes are actively processing tasks (vs. only two earlier)
> now, which actually has me confused. If "Streams" are active only on 3 nodes,
> then how/why did a 4th node get work? And if a 4th got work, why aren't more
> nodes getting work?
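To make the distinction concrete, here is a minimal sketch of how the unioned stream could be repartitioned so that downstream processing uses more tasks than there are Kafka partitions. It reuses the names from Tim's snippet (`ssc`, `normalizeLog`, `configMap` are assumed to exist as in his code); the partition count of 12 is purely illustrative, not a recommendation:

```scala
import org.apache.spark.streaming.kafka.KafkaUtils

// Ten receiver streams; with a 3-partition topic, only three of them
// will actually pull data, since each Kafka partition is consumed by
// at most one member of the "testApp" consumer group.
val kInStreams = (1 to 10).map { _ =>
  KafkaUtils.createStream(ssc, "zkhost1:2181/zk_kafka", "testApp", Map("rawunstruct" -> 1))
}

// Union into a single DStream, then repartition so each batch's RDD has
// more partitions than the topic does. This decouples processing
// parallelism (Spark tasks) from consumption parallelism (Kafka partitions).
val kInMsg = ssc.union(kInStreams).repartition(12)

// The map now runs as 12 tasks per batch, which the scheduler can spread
// across more executors than the three nodes hosting active receivers.
val outdata = kInMsg.map(x => normalizeLog(x._2, configMap))
```

The trade-off is that `repartition` triggers a shuffle of the received data, so it only pays off when the per-record processing is expensive enough to outweigh the shuffle cost.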