The RDD partitions are 1:1 with Kafka TopicPartitions, so you can use offset ranges to figure out which topic a given RDD partition is for and proceed accordingly. See the Kafka integration guide in the Spark Streaming docs for more details, or https://github.com/koeninger/kafka-exactly-once
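A minimal sketch of that pattern, assuming the spark-streaming-kafka artifact for Spark 1.4.x (`filtersByTopic` and `destinationFor` are hypothetical lookups standing in for your topic-specific filters and destination-topic mapping):

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, topics)

stream.foreachRDD { rdd =>
  // Cast before any shuffle: the 1:1 partition <-> TopicPartition mapping
  // only holds for the RDD returned directly by the direct stream.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.mapPartitionsWithIndex { (i, iter) =>
    // Partition index i lines up with offsetRanges(i)
    val osr: OffsetRange = offsetRanges(i)
    val topic = osr.topic
    // Hypothetical: apply this topic's filter, tag each surviving
    // message with its destination topic
    iter.filter(filtersByTopic(topic)).map(msg => (destinationFor(topic), msg))
  }.foreachPartition { records =>
    // open a Kafka producer here and send each (destTopic, message) pair
  }
}
```

Because the per-partition topic lookup happens inside `mapPartitionsWithIndex`, you never need separate streams per topic; one direct stream over all topics suffices.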
As far as setting offsets in ZK, there's a private interface in the Spark codebase that would make it a little easier for you to do that. You can see that code for reference, or there's an outstanding ticket for making it public: https://issues.apache.org/jira/browse/SPARK-10963

On Wed, Oct 21, 2015 at 1:50 PM, Dave Ariens <dari...@blackberry.com> wrote:

> Hey folks,
>
> I have a very large number of Kafka topics (many thousands of partitions)
> that I want to consume, filter based on topic-specific filters, then
> produce back to filtered topics in Kafka.
>
> Using the receiver-less approach with Spark 1.4.1 (described here:
> https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md)
> I am able to use either KafkaUtils.createDirectStream or
> KafkaUtils.createRDD, consume from many topics, and filter them with the
> same filters, but I can't seem to wrap my head around how to apply
> topic-specific filters, or to finally produce to topic-specific
> destination topics.
>
> Another point: I will need to checkpoint the metadata after each
> successful batch and set starting offsets per partition back to ZK. I
> expect I can do that on the final RDDs after casting them accordingly,
> but if anyone has any expertise/guidance doing that and is willing to
> share, I'd be pretty grateful.
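Until SPARK-10963 lands, a sketch of writing offsets back to ZK yourself after a successful batch, using the ZkClient/ZkUtils that ship with Kafka 0.8 (the connect string and group id below are assumptions; it writes to the same path layout the high-level consumer uses, so monitoring tools keyed on that group will pick the offsets up):

```scala
import kafka.utils.{ZKStringSerializer, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.streaming.kafka.OffsetRange

// Assumed values -- substitute your own
val zkQuorum = "localhost:2181"
val groupId  = "my-consumer-group"

val zkClient = new ZkClient(zkQuorum, 30000, 30000, ZKStringSerializer)

def commitOffsetsToZk(offsetRanges: Array[OffsetRange]): Unit = {
  offsetRanges.foreach { osr =>
    // Same path the high-level consumer uses:
    //   /consumers/<group>/offsets/<topic>/<partition>
    val path = s"/consumers/$groupId/offsets/${osr.topic}/${osr.partition}"
    // Creates parents if missing, then writes untilOffset (the offset
    // after the last message in this batch) as the next start position
    ZkUtils.updatePersistentPath(zkClient, path, osr.untilOffset.toString)
  }
}
```

Call `commitOffsetsToZk` only after the batch's output has fully succeeded (e.g. at the end of the `foreachRDD` body), so the offsets in ZK never run ahead of the produced data.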