The RDD partitions are 1:1 with Kafka topic/partitions, so you can use the
offset ranges to figure out which topic a given RDD partition is for and
proceed accordingly.  See the Kafka integration guide in the Spark
Streaming docs for more details, or
https://github.com/koeninger/kafka-exactly-once
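
Something like the following is the general shape of it.  Untested sketch;
the per-topic filters, the destination map, the broker addresses, and the
producer setup are all placeholder assumptions you'd swap for your own (in
particular you'd want to pool or lazily cache the producer rather than
build one per partition per batch):

import java.util.Properties
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.{SparkConf, TaskContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}

val sparkConf = new SparkConf().setAppName("filter-and-produce")
val ssc = new StreamingContext(sparkConf, Seconds(5))

// Placeholder topic-specific predicates and destination topics.
val filters: Map[String, String => Boolean] = Map(
  "topicA" -> ((v: String) => v.contains("foo")),
  "topicB" -> ((v: String) => v.startsWith("bar")))
val destinations = Map("topicA" -> "topicA-filtered", "topicB" -> "topicB-filtered")

val kafkaParams = Map("metadata.broker.list" -> "broker1:9092")
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
  ssc, kafkaParams, filters.keySet)

stream.foreachRDD { rdd =>
  // Grab the offset ranges before any transformation breaks the 1:1 mapping.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  rdd.foreachPartition { iter =>
    // The RDD partition index is also the index into offsetRanges.
    val osr: OffsetRange = offsetRanges(TaskContext.get.partitionId)
    val keep = filters.getOrElse(osr.topic, (_: String) => true)

    val props = new Properties()
    props.put("bootstrap.servers", "broker1:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Drop messages that fail this topic's filter, send the rest to its destination.
    iter.filter { case (_, value) => keep(value) }.foreach { case (key, value) =>
      producer.send(new ProducerRecord(destinations(osr.topic), key, value))
    }
    producer.close()
  }
}

ssc.start()
ssc.awaitTermination()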

As far as setting offsets in ZK goes, there's a private interface in the
Spark codebase that would make that a little easier for you.  You can read
that code for reference, and there's an outstanding ticket for making it
public: https://issues.apache.org/jira/browse/SPARK-10963
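
In the meantime, if you'd rather write the offsets yourself than wait on
that ticket (the interface in question is KafkaCluster, see its
setConsumerOffsets method), here's an untested sketch that goes straight
to ZK instead, assuming the Kafka 0.8 high-level consumer's offset layout;
the group id and quorum string are placeholders:

import kafka.utils.{ZKStringSerializer, ZkUtils}
import org.I0Itec.zkclient.ZkClient
import org.apache.spark.streaming.kafka.HasOffsetRanges

val groupId = "my-consumer-group"   // placeholder group id
val zkClient = new ZkClient("zk1:2181", 30000, 30000, ZKStringSerializer)

stream.foreachRDD { rdd =>
  // ... process the batch first; only commit offsets once it succeeds ...

  // offsetRanges is a small local array, so this loop runs on the driver.
  rdd.asInstanceOf[HasOffsetRanges].offsetRanges.foreach { osr =>
    // Standard 0.8 consumer offset path in ZK.
    val path = s"/consumers/$groupId/offsets/${osr.topic}/${osr.partition}"
    ZkUtils.updatePersistentPath(zkClient, path, osr.untilOffset.toString)
  }
}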

On Wed, Oct 21, 2015 at 1:50 PM, Dave Ariens <dari...@blackberry.com> wrote:

> Hey folks,
>
>
>
> I have a very large number of Kafka topics (many thousands of partitions)
> that I want to consume, filter with topic-specific filters, and then
> produce back to filtered topics in Kafka.
>
>
>
> Using the receiver-less approach with Spark 1.4.1 (described here
> <https://github.com/koeninger/kafka-exactly-once/blob/master/blogpost.md>),
> I am able to use either KafkaUtils.createDirectStream or
> KafkaUtils.createRDD, consume from many topics, and filter them all with
> the same filters, but I can't seem to wrap my head around how to apply
> topic-specific filters, or how to produce to topic-specific destination
> topics.
>
>
>
> Another point: I will need to checkpoint the metadata after each
> successful batch and set the starting offsets per partition back in ZK.
> I expect I can do that on the final RDDs after casting them accordingly,
> but if anyone has any expertise/guidance on doing that and is willing to
> share, I'd be pretty grateful.
>
