On top of that you could make the topic part of the key (e.g. keyBy in 
.transform or manually emitting a tuple) and use one of the .xxxByKey operators 
for the processing.

If you have a stable, domain specific list of topics (e.g. 3-5 named topics) 
and the processing is really different, I would also look at filtering by topic 
and saving as different Dstreams in your code.

Either way you need to start with Cody’s tip in order to extract the topic name.

-adrian

From: Cody Koeninger
Date: Thursday, October 1, 2015 at 5:06 PM
To: Udit Mehta
Cc: user
Subject: Re: Kafka Direct Stream

You can get the topic for a given partition from the offset range.  You can 
either filter using that; or just have a single rdd and match on topic when 
doing mapPartitions or foreachPartition (which I think is a better idea)

http://spark.apache.org/docs/latest/streaming-kafka-integration.html#approach-2-direct-approach-no-receivers

On Wed, Sep 30, 2015 at 5:02 PM, Udit Mehta 
<ume...@groupon.com<mailto:ume...@groupon.com>> wrote:
Hi,

I am using spark direct stream to consume from multiple topics in Kafka. I am 
able to consume fine but I am stuck at how to separate the data for each topic 
since I need to process data differently depending on the topic.
I basically want to split the RDD consisting on N topics into N RDD's each 
having 1 topic.

Any help would be appreciated.

Thanks in advance,
Udit

Reply via email to