From experience, I'd recommend using the dstream.foreachRDD method and doing the filtering within that context. Extending TD's example, something like this:
dstream.foreachRDD { rdd =>
  rdd.cache()
  messageTypes.foreach { msgTyp =>
    val selection = rdd.filter(msgTyp.matches(_))
    selection.foreach { ... }
  }
  rdd.unpersist()
}

I would discourage the use of:

MessageType1DStream = MainDStream.filter(message type1)
MessageType2DStream = MainDStream.filter(message type2)
MessageType3DStream = MainDStream.filter(message type3)

because it will be a lot more work on the Spark side. Each DStream will schedule tasks for each partition, resulting in #dstreams x #partitions x #stages tasks instead of the #partitions x #stages of the approach presented above.

-kr, Gerard.

On Thu, Apr 16, 2015 at 10:57 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> And yet another way is to demultiplex at one point, which will yield
> separate DStreams for each message type which you can then process in
> independent DAG pipelines in the following way:
>
> MessageType1DStream = MainDStream.filter(message type1)
>
> MessageType2DStream = MainDStream.filter(message type2)
>
> MessageType3DStream = MainDStream.filter(message type3)
>
> Then proceed with your processing independently on MessageType1DStream,
> MessageType2DStream and MessageType3DStream, i.e. each of them is the
> starting point of a new DAG pipeline running in parallel.
>
> *From:* Tathagata Das [mailto:t...@databricks.com]
> *Sent:* Thursday, April 16, 2015 12:52 AM
> *To:* Jianshi Huang
> *Cc:* user; Shao, Saisai; Huang Jie
> *Subject:* Re: How to do dispatching in Streaming?
>
> It may be worthwhile to architect the computation in a different way.
>
> dstream.foreachRDD { rdd =>
>   rdd.foreach { record =>
>     // do different things for each record based on filters
>   }
> }
>
> TD
>
> On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang <jianshi.hu...@gmail.com>
> wrote:
>
> Hi,
>
> I have a Kafka topic that contains dozens of different types of messages,
> and for each one I'll need to create a DStream.
>
> Currently I have to filter the Kafka stream over and over, which is very
> inefficient.
>
> So what's the best way to do dispatching in Spark Streaming? (one DStream
> -> multiple DStreams)
>
> Thanks,
>
> --
>
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
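
For reference, a minimal, self-contained sketch of the foreachRDD pattern recommended above. The socket text source and the prefix-based type predicates are stand-ins for illustration only (the original question uses Kafka, and the real match logic depends on the message format), so swap in your own stream and predicates:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object DispatchSketch {
  // Hypothetical message-type predicates; in practice these would inspect
  // the payload, e.g. a type tag carried in the Kafka record.
  val messageTypes: Seq[(String, String => Boolean)] = Seq(
    ("type1", _.startsWith("type1|")),
    ("type2", _.startsWith("type2|")),
    ("type3", _.startsWith("type3|"))
  )

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dispatch-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(5))

    // Stand-in for the Kafka stream from the original question.
    val dstream = ssc.socketTextStream("localhost", 9999)

    dstream.foreachRDD { rdd =>
      rdd.cache() // avoid recomputing the batch for every per-type filter
      messageTypes.foreach { case (name, matches) =>
        val selection = rdd.filter(matches)
        // type-specific processing goes here, e.g. write to a sink per type
        selection.foreach(record => println(s"[$name] $record"))
      }
      rdd.unpersist()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

The cache()/unpersist() pair matters here: the same batch RDD is scanned once per message type, and caching it keeps the source data from being recomputed for each filter pass.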