From experience, I'd recommend using the dstream.foreachRDD method and doing
the filtering within that context. Extending TD's example, something like
this:

dstream.foreachRDD { rdd =>
  rdd.cache()
  messageTypes.foreach { msgType =>
    // keep only the records belonging to this message type
    val selection = rdd.filter(msgType.matches(_))
    selection.foreach { ... }
  }
  rdd.unpersist()
}
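
For reference, here is a minimal self-contained sketch of the same pattern.
The MsgType case class, the predicates and the socket source below are
illustrative assumptions, not part of the original example:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// a hypothetical way to model message types as named predicates
case class MsgType(name: String, matches: String => Boolean)

val messageTypes = Seq(
  MsgType("type1", _.startsWith("type1|")),
  MsgType("type2", _.startsWith("type2|"))
)

val conf = new SparkConf().setAppName("dispatch-example")
val ssc = new StreamingContext(conf, Seconds(10))
val dstream = ssc.socketTextStream("localhost", 9999)

dstream.foreachRDD { rdd =>
  rdd.cache()  // read the batch once, reuse it for every message type
  messageTypes.foreach { msgType =>
    val selection = rdd.filter(msgType.matches(_))
    // placeholder side effect per record; replace with your sink
    selection.foreach(record => println(s"${msgType.name}: $record"))
  }
  rdd.unpersist()
}

ssc.start()
ssc.awaitTermination()

The key point is that the per-type filters run against the cached RDD inside
a single foreachRDD, so each batch is pulled from the source only once.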

I would discourage the use of:

MessageType1DStream = MainDStream.filter(message type1)

MessageType2DStream = MainDStream.filter(message type2)

MessageType3DStream = MainDStream.filter(message type3)

Because it creates a lot more work on the Spark side: each DStream will
schedule tasks for each of its partitions, resulting in #dstreams x
#partitions x #stages tasks per batch instead of the #partitions x #stages
of the approach presented above. For example, with 3 message types, 10
partitions and 2 stages, that is 60 tasks per batch rather than 20.
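
Concretely, the discouraged approach would look roughly like this (reusing
the hypothetical messageTypes predicates from the sketch above), with one
independently scheduled DStream per message type:

val typedStreams = messageTypes.map { msgType =>
  msgType.name -> dstream.filter(msgType.matches(_))
}

typedStreams.foreach { case (name, typedStream) =>
  // each filtered DStream schedules its own jobs every batch interval
  typedStream.foreachRDD { rdd =>
    rdd.foreach(record => println(s"$name: $record"))
  }
}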


-kr, Gerard.

On Thu, Apr 16, 2015 at 10:57 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> And yet another way is to demultiplex at one point which will yield
> separate DStreams for each message type which you can then process in
> independent DAG pipelines in the following way:
>
>
>
> MessageType1DStream = MainDStream.filter(message type1)
>
> MessageType2DStream = MainDStream.filter(message type2)
>
> MessageType3DStream = MainDStream.filter(message type3)
>
>
>
> Then proceed with your processing independently on MessageType1DStream,
> MessageType2DStream and MessageType3DStream, i.e. each of them is the
> starting point of a new DAG pipeline running in parallel
>
>
>
> *From:* Tathagata Das [mailto:t...@databricks.com]
> *Sent:* Thursday, April 16, 2015 12:52 AM
> *To:* Jianshi Huang
> *Cc:* user; Shao, Saisai; Huang Jie
> *Subject:* Re: How to do dispatching in Streaming?
>
>
>
> It may be worthwhile to architect the computation in a different way.
>
>
>
> dstream.foreachRDD { rdd =>
>
>    rdd.foreach { record =>
>
>       // do different things for each record based on filters
>
>    }
>
> }
>
>
>
> TD
>
>
>
> On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang <jianshi.hu...@gmail.com>
> wrote:
>
> Hi,
>
>
>
> I have a Kafka topic that contains dozens of different types of messages,
> and for each one I'll need to create a DStream.
>
>
>
> Currently I have to filter the Kafka stream over and over, which is very
> inefficient.
>
>
>
> So what's the best way to do dispatching in Spark Streaming? (one DStream
> -> multiple DStreams)
>
>
>
>
> Thanks,
>
> --
>
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/
>
>
>
