Evo,

In Spark there's a fixed scheduling cost for each task, so more tasks mean a higher total overhead for the same amount of work being done. The number of tasks per batch interval should relate to the CPU resources available for the job, following the same rule of thumb as for core Spark: 2-3 times the number of cores.
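A minimal sketch of that guideline (assuming ssc is the StreamingContext and dstream is the input stream from the examples below; the names and the factor of 3 are only illustrative):

    val totalCores = ssc.sparkContext.defaultParallelism  // typically the total cores available to the job
    val targetPartitions = totalCores * 3                 // upper end of the 2-3x-cores guideline
    // Each partition becomes one task per stage, so this keeps tasks per batch near 3x the cores.
    val repartitioned = dstream.repartition(targetPartitions)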
In that physical model presented before, I think we could consider this scheduling cost as a form of friction.

-kr, Gerard.

On Thu, Apr 16, 2015 at 11:47 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:

> Ooops – what does “more work” mean in a Parallel Programming paradigm, and
> does it always translate into “inefficiency”?
>
> Here are a few laws of physics in this space:
>
> 1. More Work, if done AT THE SAME time AND fully utilizing the
>    cluster resources, is a GOOD thing
> 2. More Work which cannot be done at the same time and has to be
>    processed sequentially is a BAD thing
>
> So the key is whether it is about 1 or 2, and if it is about 1, whether it
> leads to e.g. Higher Throughput and Lower Latency or not
>
> Regards,
> Evo Eftimov
>
> *From:* Gerard Maas [mailto:gerard.m...@gmail.com]
> *Sent:* Thursday, April 16, 2015 10:41 AM
> *To:* Evo Eftimov
> *Cc:* Tathagata Das; Jianshi Huang; user; Shao, Saisai; Huang Jie
> *Subject:* Re: How to do dispatching in Streaming?
>
> From experience, I'd recommend using the dstream.foreachRDD method and
> doing the filtering within that context. Extending the example of TD,
> something like this:
>
> dstream.foreachRDD { rdd =>
>   rdd.cache()
>   messageType.foreach { msgTyp =>
>     val selection = rdd.filter(msgTyp.matches(_))
>     selection.foreach { ... }
>   }
>   rdd.unpersist()
> }
>
> I would discourage the use of:
>
> MessageType1DStream = MainDStream.filter(message type1)
> MessageType2DStream = MainDStream.filter(message type2)
> MessageType3DStream = MainDStream.filter(message type3)
>
> because it will be a lot more work to process on the Spark side.
> Each DStream will schedule tasks for each partition, resulting in #dstreams
> x #partitions x #stages tasks instead of the #partitions x #stages with the
> approach presented above.
>
> -kr, Gerard.
>
> On Thu, Apr 16, 2015 at 10:57 AM, Evo Eftimov <evo.efti...@isecc.com> wrote:
>
> And yet another way is to demultiplex at one point, which will yield
> separate DStreams for each message type, which you can then process in
> independent DAG pipelines in the following way:
>
> MessageType1DStream = MainDStream.filter(message type1)
> MessageType2DStream = MainDStream.filter(message type2)
> MessageType3DStream = MainDStream.filter(message type3)
>
> Then proceed with your processing independently on MessageType1DStream,
> MessageType2DStream and MessageType3DStream, i.e. each of them is the starting
> point of a new DAG pipeline running in parallel.
>
> *From:* Tathagata Das [mailto:t...@databricks.com]
> *Sent:* Thursday, April 16, 2015 12:52 AM
> *To:* Jianshi Huang
> *Cc:* user; Shao, Saisai; Huang Jie
> *Subject:* Re: How to do dispatching in Streaming?
>
> It may be worthwhile to architect the computation in a different way.
>
> dstream.foreachRDD { rdd =>
>   rdd.foreach { record =>
>     // do different things for each record based on filters
>   }
> }
>
> TD
>
> On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang <jianshi.hu...@gmail.com> wrote:
>
> Hi,
>
> I have a Kafka topic that contains dozens of different types of messages,
> and for each one I'll need to create a DStream.
>
> Currently I have to filter the Kafka stream over and over, which is very
> inefficient.
>
> So what's the best way to do dispatching in Spark Streaming?
> (one DStream -> multiple DStreams)
>
> Thanks,
>
> --
> Jianshi Huang
>
> LinkedIn: jianshi
> Twitter: @jshuang
> Github & Blog: http://huangjs.github.com/