RE: How to do dispatching in Streaming?

2015-04-17 Thread Evo Eftimov
Good use of analogies :)

 

Yep, friction (or entropy in general) exists in everything – but hey, by adding 
and doing “more work” at the same time (aka more powerful rockets), some people 
have overcome the friction of the air and even got as far as the moon and 
beyond.

 

It is all about the bottom line / the big picture – in some models friction 
can be a huge factor in the equations, in others it is just part of the 
landscape.

 

From: Gerard Maas [mailto:gerard.m...@gmail.com] 
Sent: Friday, April 17, 2015 10:12 AM
To: Evo Eftimov
Cc: Tathagata Das; Jianshi Huang; user; Shao, Saisai; Huang Jie
Subject: Re: How to do dispatching in Streaming?

 

Evo,

 

In Spark there's a fixed scheduling cost for each task, so more tasks mean an 
increased bottom line for the same amount of work being done. The number of 
tasks per batch interval should relate to the CPU resources available for the 
job, following the same rule of thumb as for core Spark: 2-3 times the # of 
cores.
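
As a rough illustration of that rule of thumb – a minimal sketch, where the 3x 
factor and the use of defaultParallelism as a stand-in for the available cores 
are assumptions, not part of the advice above. Tasks per stage track the 
partition count, so the partition count per batch can be capped:

  // keep partitions (and hence tasks per stage) at roughly 2-3x the cores
  val cores = ssc.sparkContext.defaultParallelism // stand-in for available cores
  dstream.foreachRDD { rdd =>
    val target = 3 * cores
    val sized = if (rdd.partitions.length > target) rdd.coalesce(target) else rdd
    sized.foreach(record => ()) // ... actual per-record processing ...
  }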

 

In that physical model presented before, I think we could consider this 
scheduling cost as a form of friction.

 

-kr, Gerard.

 



Re: How to do dispatching in Streaming?

2015-04-17 Thread Jianshi Huang
Thanks everyone for the replies.

Looks like foreachRDD + filtering is the way to go. I'll have 4 independent
Spark Streaming applications, so the overhead seems acceptable.

Jianshi


-- 
Jianshi Huang

LinkedIn: jianshi
Twitter: @jshuang
Github & Blog: http://huangjs.github.com/


RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
And yet another way is to demultiplex at one point, which will yield separate 
DStreams for each message type, which you can then process in independent DAG 
pipelines in the following way:

 

val messageType1DStream = mainDStream.filter(/* message type 1 predicate */)

val messageType2DStream = mainDStream.filter(/* message type 2 predicate */)

val messageType3DStream = mainDStream.filter(/* message type 3 predicate */)

 

Then proceed with your processing independently on messageType1DStream, 
messageType2DStream and messageType3DStream, i.e. each of them is the starting 
point of a new DAG pipeline running in parallel.
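
A self-contained sketch of this demultiplexing approach – the Message type, the 
socket source and the type names are hypothetical stand-ins (with Kafka you 
would plug in the real input stream instead):

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.dstream.DStream

  case class Message(msgType: String, payload: String)

  object DemuxExample {
    def main(args: Array[String]): Unit = {
      val ssc = new StreamingContext(
        new SparkConf().setAppName("demux").setMaster("local[4]"), Seconds(5))

      // stand-in source; in practice this would be the Kafka stream
      val mainDStream: DStream[Message] =
        ssc.socketTextStream("localhost", 9999).map { line =>
          val Array(t, p) = line.split(",", 2)
          Message(t, p)
        }

      // demultiplex once; each filtered DStream heads its own DAG pipeline
      val messageType1DStream = mainDStream.filter(_.msgType == "type1")
      val messageType2DStream = mainDStream.filter(_.msgType == "type2")

      messageType1DStream.foreachRDD(_.foreach(m => println(s"pipeline 1: $m")))
      messageType2DStream.foreachRDD(_.foreach(m => println(s"pipeline 2: $m")))

      ssc.start()
      ssc.awaitTermination()
    }
  }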

 


 



Re: How to do dispatching in Streaming?

2015-04-16 Thread Gerard Maas
From experience, I'd recommend using the dstream.foreachRDD method and
doing the filtering within that context. Extending TD's example,
something like this:

dstream.foreachRDD { rdd =>
  rdd.cache()
  messageTypes.foreach { msgTyp =>
    // schematic predicate – keep only the records of this message type
    val selection = rdd.filter(msgTyp.matches(_))
    selection.foreach { ... }
  }
  rdd.unpersist()
}
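
A more concrete version of that sketch – Message and messageTypes are
hypothetical stand-ins, and dstream is assumed to be a DStream[Message]:

  case class Message(msgType: String, payload: String)
  val messageTypes = Seq("type1", "type2", "type3")

  dstream.foreachRDD { rdd =>
    rdd.cache() // the RDD is traversed once per message type
    messageTypes.foreach { msgTyp =>
      val selection = rdd.filter(_.msgType == msgTyp)
      selection.foreach(m => println(s"$msgTyp: ${m.payload}")) // type-specific handling
    }
    rdd.unpersist()
  }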

I would discourage the use of:

val messageType1DStream = mainDStream.filter(/* message type 1 predicate */)

val messageType2DStream = mainDStream.filter(/* message type 2 predicate */)

val messageType3DStream = mainDStream.filter(/* message type 3 predicate */)

Because it will be a lot more work to process on the Spark side.
Each DStream will schedule tasks for each partition, resulting in #dstreams x
#partitions x #stages tasks instead of the #partitions x #stages of the
approach presented above (e.g. 3 message-type DStreams over 10 partitions and
2 stages give 60 tasks per batch instead of 20).


-kr, Gerard.






RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
Ooops – what does “more work” mean in a Parallel Programming paradigm, and does 
it always translate into “inefficiency”?

 

Here are a few laws of physics in this space:

 

1. More Work, if done AT THE SAME time AND fully utilizing the cluster 
resources, is a GOOD thing

2. More Work which cannot be done at the same time and has to be 
processed sequentially is a BAD thing

 

So the key is whether it is case 1 or case 2, and if it is case 1, whether it 
leads to e.g. higher throughput and lower latency or not.

 

Regards,

Evo Eftimov 

 


 

 



RE: How to do dispatching in Streaming?

2015-04-16 Thread Evo Eftimov
Also, you can have each message type in a different topic (this needs to be 
arranged upstream of your Spark Streaming app, i.e. in the publishing systems 
and the messaging brokers), and then for each topic you can have a dedicated 
ReceiverInputDStream instance, which will be the start of a dedicated DAG 
pipeline instance for every message type. Moreover, each such DAG pipeline 
instance will run in parallel with the others.
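
For illustration, a minimal sketch of that setup with the receiver-based Kafka 
API (requires the spark-streaming-kafka artifact; topic names, ZooKeeper quorum 
and consumer group are hypothetical):

  import org.apache.spark.streaming.kafka.KafkaUtils

  // one receiver input DStream per topic, i.e. per message type;
  // each one is the start of an independent pipeline
  val topics = Seq("messageType1", "messageType2", "messageType3")
  topics.foreach { topic =>
    val stream = KafkaUtils.createStream(ssc, "zkhost:2181", "my-group", Map(topic -> 1))
    // createStream yields (key, message) pairs; keep the message
    stream.map(_._2).foreachRDD(_.foreach(msg => println(s"$topic: $msg")))
  }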

 


 



Re: How to do dispatching in Streaming?

2015-04-15 Thread Tathagata Das
It may be worthwhile to architect the computation in a different way.

dstream.foreachRDD { rdd =>
  rdd.foreach { record =>
    // do different things for each record based on filters
  }
}
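
For instance, the per-record branching could be a pattern match – Message and 
the handler bodies are hypothetical stand-ins:

  case class Message(msgType: String, payload: String)

  dstream.foreachRDD { rdd =>
    rdd.foreach {
      case Message("type1", payload) => println(s"type1 handler: $payload")
      case Message("type2", payload) => println(s"type2 handler: $payload")
      case _ => () // ignore or log unknown message types
    }
  }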

TD

On Sun, Apr 12, 2015 at 7:52 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:

 Hi,

 I have a Kafka topic that contains dozens of different types of messages,
 and for each one I'll need to create a DStream.

 Currently I have to filter the Kafka stream over and over, which is very
 inefficient.

 So what's the best way to do dispatching in Spark Streaming? (one DStream
 -> multiple DStreams)


 Thanks,
 --
 Jianshi Huang

 LinkedIn: jianshi
 Twitter: @jshuang
 Github & Blog: http://huangjs.github.com/