Re: Is "spark streaming" streaming or mini-batch?

Mich Talebzadeh Wed, 24 Aug 2016 12:12:25 -0700

Is "spark streaming" streaming or mini-batch?

I look at something like Complex Event Processing (CEP), which is a leading
use case for data streaming (and I am experimenting with Spark for it). In
the realm of CEP there is really no such thing as continuous data
streaming. The point is that when the source sends data out, it is never
truly continuous. What actually happens is that "discrete digital messages"
are sent out. This is in contrast to an FM radio signal or sinusoidal waves,
which are continuous analog signals. In the world of CEP, however, digital
data is always sent as bytes, typically with the bytes grouped into
messages, as an event-driven signal.
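
As a rough illustration of that point, here is a minimal sketch in plain
Python (not Spark code) of how discrete, timestamped messages can be grouped
into micro-batches and then into sliding windows. The function names
micro_batches and sliding_windows and the sample events are made up for this
example. The 2-second batch interval, 4-second window and 2-second slide are
the values from my earlier message quoted below, and the check that window
and slide are multiples of the batch interval mirrors what Spark Streaming
requires of its windowed DStreams.

```python
from collections import defaultdict


def micro_batches(events, batch_interval):
    """Group (timestamp_seconds, payload) events by the end time of their batch.

    An event at time ts falls into the batch covering
    [k * batch_interval, (k + 1) * batch_interval) and is keyed by the
    batch end time (k + 1) * batch_interval.
    """
    batches = defaultdict(list)
    for ts, payload in events:
        batch_end = (int(ts // batch_interval) + 1) * batch_interval
        batches[batch_end].append(payload)
    return dict(batches)


def sliding_windows(batches, batch_interval, window_length, slide_interval):
    """Combine whole batches into windows, keyed by window end time."""
    # Assumption mirrored from Spark Streaming: both values must be
    # multiples of the batch interval.
    if window_length % batch_interval or slide_interval % batch_interval:
        raise ValueError("window and slide must be multiples of the batch interval")
    if not batches:
        return {}
    windows = {}
    for window_end in range(slide_interval, max(batches) + 1, slide_interval):
        first_batch_end = window_end - window_length + batch_interval
        window = []
        for batch_end in range(first_batch_end, window_end + 1, batch_interval):
            window.extend(batches.get(batch_end, []))
        windows[window_end] = window
    return windows


# Hypothetical events: (arrival time in seconds, message payload).
events = [(0.3, "a"), (1.7, "b"), (2.1, "c"), (5.9, "d")]
batches = micro_batches(events, batch_interval=2)
windows = sliding_windows(batches, batch_interval=2, window_length=4, slide_interval=2)
```

With these settings each window overlaps the previous one by a single batch,
so an event is normally seen in two consecutive windows. Nothing here is
continuous: the source emits discrete messages, and the engine only decides
how finely to slice them.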


For certain streaming use cases, Spark is perfectly OK (leaving aside Flink
and the other alternatives around).

HTH

Dr Mich Talebzadeh



LinkedIn:
https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 August 2016 at 10:40, Steve Loughran <ste...@hortonworks.com> wrote:

>
> On 23 Aug 2016, at 17:58, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
> In general, depending on what you are doing, you can tighten the above
> parameters. For example, if you are using Spark Streaming for anti-fraud
> detection, you may stream data in at a 2-second batch interval, and keep
> your window length at 4 seconds and your sliding interval at 2 seconds,
> which gives you a kind of tight streaming. You are aggregating the data
> that you collect over the window.
>
>
> I should warn that in https://github.com/apache/spark/pull/14731 I've
> been trying to speed up input scanning against object stores, and
> collecting numbers along the way.
>
> *If you are using the FileInputDStream to scan S3, Azure (and presumably
> GCS) object stores for data, the time to scan a moderately complex
> directory tree is going to be measurable in seconds.*
>
> It's going to depend on the distance from the object store and the number
> of files, but you'll probably need to use a bigger window.
>
> (That patch for SPARK-17159 should improve things. I'd love some people
> to help by testing it, or by emailing me directly with an (anonymised)
> listing of the directory structures they use in object store
> FileInputDStream streams, which I could regenerate for inclusion in some
> performance tests.)
>
>
>
