Re: Is "spark streaming" streaming or mini-batch?

2016-08-24 Thread Mich Talebzadeh
Is "spark streaming" streaming or mini-batch?

Consider Complex Event Processing (CEP), which is a leading use case for
data streaming (and one I am experimenting with Spark for). In the realm of
CEP there is really no such thing as continuous data streaming. The point
is that when the source sends data out, it is never truly continuous. What
actually happens is that "discrete digital messages" are sent out. This is
in contrast to an FM radio signal or sinusoidal waves, which are continuous
analog signals. In the world of CEP, however, the data is digital: it is
always sent as bytes, typically grouped into messages, as an event-driven
signal.

For certain streaming workloads, the use of Spark is perfectly OK (leaving
aside Flink and the other engines around).

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 24 August 2016 at 10:40, Steve Loughran <ste...@hortonworks.com> wrote:

>
> On 23 Aug 2016, at 17:58, Mich Talebzadeh <mich.talebza...@gmail.com>
> wrote:
>
>> In general, depending on what you are doing, you can tighten the above
>> parameters. For example, if you are using Spark Streaming for anti-fraud
>> detection, you may stream data in at a 2-second batch interval, keeping
>> your window length at 4 seconds and your sliding interval at 2 seconds,
>> which gives you a kind of tight streaming. You are aggregating the data
>> that you collect over the batch window.
>
>
> I should warn that in https://github.com/apache/spark/pull/14731 I've
> been trying to speed up input scanning against object stores, and
> collecting numbers on the way.
>
> *If you are using the FileInputDStream to scan S3, Azure (and presumably
> GCS) object stores for data, the time to scan a moderately complex
> directory tree is going to be measurable in seconds.*
>
> It's going to depend on distance from the object store and number of
> files, but you'll probably need to use a bigger window.
>
> (That patch for SPARK-17159 should improve things ... I'd love some people
> to help by testing it, or by emailing me directly with an (anonymised)
> listing of the directory structures they use in object store
> FileInputDStream streams, which I could regenerate for inclusion in some
> performance tests.)
>
>
>


Re: Is "spark streaming" streaming or mini-batch?

2016-08-24 Thread Steve Loughran

On 23 Aug 2016, at 17:58, Mich Talebzadeh wrote:

> In general, depending on what you are doing, you can tighten the above
> parameters. For example, if you are using Spark Streaming for anti-fraud
> detection, you may stream data in at a 2-second batch interval, keeping
> your window length at 4 seconds and your sliding interval at 2 seconds,
> which gives you a kind of tight streaming. You are aggregating the data
> that you collect over the batch window.

I should warn that in https://github.com/apache/spark/pull/14731 I've been
trying to speed up input scanning against object stores, and collecting
numbers on the way.

*If you are using the FileInputDStream to scan S3, Azure (and presumably GCS)
object stores for data, the time to scan a moderately complex directory tree
is going to be measurable in seconds.*

It's going to depend on distance from the object store and number of files,
but you'll probably need to use a bigger window.

(That patch for SPARK-17159 should improve things ... I'd love some people to
help by testing it, or by emailing me directly with an (anonymised) listing
of the directory structures they use in object store FileInputDStream
streams, which I could regenerate for inclusion in some performance tests.)




Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Matei Zaharia
I think people explained this pretty well, but in practice, this distinction is 
also somewhat of a marketing term, because every system will perform some kind 
of batching. For example, every time you use TCP, the OS and network stack may 
buffer multiple messages together and send them at once; and likewise, 
virtually all streaming engines can batch data internally to achieve higher 
throughput. Furthermore, in all APIs, you can see individual records and 
respond to them one by one. The main question is just what overall performance 
you get (throughput and latency).

Matei

> On Aug 23, 2016, at 4:08 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
> 
> Thanks everyone for clarifying.
> 
> On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal <asmbans...@gmail.com> wrote:
>> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
>> and it mentioned that Spark Streaming is actually mini-batch, not actual
>> streaming.
>>
>> I have not used streaming and I am not sure what the difference between
>> the two terms is, so I could not make a judgement myself.
> 



Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
Thanks everyone for clarifying.

On Tue, Aug 23, 2016 at 9:11 PM, Aseem Bansal <asmbans...@gmail.com> wrote:

> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
> and it mentioned that Spark Streaming is actually mini-batch, not actual
> streaming.
>
> I have not used streaming and I am not sure what the difference between
> the two terms is, so I could not make a judgement myself.
>


Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Mich Talebzadeh
Russell is correct here.

Micro-batch means that processing is done within a window. In general there
are three parameters involved here.

batch window

This is the basic interval at which the system will receive the data in
batches. This is the interval set when creating a StreamingContext. For
example, if you set the batch interval to 30 seconds, then any input
DStream will generate RDDs of received data at 30-second intervals.

Within streaming you have what is called "a window operator" which is
defined by two parameters -

- WindowDuration / WindowLength - the length of the window
- SlideDuration / SlidingInterval - the interval at which the window will
slide or move forward

Example

batch window = 30 seconds
window length = 10 minutes
sliding interval = 5 minutes

In that case, you would be creating an output every 5 minutes, aggregating
the data that you were collecting every 30 seconds over the previous
10-minute period.
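
The arithmetic behind this example can be sketched in a few lines of plain
Python (no Spark required). The helper name window_plan is hypothetical and
not part of any Spark API; the check that the window length and sliding
interval are whole multiples of the batch interval reflects the constraint
that Spark Streaming documents for window operations.

```python
def window_plan(batch_s, window_s, slide_s):
    """Summarise how a sliding window maps onto micro-batches (all values in seconds)."""
    # Window length and sliding interval are expected to be whole multiples
    # of the batch interval.
    if window_s % batch_s or slide_s % batch_s:
        raise ValueError("window and slide must be multiples of the batch interval")
    return {
        "batches_per_window": window_s // batch_s,   # batches aggregated per output
        "batches_per_slide": slide_s // batch_s,     # new batches between outputs
        "overlap_s": max(window_s - slide_s, 0),     # data shared by consecutive windows
    }


plan = window_plan(batch_s=30, window_s=10 * 60, slide_s=5 * 60)
print(plan)
```

With the values above, each output aggregates 20 batches, 10 of which are
new since the previous output, so consecutive windows share 5 minutes of data.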

In general, depending on what you are doing, you can tighten the above
parameters. For example, if you are using Spark Streaming for anti-fraud
detection, you may stream data in at a 2-second batch interval, keeping
your window length at 4 seconds and your sliding interval at 2 seconds,
which gives you a kind of tight streaming. You are aggregating the data
that you collect over the batch window.
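
To make the "tight" setting concrete, here is a rough simulation in plain
Python of a 4-second window sliding every 2 seconds over 2-second batches.
It is only a sketch: sliding_window_sums is a hypothetical helper rather
than a Spark call, the per-batch counts are made up, and the early windows
are simply allowed to be shorter until enough batches have arrived.

```python
def sliding_window_sums(batch_counts, batches_per_window, batches_per_slide):
    """Sum per-batch counts over a sliding window, emitting one result per slide."""
    results = []
    for end in range(batches_per_slide, len(batch_counts) + 1, batches_per_slide):
        # The window covers the most recent batches; it is shorter at the start.
        start = max(0, end - batches_per_window)
        results.append(sum(batch_counts[start:end]))
    return results


# Hypothetical transaction counts for five consecutive 2-second batches.
counts = [3, 5, 2, 7, 1]
# 4-second window = 2 batches; 2-second slide = 1 batch.
sums = sliding_window_sums(counts, batches_per_window=2, batches_per_slide=1)
print(sums)
```

Every batch triggers an output, and each output combines the current batch
with the one before it.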

HTH



Dr Mich Talebzadeh






On 23 August 2016 at 17:34, Russell Spitzer <russell.spit...@gmail.com>
wrote:

> Spark Streaming does not process one event at a time, which is, I think,
> what people generally call "streaming." Instead it processes groups of
> events. Each group is a "micro-batch" that gets processed at the same time.
>
> Streaming theoretically always has better latency because the event is
> processed as soon as it arrives, whereas in micro-batching the latency of
> every event in the batch can be no better than that of the last element to
> arrive.
>
> Streaming theoretically has worse throughput because events cannot be
> processed in bulk.
>
> In practice, throughput and latency are very implementation dependent.
>
> On Tue, Aug 23, 2016 at 8:41 AM Aseem Bansal <asmbans...@gmail.com> wrote:
>
>> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
>> and it mentioned that Spark Streaming is actually mini-batch, not actual
>> streaming.
>>
>> I have not used streaming and I am not sure what the difference between
>> the two terms is, so I could not make a judgement myself.
>>
>


Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Russell Spitzer
Spark Streaming does not process one event at a time, which is, I think,
what people generally call "streaming." Instead it processes groups of
events. Each group is a "micro-batch" that gets processed at the same time.

Streaming theoretically always has better latency because the event is
processed as soon as it arrives, whereas in micro-batching the latency of
every event in the batch can be no better than that of the last element to
arrive.

Streaming theoretically has worse throughput because events cannot be
processed in bulk.

In practice, throughput and latency are very implementation dependent.
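
A toy model in plain Python may help illustrate both sides of that
trade-off. It is a sketch under simplifying assumptions, not a benchmark:
the function names, arrival times and costs are all invented, processing
cost is treated as a constant, and a fixed overhead is assumed to be paid
once per invocation regardless of how many events it handles.

```python
def per_event_latencies(arrivals, cost):
    """Record-at-a-time: each event is handled on arrival and pays only the cost."""
    return [cost for _ in arrivals]


def micro_batch_latencies(arrivals, batch_interval, cost):
    """Micro-batch: an event waits for its batch to close, then pays the cost."""
    latencies = []
    for t in arrivals:
        # End of the batch this event falls into (boundary events go to the next batch).
        batch_end = (int(t // batch_interval) + 1) * batch_interval
        latencies.append(batch_end - t + cost)
    return latencies


def total_cost_us(n_events, group_size, overhead_us, per_event_us):
    """Total processing time when a fixed overhead is paid once per invocation."""
    invocations = -(-n_events // group_size)  # ceiling division
    return invocations * overhead_us + n_events * per_event_us


arrivals = [0.5, 1.0, 1.5, 2.5, 3.0]  # hypothetical arrival times, in seconds
streaming = per_event_latencies(arrivals, cost=0.25)
micro = micro_batch_latencies(arrivals, batch_interval=2.0, cost=0.25)
print(max(streaming), max(micro))

# Hypothetical costs in microseconds: 1000 per invocation, 100 per event.
one_at_a_time = total_cost_us(1000, group_size=1, overhead_us=1000, per_event_us=100)
in_bulk = total_cost_us(1000, group_size=100, overhead_us=1000, per_event_us=100)
print(one_at_a_time, in_bulk)
```

In this model the micro-batch latency is dominated by the wait for the batch
to close, while the bulk run spends far less total time on per-invocation
overhead. Real systems will differ, as noted above.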

On Tue, Aug 23, 2016 at 8:41 AM Aseem Bansal <asmbans...@gmail.com> wrote:

> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
> and it mentioned that Spark Streaming is actually mini-batch, not actual
> streaming.
>
> I have not used streaming and I am not sure what the difference between
> the two terms is, so I could not make a judgement myself.
>


Re: Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread pandees waran
It's based on a "micro-batching" model.


> On Aug 23, 2016, at 8:41 AM, Aseem Bansal <asmbans...@gmail.com> wrote:
> 
> I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
> and it mentioned that Spark Streaming is actually mini-batch, not actual
> streaming.
> 
> I have not used streaming and I am not sure what the difference between
> the two terms is, so I could not make a judgement myself.


Is "spark streaming" streaming or mini-batch?

2016-08-23 Thread Aseem Bansal
I was reading this article https://www.inovex.de/blog/storm-in-a-teacup/
and it mentioned that Spark Streaming is actually mini-batch, not actual
streaming.

I have not used streaming and I am not sure what the difference between the
two terms is, so I could not make a judgement myself.