Re: Explanation regarding Spark Streaming

2016-08-06 Thread Mich Talebzadeh
Thanks.

This is somewhat confusing, as the thread owner's question does not specify
whether windowing operations are involved.

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 August 2016 at 20:35, Mohammed Guller <moham...@glassbeam.com> wrote:

> According to the docs for Spark Streaming, the default for data received
> through receivers is MEMORY_AND_DISK_SER_2. If windowing operations are
> performed, RDDs are persisted with StorageLevel.MEMORY_ONLY_SER.
>
>
>
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#data-serialization
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
> <http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>
>
>
>
> From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
> Sent: Saturday, August 6, 2016 12:25 PM
> To: Mohammed Guller
> Cc: Jacek Laskowski; Saurav Sinha; user
>
> Subject: Re: Explanation regarding Spark Streaming
>
>
>
> Hi,
>
>
>
> I think the default storage level
> <http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence> is
> MEMORY_ONLY
>
>
>
> HTH
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> Disclaimer: Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
>
> On 6 August 2016 at 18:16, Mohammed Guller <moham...@glassbeam.com> wrote:
>
> Hi Jacek,
>
> Yes, I am assuming that data streams in consistently at the same rate (for
> example, 100MB/s).
>
>
>
> BTW, even if the persistence level for streaming data is set to
> MEMORY_AND_DISK_SER_2 (the default), once Spark runs out of memory, data
> will spill to disk. That will make the application performance even worse.
>
>
>
> Mohammed
>
>
>
> From: Jacek Laskowski [mailto:ja...@japila.pl]
> Sent: Saturday, August 6, 2016 1:54 AM
> To: Mohammed Guller
> Cc: Saurav Sinha; user
> Subject: RE: Explanation regarding Spark Streaming
>
>
>
> Hi,
>
> Thanks for the explanation, but it does not prove Spark will OOM at some
> point. You assume there is enough data to store, but there could be none.
>
> Jacek
>
>
>
> On 6 Aug 2016 4:23 a.m., "Mohammed Guller" <moham...@glassbeam.com> wrote:
>
> Assume the batch interval is 10 seconds and the batch processing time is 30
> seconds. So while Spark Streaming is processing the first batch, the
> receiver will have a backlog of 20 seconds' worth of data. By the time Spark
> Streaming finishes batch #2, the receiver will have 40 seconds' worth of
> data in its memory buffer. This backlog will keep growing as time passes,
> assuming data streams in consistently at the same rate.
>
> Also keep in mind that windowing operations on a DStream implicitly
> persist every RDD in the DStream in memory.
>
> Mohammed
>
> -----Original Message-----
> From: Jacek Laskowski [mailto:ja...@japila.pl]
> Sent: Thursday, August 4, 2016 4:25 PM
> To: Mohammed Guller
> Cc: Saurav Sinha; user
> Subject: Re: Explanation regarding Spark Streaming
>
> On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com>
> wrote:
> > and eventually you will run out of memory.
>
> Why? Mind elaborating?
>
> Jacek
>
>
>


RE: Explanation regarding Spark Streaming

2016-08-06 Thread Mohammed Guller
According to the docs for Spark Streaming, the default for data received 
through receivers is MEMORY_AND_DISK_SER_2. If windowing operations are 
performed, RDDs are persisted with StorageLevel.MEMORY_ONLY_SER.

http://spark.apache.org/docs/latest/streaming-programming-guide.html#data-serialization
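
For what it's worth, a rough sketch of how the receiver's storage level can
be set explicitly when the input DStream is created (the host, port and
batch interval are made up; socketTextStream accepts an optional
StorageLevel):

    import org.apache.spark.SparkConf
    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("ReceiverStorageDemo")
    val ssc  = new StreamingContext(conf, Seconds(10))

    // Receiver input defaults to MEMORY_AND_DISK_SER_2
    // (serialized, replicated to two nodes).
    val replicated = ssc.socketTextStream("localhost", 9999)

    // Passing a StorageLevel explicitly, e.g. to drop the 2x replication.
    val unreplicated = ssc.socketTextStream("localhost", 9999,
      StorageLevel.MEMORY_AND_DISK_SER)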

Mohammed
Author: Big Data Analytics with Spark
<http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/>

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Saturday, August 6, 2016 12:25 PM
To: Mohammed Guller
Cc: Jacek Laskowski; Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

Hi,

I think the default storage level 
<http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence> is 
MEMORY_ONLY

HTH




Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



Disclaimer: Use it at your own risk. Any and all responsibility for any loss, 
damage or destruction of data or any other property which may arise from 
relying on this email's technical content is explicitly disclaimed. The author 
will in no case be liable for any monetary damages arising from such loss, 
damage or destruction.



On 6 August 2016 at 18:16, Mohammed Guller <moham...@glassbeam.com> wrote:
Hi Jacek,
Yes, I am assuming that data streams in consistently at the same rate (for 
example, 100MB/s).

BTW, even if the persistence level for streaming data is set to 
MEMORY_AND_DISK_SER_2 (the default), once Spark runs out of memory, data will 
spill to disk. That will make the application performance even worse.

Mohammed

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Saturday, August 6, 2016 1:54 AM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: RE: Explanation regarding Spark Streaming


Hi,

Thanks for the explanation, but it does not prove Spark will OOM at some point.
You assume there is enough data to store, but there could be none.

Jacek

On 6 Aug 2016 4:23 a.m., "Mohammed Guller" <moham...@glassbeam.com> wrote:
Assume the batch interval is 10 seconds and the batch processing time is 30
seconds. So while Spark Streaming is processing the first batch, the receiver
will have a backlog of 20 seconds' worth of data. By the time Spark Streaming
finishes batch #2, the receiver will have 40 seconds' worth of data in its
memory buffer. This backlog will keep growing as time passes, assuming data
streams in consistently at the same rate.

Also keep in mind that windowing operations on a DStream implicitly persist
every RDD in the DStream in memory.

Mohammed

-----Original Message-----
From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Thursday, August 4, 2016 4:25 PM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek



Re: Explanation regarding Spark Streaming

2016-08-06 Thread Mich Talebzadeh
Hi,

I think the default storage level
<http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence> is
MEMORY_ONLY
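
That is the default for plain RDD persistence; the streaming defaults
discussed elsewhere in this thread differ. A quick sketch of the contrast,
assuming sc and a DStream named lines already exist:

    import org.apache.spark.storage.StorageLevel

    // RDD persistence: cache()/persist() default to MEMORY_ONLY.
    val cachedRdd = sc.textFile("hdfs:///some/path").cache()

    // DStream persistence: persist() defaults to MEMORY_ONLY_SER, and
    // receiver input streams default to MEMORY_AND_DISK_SER_2.
    val persistedStream = lines.persist(StorageLevel.MEMORY_ONLY_SER)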

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 August 2016 at 18:16, Mohammed Guller <moham...@glassbeam.com> wrote:

> Hi Jacek,
>
> Yes, I am assuming that data streams in consistently at the same rate (for
> example, 100MB/s).
>
>
>
> BTW, even if the persistence level for streaming data is set to
> MEMORY_AND_DISK_SER_2 (the default), once Spark runs out of memory, data
> will spill to disk. That will make the application performance even worse.
>
>
>
> Mohammed
>
>
>
> From: Jacek Laskowski [mailto:ja...@japila.pl]
> Sent: Saturday, August 6, 2016 1:54 AM
> To: Mohammed Guller
> Cc: Saurav Sinha; user
> Subject: RE: Explanation regarding Spark Streaming
>
>
>
> Hi,
>
> Thanks for the explanation, but it does not prove Spark will OOM at some
> point. You assume there is enough data to store, but there could be none.
>
> Jacek
>
>
>
> On 6 Aug 2016 4:23 a.m., "Mohammed Guller" <moham...@glassbeam.com> wrote:
>
> Assume the batch interval is 10 seconds and the batch processing time is 30
> seconds. So while Spark Streaming is processing the first batch, the
> receiver will have a backlog of 20 seconds' worth of data. By the time Spark
> Streaming finishes batch #2, the receiver will have 40 seconds' worth of
> data in its memory buffer. This backlog will keep growing as time passes,
> assuming data streams in consistently at the same rate.
>
> Also keep in mind that windowing operations on a DStream implicitly
> persist every RDD in the DStream in memory.
>
> Mohammed
>
> -----Original Message-----
> From: Jacek Laskowski [mailto:ja...@japila.pl]
> Sent: Thursday, August 4, 2016 4:25 PM
> To: Mohammed Guller
> Cc: Saurav Sinha; user
> Subject: Re: Explanation regarding Spark Streaming
>
> On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com>
> wrote:
> > and eventually you will run out of memory.
>
> Why? Mind elaborating?
>
> Jacek
>


RE: Explanation regarding Spark Streaming

2016-08-06 Thread Mohammed Guller
Hi Jacek,
Yes, I am assuming that data streams in consistently at the same rate (for 
example, 100MB/s).

BTW, even if the persistence level for streaming data is set to 
MEMORY_AND_DISK_SER_2 (the default), once Spark runs out of memory, data will 
spill to disk. That will make the application performance even worse.

Mohammed

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Saturday, August 6, 2016 1:54 AM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: RE: Explanation regarding Spark Streaming


Hi,

Thanks for the explanation, but it does not prove Spark will OOM at some point.
You assume there is enough data to store, but there could be none.

Jacek

On 6 Aug 2016 4:23 a.m., "Mohammed Guller" <moham...@glassbeam.com> wrote:
Assume the batch interval is 10 seconds and the batch processing time is 30
seconds. So while Spark Streaming is processing the first batch, the receiver
will have a backlog of 20 seconds' worth of data. By the time Spark Streaming
finishes batch #2, the receiver will have 40 seconds' worth of data in its
memory buffer. This backlog will keep growing as time passes, assuming data
streams in consistently at the same rate.

Also keep in mind that windowing operations on a DStream implicitly persist
every RDD in the DStream in memory.

Mohammed

> -----Original Message-----
> From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Thursday, August 4, 2016 4:25 PM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek


Re: Explanation regarding Spark Streaming

2016-08-06 Thread Mich Talebzadeh
The thread owner's question is:

"Q1. What will happen if a Spark Streaming job has batchDurationTime as 60
sec and the processing time of the complete pipeline is greater than 60 sec?"


This basically means that you will gradually build a backlog, and
regardless of whether you blow up the buffer or not, the data analysis
will have serious flaws!

In the example below I have a batch interval of 2 seconds, streaming in
10,000 rows/events. The window length = 4 sec (twice the batch interval)
and the sliding interval is set at 2 sec. I have deliberately set the
streaming volume high in this case.
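
The skeleton of that test looks roughly like this (a sketch; sparkConf, the
socket source and the names are made up, and only the intervals match the
run described here):

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sparkConf, Seconds(2)) // batch interval 2 sec
    val events = ssc.socketTextStream("localhost", 9999)

    // Window length 4 sec (twice the batch interval), sliding every 2 sec.
    val windowed = events.window(Seconds(4), Seconds(2))
    windowed.count().print()

    ssc.start()
    ssc.awaitTermination()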

As you can see from the Streaming graphs, there are serious issues here,
with an average of 213 events/sec and a scheduling delay of 5 seconds.

[image: Spark Streaming UI graphs showing the event rate and scheduling delay]

Technically the app may not crash, but its business value is practically
nil. If I were doing this for some form of complex event processing or
fraud detection, I would have to look for a new job :(

So monitor the processing to make sure that it is running smoothly and
ensure that there is no backlog. Also look at your memory usage. For
example, if you are receiving a single stream of 100MB/second and you want
to do 60-second batches (window length), then you will need a buffer of
100*60 MB = 6000MB, or roughly 6GB, at the very least. Note that if you are
using a single receiver, then all the data is coming to a single Spark
worker machine, so that machine needs at least that much memory. Add to
that the other overheads of running Spark, etc. Calculate the memory usage
accordingly, then double or triple the number to be on the safe side, and
monitor the processing.
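
That sizing arithmetic, spelled out (the numbers are just the ones from the
example above):

    // Rough receiver-buffer sizing for a windowed stream.
    val ingestMBperSec = 100 // single stream of 100MB/second
    val windowSeconds  = 60  // 60-second window length
    val safetyFactor   = 3   // double/triple for other Spark overheads

    val minBufferMB = ingestMBperSec * windowSeconds // 6000 MB, i.e. ~6 GB
    val plannedMB   = minBufferMB * safetyFactor     // ~18 GB to be safe
    println(s"buffer >= $minBufferMB MB, plan for ~$plannedMB MB")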

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 6 August 2016 at 09:53, Jacek Laskowski <ja...@japila.pl> wrote:

> Hi,
>
> Thanks for the explanation, but it does not prove Spark will OOM at some
> point. You assume there is enough data to store, but there could be none.
>
> Jacek
>
> On 6 Aug 2016 4:23 a.m., "Mohammed Guller" <moham...@glassbeam.com> wrote:
>
>> Assume the batch interval is 10 seconds and the batch processing time is 30
>> seconds. So while Spark Streaming is processing the first batch, the
>> receiver will have a backlog of 20 seconds' worth of data. By the time Spark
>> Streaming finishes batch #2, the receiver will have 40 seconds' worth of
>> data in its memory buffer. This backlog will keep growing as time passes,
>> assuming data streams in consistently at the same rate.
>>
>> Also keep in mind that windowing operations on a DStream implicitly
>> persist every RDD in the DStream in memory.
>>
>> Mohammed
>>
>> -----Original Message-----
>> From: Jacek Laskowski [mailto:ja...@japila.pl]
>> Sent: Thursday, August 4, 2016 4:25 PM
>> To: Mohammed Guller
>> Cc: Saurav Sinha; user
>> Subject: Re: Explanation regarding Spark Streaming
>>
>> On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com>
>> wrote:
>> > and eventually you will run out of memory.
>>
>> Why? Mind elaborating?
>>
>> Jacek
>>
>


RE: Explanation regarding Spark Streaming

2016-08-06 Thread Jacek Laskowski
Hi,

Thanks for the explanation, but it does not prove Spark will OOM at some point.
You assume there is enough data to store, but there could be none.

Jacek

On 6 Aug 2016 4:23 a.m., "Mohammed Guller" <moham...@glassbeam.com> wrote:

> Assume the batch interval is 10 seconds and the batch processing time is 30
> seconds. So while Spark Streaming is processing the first batch, the
> receiver will have a backlog of 20 seconds' worth of data. By the time Spark
> Streaming finishes batch #2, the receiver will have 40 seconds' worth of
> data in its memory buffer. This backlog will keep growing as time passes,
> assuming data streams in consistently at the same rate.
>
> Also keep in mind that windowing operations on a DStream implicitly
> persist every RDD in the DStream in memory.
>
> Mohammed
>
> -----Original Message-----
> From: Jacek Laskowski [mailto:ja...@japila.pl]
> Sent: Thursday, August 4, 2016 4:25 PM
> To: Mohammed Guller
> Cc: Saurav Sinha; user
> Subject: Re: Explanation regarding Spark Streaming
>
> On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com>
> wrote:
> > and eventually you will run out of memory.
>
> Why? Mind elaborating?
>
> Jacek
>


RE: Explanation regarding Spark Streaming

2016-08-05 Thread Mohammed Guller
Assume the batch interval is 10 seconds and the batch processing time is 30
seconds. So while Spark Streaming is processing the first batch, the receiver
will have a backlog of 20 seconds' worth of data. By the time Spark Streaming
finishes batch #2, the receiver will have 40 seconds' worth of data in its
memory buffer. This backlog will keep growing as time passes, assuming data
streams in consistently at the same rate.
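
The arithmetic behind that backlog, as a small sketch (the names are
illustrative):

    // Each completed batch consumes one batch interval of data but takes
    // processingTimeSec to do so; the backlog grows by the difference.
    val batchIntervalSec  = 10
    val processingTimeSec = 30

    def backlogSecondsAfter(batches: Int): Int =
      batches * (processingTimeSec - batchIntervalSec)

    println(backlogSecondsAfter(1)) // 20 seconds of buffered data
    println(backlogSecondsAfter(2)) // 40 seconds, and growing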

Also keep in mind that windowing operations on a DStream implicitly persist
every RDD in the DStream in memory.
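
If that implicit persistence matters for memory, the storage level can also
be chosen explicitly (a sketch, assuming a DStream named events already
exists):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.Seconds

    // Window-based DStreams are persisted automatically (serialized in
    // memory by default); persist() lets you pick the level yourself.
    val windowed = events.window(Seconds(40), Seconds(10))
    windowed.persist(StorageLevel.MEMORY_AND_DISK_SER)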

Mohammed

-----Original Message-----
From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Thursday, August 4, 2016 4:25 PM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek




Re: Explanation regarding Spark Streaming

2016-08-04 Thread Jacek Laskowski
On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller  wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek




Re: Explanation regarding Spark Streaming

2016-08-04 Thread Mich Talebzadeh
Also check the Spark UI streaming section for various helpful stats. By
default it runs on port 4040, but you can change it by setting --conf
"spark.ui.port=".
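
For example, in code (the port number here is arbitrary; the same setting
can be passed to spark-submit with --conf):

    import org.apache.spark.SparkConf

    // Move the Spark UI off the default port 4040, e.g. to 4041.
    val conf = new SparkConf()
      .setAppName("StreamingJob")
      .set("spark.ui.port", "4041")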

HTH

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 4 August 2016 at 23:48, Mohammed Guller  wrote:

> The backlog will increase as time passes and eventually you will run out
> of memory.
>
>
>
> Mohammed
>
> Author: Big Data Analytics with Spark
>
>
>
> From: Saurav Sinha [mailto:sauravsinh...@gmail.com]
> Sent: Wednesday, August 3, 2016 11:57 PM
> To: user
> Subject: Explanation regarding Spark Streaming
>
>
>
> Hi,
>
>
>
> I have a query.
>
>
>
> Q1. What will happen if a Spark Streaming job has batchDurationTime as 60
> sec and the processing time of the complete pipeline is greater than 60 sec?
>
>
>
> --
>
> Thanks and Regards,
>
>
>
> Saurav Sinha
>
>
>
> Contact: 9742879062
>


RE: Explanation regarding Spark Streaming

2016-08-04 Thread Mohammed Guller
The backlog will increase as time passes and eventually you will run out of 
memory.

Mohammed
Author: Big Data Analytics with Spark

From: Saurav Sinha [mailto:sauravsinh...@gmail.com]
Sent: Wednesday, August 3, 2016 11:57 PM
To: user
Subject: Explanation regarding Spark Streaming

Hi,

I have a query.

Q1. What will happen if a Spark Streaming job has batchDurationTime as 60 sec
and the processing time of the complete pipeline is greater than 60 sec?
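
For reference, that scenario corresponds to a streaming context like this (a
sketch in which only the 60-second batch duration comes from the question):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val conf = new SparkConf().setAppName("PipelineUnderTest")
    // A new batch is formed every 60 sec, so the pipeline must finish each
    // batch within 60 sec on average or a backlog keeps building up.
    val ssc = new StreamingContext(conf, Seconds(60))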

--
Thanks and Regards,

Saurav Sinha

Contact: 9742879062