According to the Spark Streaming docs, the default storage level for data received through receivers is MEMORY_AND_DISK_SER_2, and RDDs generated by windowing operations are implicitly persisted with StorageLevel.MEMORY_ONLY_SER:

http://spark.apache.org/docs/latest/streaming-programming-guide.html#data-serialization
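To make that concrete, here is a minimal sketch (app name, host, and port are placeholders) of where these storage levels surface in the API: the receiver-based input streams accept an explicit StorageLevel, and a derived DStream can be persisted directly.

import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StorageLevelSketch") // placeholder name
val ssc = new StreamingContext(conf, Seconds(10))

// Receiver-based input stream; MEMORY_AND_DISK_SER_2 is already the
// default, spelled out here only to make it visible.
val lines = ssc.socketTextStream("localhost", 9999,
  StorageLevel.MEMORY_AND_DISK_SER_2)

// A derived DStream can also be persisted explicitly; MEMORY_ONLY_SER
// matches the default used for RDDs generated by streaming computations.
val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_ONLY_SER)
words.print()

ssc.start()
ssc.awaitTermination()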
Mohammed
Author: Big Data Analytics with Spark
http://www.amazon.com/Big-Data-Analytics-Spark-Practitioners/dp/1484209656/

From: Mich Talebzadeh [mailto:mich.talebza...@gmail.com]
Sent: Saturday, August 6, 2016 12:25 PM
To: Mohammed Guller
Cc: Jacek Laskowski; Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

Hi,

I think the default storage level is MEMORY_ONLY:
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence

HTH

Dr Mich Talebzadeh

On 6 August 2016 at 18:16, Mohammed Guller <moham...@glassbeam.com> wrote:

Hi Jacek,

Yes, I am assuming that data streams in consistently at the same rate (for example, 100 MB/s).

BTW, even if the persistence level for streaming data is set to MEMORY_AND_DISK_SER_2 (the default), once Spark runs out of memory, data will spill to disk. That will make the application's performance even worse.

Mohammed

From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Saturday, August 6, 2016 1:54 AM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: RE: Explanation regarding Spark Streaming

Hi,

Thanks for the explanation, but it does not prove Spark will OOM at some point. You assume there is enough data to store, but there could be none.

Jacek

On 6 Aug 2016 4:23 a.m., "Mohammed Guller" <moham...@glassbeam.com> wrote:

Assume the batch interval is 10 seconds and the batch processing time is 30 seconds. While Spark Streaming is processing the first batch, the receiver will have a backlog of 20 seconds' worth of data. By the time Spark Streaming finishes batch #2, the receiver will have 40 seconds' worth of data in its memory buffer. This backlog will keep growing as time passes, assuming data streams in consistently at the same rate.

Also keep in mind that windowing operations on a DStream implicitly persist every RDD in the DStream in memory.

Mohammed

-----Original Message-----
From: Jacek Laskowski [mailto:ja...@japila.pl]
Sent: Thursday, August 4, 2016 4:25 PM
To: Mohammed Guller
Cc: Saurav Sinha; user
Subject: Re: Explanation regarding Spark Streaming

On Fri, Aug 5, 2016 at 12:48 AM, Mohammed Guller <moham...@glassbeam.com> wrote:
> and eventually you will run out of memory.

Why? Mind elaborating?

Jacek
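For reference, the windowing point discussed above can be seen in a minimal sketch (app name, host, port, and durations are illustrative): window() implicitly persists the RDDs it covers, so when batch processing takes longer than the batch interval, those persisted RDDs and the receiver's buffered blocks accumulate.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("WindowBacklogSketch") // placeholder name
val ssc = new StreamingContext(conf, Seconds(10)) // 10-second batch interval

val lines = ssc.socketTextStream("localhost", 9999)

// A 60-second window sliding every 10 seconds. The RDDs covered by
// the window are implicitly persisted so each slide can reuse them;
// if processing each batch takes longer than 10 seconds, the backlog
// described above builds up in memory.
val windowed = lines.window(Seconds(60), Seconds(10))
windowed.count().print()

ssc.start()
ssc.awaitTermination()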