Please see the inline comments.
On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:

> Thank you.
>
> So if I create a Spark stream then:
>
> 1. Will the streams always need to be cached? Can they not be stored in
> persistent storage?

You don't need to cache the stream explicitly unless you have a specific
requirement; Spark will do it for you depending on the streaming source
(Kafka or socket).

> 2. Will the cached stream data be distributed among all Spark nodes,
> across the executors?
> 3. As I understand it, each Spark worker node has one executor, which
> includes a cache, so the streaming data is distributed among these worker
> node caches. For example, if I have 4 worker nodes, each cache will hold
> a quarter of the data (this assumes the cache size is the same on every
> worker node).

Ideally it will be distributed evenly across the executors; this is also a
target for tuning. Normally it depends on several conditions, such as
receiver distribution and partition distribution.

> The issue arises if the amount of streaming data does not fit into these
> 4 caches. Will the job crash?
>
> On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> No, each executor only stores part of the data in memory (it depends on
> how the partitions are distributed and how many receivers you have).
>
> For a WindowedDStream, it will obviously cache the data in memory; from
> my understanding you don't need to call cache() again.
>
> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>
> Hi,
>
> So if I have 10 GB of streaming data coming in, does it require 10 GB of
> memory in each node?
>
> Also, in that case, why do we need to use dstream.cache()?
>
> Thanks
>
> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> It depends on how you write the Spark application. Normally, if the data
> is already on persistent storage, there is no need to put it into memory.
> The reason Spark Streaming data has to be stored in memory is that a
> streaming source is not a persistent source, so you need a place to store
> the data.
>
> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:
>
> Thanks.
> What if I use batch calculation instead of stream computing? Do I still
> need that much memory? For example, if the 24-hour data set is 100 GB, do
> I also need 100 GB of RAM to do the one-time batch calculation?
>
> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>
> For window-related operators, Spark Streaming will cache the data in
> memory within the window. In your case the window size is up to 24 hours,
> which means the data has to stay in executor memory for more than a day;
> this may introduce several problems when memory is not enough.
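A minimal sketch of the caching behaviour described above, assuming a
socket source (the host, port, and window sizes are illustrative, not from
this thread): Spark persists the windowed data by itself, so an explicit
cache() call is only useful when a stream is reused by more than one
computation.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowCachingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowCachingSketch")
    // Batch interval of 5 minutes, matching the discussion below.
    val ssc = new StreamingContext(conf, Minutes(5))

    // Illustrative source; any receiver-based source behaves the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The windowed stream is persisted by Spark itself, so no explicit
    // cache() call is needed for the window to work.
    val lastHour = lines.window(Minutes(60), Minutes(5))
    lastHour.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}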
On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:

> OK, terms for Spark Streaming.
>
> "Batch interval" is the basic interval at which the system will receive
> the data in batches. This is the interval set when creating a
> StreamingContext. For example, if you set the batch interval as 300
> seconds, then any input DStream will generate RDDs of received data at
> 300-second intervals.
>
> A window operator is defined by two parameters:
> - windowDuration / windowLength: the length of the window
> - slideDuration / slidingInterval: the interval at which the window will
>   slide or move forward
>
> OK, so your batch interval is 5 minutes. That is the rate at which
> messages are coming in from the source.
>
> Then you have these two params:
>
> // n is the batch interval, set when creating the context:
> // val ssc = new StreamingContext(sparkConf, Seconds(n))
>
> // window length: the duration of the window; a multiple m of the batch
> // interval n
> val windowLength = Seconds(m * n)
>
> // sliding interval: the interval at which the window operation is
> // performed, in other words how often data collected within the window
> // is processed; also a multiple of the batch interval
> val slidingInterval = Seconds(k * n)
>
> Both the window length and the sliding interval must be multiples of the
> batch interval, as received data is divided into batches of duration
> "batch interval".
>
> If you want to collect 1 hour of data, then windowLength = 12 * 5 * 60
> seconds.
> If you want to collect 24 hours of data, then windowLength = 24 * 12 * 5
> * 60 seconds.
>
> Your sliding interval should be set to the batch interval = 5 * 60
> seconds. In other words, that is where the aggregates and summaries for
> your report come from.
>
> What is your data source here?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote:
>
> We have some stream data that needs to be calculated, and we are
> considering using Spark Streaming to do it.
>
> We need to generate three kinds of reports, based on:
>
> 1. The last 5 minutes of data
> 2. The last 1 hour of data
> 3. The last 24 hours of data
>
> The frequency of the reports is 5 minutes.
>
> After reading the docs, the most obvious way to solve this seems to be
> to set up a Spark stream with a 5-minute interval and two windows of
> 1 hour and 1 day.
>
> But I am worried that one-hour and one-day windows may be too big. I do
> not have much experience with Spark Streaming, so what window lengths do
> you use in your environment?
>
> Are there any official docs discussing this?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
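Putting Mich's numbers together, here is a minimal sketch of the setup
asked about in this thread: a 5-minute batch interval, with 1-hour and
24-hour windows that slide every 5 minutes. The socket source, host, and
port are illustrative assumptions; the thread never names the data source.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object ThreeReportsSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("ThreeReportsSketch")
    // Batch interval = 5 minutes: the rate at which batches are generated.
    val ssc = new StreamingContext(sparkConf, Minutes(5))

    // Illustrative source; replace with Kafka or whatever the real one is.
    val events = ssc.socketTextStream("localhost", 9999)

    // Report 1: the last 5 minutes, i.e. each batch as it arrives.
    events.count().print()

    // Report 2: the last hour, recomputed every 5 minutes.
    // windowLength = 12 * 5 minutes; slidingInterval = the batch interval.
    events.window(Minutes(60), Minutes(5)).count().print()

    // Report 3: the last 24 hours, recomputed every 5 minutes. This keeps
    // a full day of batches in executor memory, which is the concern
    // Saisai raises above.
    events.window(Minutes(1440), Minutes(5)).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

If a full day of batches will not fit in memory, one option (an assumption
on my part, not something settled in the thread) is to persist the 24-hour
windowed stream with .persist(StorageLevel.MEMORY_AND_DISK_SER) so that
partitions spill to disk rather than fail.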