I had a look at the thread. This is what you have, which I gather is a standalone box, in other words one worker node:

bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log

But what I don't understand is why it is using 80% of your RAM as opposed to 25% of it (4GB/16GB), right? Where else have you set up these parameters, for example in $SPARK_HOME/conf/spark-env.sh?
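Besides spark-env.sh (e.g. SPARK_WORKER_MEMORY) and conf/spark-defaults.conf, it is also worth checking whether the script itself sets these properties, since anything set on the SparkConf in the code takes precedence over the spark-submit flags. A hypothetical example of what to look for in latest5min.py (the values here are made up):

from pyspark import SparkConf, SparkContext

# hypothetical settings - anything like this in the application would
# override --executor-memory given on the spark-submit command line
conf = (SparkConf()
        .setAppName("latest5min")
        .set("spark.executor.memory", "12g")              # made-up value; would explain usage well above 4G
        .set("spark.storage.memoryFraction", "0.6"))      # fraction of executor heap used for caching (Spark 1.x)
sc = SparkContext(conf=conf)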
Can you send the output of /usr/bin/free and top?

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com


On 9 May 2016 at 16:19, 李明伟 <kramer2...@126.com> wrote:

> Thanks for all the information, guys.
>
> I wrote some code to do the test, not using a window, so it only calculates
> data for each batch interval. I set the interval to 30 seconds and also reduced
> the size of the data to about 30,000 lines of CSV. That means my code should do
> its calculation on 30,000 lines of CSV within 30 seconds, which I think is not
> a very heavy workload, but my Spark Streaming code still crashes.
>
> I sent another post to the user list here:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-td26904.html
>
> Is it possible for you to have a look please? Much appreciated.
>
>
> At 2016-05-09 17:49:22, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>
> Please see the inline comments.
>
> On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>
>> Thank you.
>>
>> So if I create a Spark stream then:
>>
>> 1. The streams will always need to be cached? They cannot be stored in
>> persistent storage?
>
> You don't need to cache the stream explicitly if you don't have a specific
> requirement; Spark will do it for you, depending on the streaming source
> (Kafka or socket).
>
>> 2. The stream data that is cached will be distributed among all nodes of
>> Spark, among the executors.
>> 3. As I understand it, each Spark worker node has one executor that
>> includes a cache, so the streaming data is distributed among these worker
>> node caches. For example, if I have 4 worker nodes, each cache will have a
>> quarter of the data (this assumes that the cache size among worker nodes is
>> the same).
>
> Ideally it will be distributed evenly across the executors; this is also a
> target for tuning. Normally it depends on several conditions, like receiver
> distribution and partition distribution.
>
>> The issue arises if the amount of streaming data does not fit into these
>> 4 caches. Will the job crash?
>>
>>
>> On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>
>> No, each executor only stores part of the data in memory (it depends on how
>> the partitions are distributed and how many receivers you have).
>>
>> For WindowedDStream, it will obviously cache the data in memory; from my
>> understanding you don't need to call cache() again.
>>
>> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>
>> hi,
>>
>> so if i have 10gb of streaming data coming in, does it require 10gb of
>> memory in each node?
>>
>> also, in that case, why do we need to use
>>
>> dstream.cache()
>>
>> thanks
>>
>>
>> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>
>> It depends on how you write the Spark application. Normally, if the data is
>> already on persistent storage, there is no need to put it into memory.
>> The reason why Spark Streaming has to keep data in memory is that a streaming
>> source is not a persistent source, so you need a place to store the received
>> data.
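>>
>> As a rough illustration of that difference (PySpark; the HDFS path and
>> host/port are made up, and sc is assumed to be an existing SparkContext):
>>
>> from pyspark.streaming import StreamingContext
>>
>> # batch job: the input already sits on persistent storage (HDFS), so it
>> # can be re-read from disk as needed and does not have to live in memory
>> batch_lines = sc.textFile("hdfs://namenode:8020/data/one_day.csv")
>>
>> # streaming job: a socket source is not persistent, so the receiver has to
>> # store incoming blocks (in memory, spilling to disk) until they are processed
>> ssc = StreamingContext(sc, 300)
>> stream_lines = ssc.socketTextStream("host", 9999)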
>>
>> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:
>>
>> Thanks.
>> What if I use batch calculation instead of stream computing? Do I still
>> need that much memory? For example, if the 24-hour data set is 100 GB, do I
>> also need 100 GB of RAM to do the one-time batch calculation?
>>
>>
>> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>>
>> For window-related operators, Spark Streaming will cache the data for the
>> window in memory. In your case the window size is up to 24 hours, which means
>> the data has to stay in the executors' memory for more than a day; this may
>> introduce several problems when memory is not enough.
>>
>> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> OK, terms for Spark Streaming:
>>
>> "Batch interval" is the basic interval at which the system will receive
>> the data in batches.
>> This is the interval set when creating a StreamingContext. For example,
>> if you set the batch interval to 300 seconds, then any input DStream will
>> generate RDDs of received data at 300-second intervals.
>> A window operator is defined by two parameters:
>> - WindowDuration / WindowLength - the length of the window
>> - SlideDuration / SlidingInterval - the interval at which the window will
>>   slide or move forward
>>
>> OK, so your batch interval is 5 minutes. That is the rate at which messages
>> are coming in from the source.
>>
>> Then you have these two parameters:
>>
>> // window length - the duration of the window; must be a multiple of the
>> // batch interval n in StreamingContext(sparkConf, Seconds(n))
>> val windowLength = x        // x = m * n
>> // sliding interval - the interval at which the window operation is
>> // performed, in other words the interval over which data is collected
>> val slidingInterval = y     // y such that x/y is a whole number
>>
>> Both the window length and the sliding interval must be multiples of the
>> batch interval, as received data is divided into batches of duration
>> "batch interval".
>>
>> If you want to collect 1 hour of data then windowLength = 12 * 5 * 60 seconds.
>> If you want to collect 24 hours of data then windowLength = 24 * 12 * 5 * 60.
>>
>> Your sliding interval should be set to the batch interval = 5 * 60 seconds.
>> In other words, that is where the aggregates and summaries come from for
>> your report.
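>>
>> To make that concrete, a minimal PySpark sketch along those lines (assuming
>> a socket source; host and port are made up, sc is an existing SparkContext,
>> and the count()/pprint() calls are only placeholders for your real reports):
>>
>> from pyspark.streaming import StreamingContext
>>
>> # batch interval n = 5 minutes
>> ssc = StreamingContext(sc, 5 * 60)
>> lines = ssc.socketTextStream("host", 9999)
>>
>> # last 1 hour, recomputed every batch: windowLength = 12 * 5 * 60, slide = 5 * 60
>> last_hour = lines.window(12 * 5 * 60, 5 * 60)
>>
>> # last 24 hours: windowLength = 24 * 12 * 5 * 60, slide = 5 * 60
>> last_day = lines.window(24 * 12 * 5 * 60, 5 * 60)
>>
>> last_hour.count().pprint()   # placeholder output operation
>> last_day.count().pprint()    # placeholder output operation
>>
>> ssc.start()
>> ssc.awaitTermination()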
>>
>> What is your data source here?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote:
>>
>> We have some stream data that needs to be calculated, and we are considering
>> using Spark Streaming to do it.
>>
>> We need to generate three kinds of reports. The reports are based on:
>>
>> 1. The last 5 minutes of data
>> 2. The last 1 hour of data
>> 3. The last 24 hours of data
>>
>> The frequency of the reports is 5 minutes.
>>
>> After reading the docs, the most obvious way to solve this seems to be to set
>> up a Spark stream with a 5-minute interval and two windows of 1 hour and 1 day.
>>
>> But I am worried that windows of one hour and one day may be too big. I do
>> not have much experience with Spark Streaming, so what window lengths do you
>> use in your environment?
>>
>> Are there any official docs talking about this?
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html