I had a look at the thread. This is what you have, which I gather is a standalone box, in other words one worker node:

bin/spark-submit --master spark://ES01:7077 --executor-memory 4G --num-executors 1 --total-executor-cores 1 ./latest5min.py 1>a.log 2>b.log

But what I don't understand is why it is using 80% of your RAM as opposed to 25% of it (4GB/16GB), right? Where else have you set up these parameters, for example in $SPARK_HOME/conf/spark-env.sh?
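Besides spark-env.sh (e.g. SPARK_WORKER_MEMORY) and conf/spark-defaults.conf, it is also worth checking whether the script itself sets these properties, since anything set on the SparkConf in the code takes precedence over the spark-submit flags. A hypothetical example of what to look for in latest5min.py (the values here are made up):

from pyspark import SparkConf, SparkContext

# hypothetical settings - anything like this in the application would
# override --executor-memory given on the spark-submit command line
conf = (SparkConf()
        .setAppName("latest5min")
        .set("spark.executor.memory", "12g")              # made-up value; would explain usage well above 4G
        .set("spark.storage.memoryFraction", "0.6"))      # fraction of executor heap used for caching (Spark 1.x)
sc = SparkContext(conf=conf)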
Can you send the output of /usr/bin/free and top?

HTH

Dr Mich Talebzadeh

LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
http://talebzadehmich.wordpress.com


On 9 May 2016 at 16:19, 李明伟 <kramer2...@126.com> wrote:

> Thanks for all the information, guys.
>
> I wrote some code to do the test, not using a window, so it only calculates
> data for each batch interval. I set the interval to 30 seconds and also reduced
> the size of the data to about 30,000 lines of CSV. That means my code should do
> its calculation on 30,000 lines of CSV within 30 seconds, which I think is not
> a very heavy workload, but my Spark Streaming code still crashes.
>
> I sent another post to the user list here:
> http://apache-spark-user-list.1001560.n3.nabble.com/Why-I-have-memory-leaking-for-such-simple-spark-stream-code-td26904.html
>
> Is it possible for you to have a look please? Much appreciated.
>
>
> At 2016-05-09 17:49:22, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>
> Please see the inline comments.
>
> On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>
>> Thank you.
>>
>> So if I create a Spark stream then:
>>
>> 1. The streams will always need to be cached? They cannot be stored in
>> persistent storage?
>
> You don't need to cache the stream explicitly if you don't have a specific
> requirement; Spark will do it for you, depending on the streaming source
> (Kafka or socket).
>
>> 2. The stream data that is cached will be distributed among all nodes of
>> Spark, among the executors.
>> 3. As I understand it, each Spark worker node has one executor that
>> includes a cache, so the streaming data is distributed among these worker
>> node caches. For example, if I have 4 worker nodes, each cache will have a
>> quarter of the data (this assumes that the cache size among worker nodes is
>> the same).
>
> Ideally it will be distributed evenly across the executors; this is also a
> target for tuning. Normally it depends on several conditions, like receiver
> distribution and partition distribution.
>
>> The issue arises if the amount of streaming data does not fit into these
>> 4 caches. Will the job crash?
>>
>>
>> On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>
>> No, each executor only stores part of the data in memory (it depends on how
>> the partitions are distributed and how many receivers you have).
>>
>> For WindowedDStream, it will obviously cache the data in memory; from my
>> understanding you don't need to call cache() again.
>>
>> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>>
>> hi,
>>
>> so if i have 10gb of streaming data coming in, does it require 10gb of
>> memory in each node?
>>
>> also, in that case, why do we need to use
>>
>> dstream.cache()
>>
>> thanks
>>
>>
>> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>>
>> It depends on how you write the Spark application. Normally, if the data is
>> already on persistent storage, there is no need to put it into memory.
>> The reason why Spark Streaming has to keep data in memory is that a streaming
>> source is not a persistent source, so you need a place to store the received
>> data.
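>>
>> As a rough illustration of that difference (PySpark; the HDFS path and
>> host/port are made up, and sc is assumed to be an existing SparkContext):
>>
>> from pyspark.streaming import StreamingContext
>>
>> # batch job: the input already sits on persistent storage (HDFS), so it
>> # can be re-read from disk as needed and does not have to live in memory
>> batch_lines = sc.textFile("hdfs://namenode:8020/data/one_day.csv")
>>
>> # streaming job: a socket source is not persistent, so the receiver has to
>> # store incoming blocks (in memory, spilling to disk) until they are processed
>> ssc = StreamingContext(sc, 300)
>> stream_lines = ssc.socketTextStream("host", 9999)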
>>
>> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:
>>
>> Thanks.
>> What if I use batch calculation instead of stream computing? Do I still
>> need that much memory? For example, if the 24-hour data set is 100 GB, do I
>> also need 100 GB of RAM to do the one-time batch calculation?
>>
>>
>> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>>
>> For window-related operators, Spark Streaming will cache the data for the
>> window in memory. In your case the window size is up to 24 hours, which means
>> the data has to stay in the executors' memory for more than a day; this may
>> introduce several problems when memory is not enough.
>>
>> On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>> OK, terms for Spark Streaming:
>>
>> "Batch interval" is the basic interval at which the system will receive
>> the data in batches.
>> This is the interval set when creating a StreamingContext. For example,
>> if you set the batch interval to 300 seconds, then any input DStream will
>> generate RDDs of received data at 300-second intervals.
>> A window operator is defined by two parameters:
>> - WindowDuration / WindowLength - the length of the window
>> - SlideDuration / SlidingInterval - the interval at which the window will
>>   slide or move forward
>>
>> OK, so your batch interval is 5 minutes. That is the rate at which messages
>> are coming in from the source.
>>
>> Then you have these two parameters:
>>
>> // window length - the duration of the window; must be a multiple of the
>> // batch interval n in StreamingContext(sparkConf, Seconds(n))
>> val windowLength = x        // x = m * n
>> // sliding interval - the interval at which the window operation is
>> // performed, in other words the interval over which data is collected
>> val slidingInterval = y     // y such that x/y is a whole number
>>
>> Both the window length and the sliding interval must be multiples of the
>> batch interval, as received data is divided into batches of duration
>> "batch interval".
>>
>> If you want to collect 1 hour of data then windowLength = 12 * 5 * 60 seconds.
>> If you want to collect 24 hours of data then windowLength = 24 * 12 * 5 * 60.
>>
>> Your sliding interval should be set to the batch interval = 5 * 60 seconds.
>> In other words, that is where the aggregates and summaries come from for
>> your report.
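>>
>> To make that concrete, a minimal PySpark sketch along those lines (assuming
>> a socket source; host and port are made up, sc is an existing SparkContext,
>> and the count()/pprint() calls are only placeholders for your real reports):
>>
>> from pyspark.streaming import StreamingContext
>>
>> # batch interval n = 5 minutes
>> ssc = StreamingContext(sc, 5 * 60)
>> lines = ssc.socketTextStream("host", 9999)
>>
>> # last 1 hour, recomputed every batch: windowLength = 12 * 5 * 60, slide = 5 * 60
>> last_hour = lines.window(12 * 5 * 60, 5 * 60)
>>
>> # last 24 hours: windowLength = 24 * 12 * 5 * 60, slide = 5 * 60
>> last_day = lines.window(24 * 12 * 5 * 60, 5 * 60)
>>
>> last_hour.count().pprint()   # placeholder output operation
>> last_day.count().pprint()    # placeholder output operation
>>
>> ssc.start()
>> ssc.awaitTermination()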
>>
>> What is your data source here?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote:
>>
>> We have some stream data that needs to be calculated, and we are considering
>> using Spark Streaming to do it.
>>
>> We need to generate three kinds of reports. The reports are based on:
>>
>> 1. The last 5 minutes of data
>> 2. The last 1 hour of data
>> 3. The last 24 hours of data
>>
>> The frequency of the reports is 5 minutes.
>>
>> After reading the docs, the most obvious way to solve this seems to be to set
>> up a Spark stream with a 5-minute interval and two windows of 1 hour and 1 day.
>>
>> But I am worried that windows of one hour and one day may be too big. I do
>> not have much experience with Spark Streaming, so what window lengths do you
>> use in your environment?
>>
>> Are there any official docs talking about this?
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html