Please see the inline comments.
On Mon, May 9, 2016 at 5:31 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:

> Thank you.
>
> So if I create a Spark stream then:
>
> 1. Will the streams always need to be cached? Can they not be stored in
> persistent storage?

You don't need to cache the stream explicitly unless you have a specific
requirement; Spark will do it for you depending on the streaming source
(Kafka or socket).

> 2. Will the cached stream data be distributed among all Spark nodes,
> across the executors?
> 3. As I understand it, each Spark worker node has one executor, which
> includes a cache, so the streaming data is distributed among these worker
> node caches. For example, if I have 4 worker nodes, each cache will hold
> a quarter of the data (this assumes the cache size is the same on every
> worker node).

Ideally it will be distributed evenly across the executors; this is also a
target for tuning. Normally it depends on several conditions, such as
receiver distribution and partition distribution.

> The issue arises if the amount of streaming data does not fit into these
> 4 caches. Will the job crash?
>
> On Monday, 9 May 2016, 10:16, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> No, each executor only stores part of the data in memory (it depends on
> how the partitions are distributed and how many receivers you have).
>
> For a WindowedDStream, it will obviously cache the data in memory; from
> my understanding you don't need to call cache() again.
>
> On Mon, May 9, 2016 at 5:06 PM, Ashok Kumar <ashok34...@yahoo.com> wrote:
>
> Hi,
>
> So if I have 10 GB of streaming data coming in, does it require 10 GB of
> memory in each node?
>
> Also, in that case, why do we need to use dstream.cache()?
>
> Thanks
>
> On Monday, 9 May 2016, 9:58, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> It depends on how you write the Spark application. Normally, if the data
> is already on persistent storage, there is no need to put it into memory.
> The reason Spark Streaming data has to be stored in memory is that a
> streaming source is not a persistent source, so you need a place to store
> the data.
>
> On Mon, May 9, 2016 at 4:43 PM, 李明伟 <kramer2...@126.com> wrote:
>
> Thanks.
> What if I use batch calculation instead of stream computing? Do I still
> need that much memory? For example, if the 24-hour data set is 100 GB, do
> I also need 100 GB of RAM to do the one-time batch calculation?
>
> At 2016-05-09 15:14:47, "Saisai Shao" <sai.sai.s...@gmail.com> wrote:
>
> For window-related operators, Spark Streaming will cache the data in
> memory within the window. In your case the window size is up to 24 hours,
> which means the data has to stay in executor memory for more than a day;
> this may introduce several problems when memory is not enough.
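A minimal sketch of the caching behaviour described above, assuming a
socket source (the host, port, and window sizes are illustrative, not from
this thread): Spark persists the windowed data by itself, so an explicit
cache() call is only useful when a stream is reused by more than one
computation.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object WindowCachingSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WindowCachingSketch")
    // Batch interval of 5 minutes, matching the discussion below.
    val ssc = new StreamingContext(conf, Minutes(5))

    // Illustrative source; any receiver-based source behaves the same way.
    val lines = ssc.socketTextStream("localhost", 9999)

    // The windowed stream is persisted by Spark itself, so no explicit
    // cache() call is needed for the window to work.
    val lastHour = lines.window(Minutes(60), Minutes(5))
    lastHour.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}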
On Mon, May 9, 2016 at 3:01 PM, Mich Talebzadeh
<mich.talebza...@gmail.com> wrote:

> OK, terms for Spark Streaming.
>
> "Batch interval" is the basic interval at which the system will receive
> the data in batches. This is the interval set when creating a
> StreamingContext. For example, if you set the batch interval as 300
> seconds, then any input DStream will generate RDDs of received data at
> 300-second intervals.
>
> A window operator is defined by two parameters:
> - windowDuration / windowLength: the length of the window
> - slideDuration / slidingInterval: the interval at which the window will
>   slide or move forward
>
> OK, so your batch interval is 5 minutes. That is the rate at which
> messages are coming in from the source.
>
> Then you have these two params:
>
> // n is the batch interval, set when creating the context:
> // val ssc = new StreamingContext(sparkConf, Seconds(n))
>
> // window length: the duration of the window; a multiple m of the batch
> // interval n
> val windowLength = Seconds(m * n)
>
> // sliding interval: the interval at which the window operation is
> // performed, in other words how often data collected within the window
> // is processed; also a multiple of the batch interval
> val slidingInterval = Seconds(k * n)
>
> Both the window length and the sliding interval must be multiples of the
> batch interval, as received data is divided into batches of duration
> "batch interval".
>
> If you want to collect 1 hour of data, then windowLength = 12 * 5 * 60
> seconds.
> If you want to collect 24 hours of data, then windowLength = 24 * 12 * 5
> * 60 seconds.
>
> Your sliding interval should be set to the batch interval = 5 * 60
> seconds. In other words, that is where the aggregates and summaries for
> your report come from.
>
> What is your data source here?
>
> HTH
>
> Dr Mich Talebzadeh
>
> LinkedIn:
> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>
> http://talebzadehmich.wordpress.com
>
> On 9 May 2016 at 04:19, kramer2...@126.com <kramer2...@126.com> wrote:
>
> We have some stream data that needs to be calculated, and we are
> considering using Spark Streaming to do it.
>
> We need to generate three kinds of reports, based on:
>
> 1. The last 5 minutes of data
> 2. The last 1 hour of data
> 3. The last 24 hours of data
>
> The frequency of the reports is 5 minutes.
>
> After reading the docs, the most obvious way to solve this seems to be
> to set up a Spark stream with a 5-minute interval and two windows of
> 1 hour and 1 day.
>
> But I am worried that one-hour and one-day windows may be too big. I do
> not have much experience with Spark Streaming, so what window lengths do
> you use in your environment?
>
> Are there any official docs discussing this?
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-big-the-spark-stream-window-could-be-tp26899.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
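Putting Mich's numbers together, here is a minimal sketch of the setup
asked about in this thread: a 5-minute batch interval, with 1-hour and
24-hour windows that slide every 5 minutes. The socket source, host, and
port are illustrative assumptions; the thread never names the data source.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

object ThreeReportsSketch {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("ThreeReportsSketch")
    // Batch interval = 5 minutes: the rate at which batches are generated.
    val ssc = new StreamingContext(sparkConf, Minutes(5))

    // Illustrative source; replace with Kafka or whatever the real one is.
    val events = ssc.socketTextStream("localhost", 9999)

    // Report 1: the last 5 minutes, i.e. each batch as it arrives.
    events.count().print()

    // Report 2: the last hour, recomputed every 5 minutes.
    // windowLength = 12 * 5 minutes; slidingInterval = the batch interval.
    events.window(Minutes(60), Minutes(5)).count().print()

    // Report 3: the last 24 hours, recomputed every 5 minutes. This keeps
    // a full day of batches in executor memory, which is the concern
    // Saisai raises above.
    events.window(Minutes(1440), Minutes(5)).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}

If a full day of batches will not fit in memory, one option (an assumption
on my part, not something settled in the thread) is to persist the 24-hour
windowed stream with .persist(StorageLevel.MEMORY_AND_DISK_SER) so that
partitions spill to disk rather than fail.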