>
> Thanks so much for your explanation,
> I would like to ask more specifically:
>
>
> 1. About batchInterval:
> Given this pseudocode block:
>
> inputStream = readStream("http://...", "2s")
> //start Block_1
> ones = inputStream.map(event => (event.url, 1))
> counts = ones.runningReduce((a, b) => a + b)
> //end Block_1
>
> So, does that mean Block_1 runs every 2 seconds (I think of this case roughly 
> as: while(true) { execute Block_1; sleep(2s); })?
> And would the lifecycle of the DStream inputStream look like the following?
> @t_0: execute Block_1 with inputStream = [RDD@t_0]
> @t_2: execute Block_1 with inputStream = [RDD@t_0, RDD@t_2]
> @t_4: execute Block_1 with inputStream = [RDD@t_0, RDD@t_2, RDD@t_4]
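>
> (To make the question concrete, here is a minimal Scala sketch of what I have 
> in mind; the socket source, the port, and the updateStateByKey-based running 
> count are only my assumptions, not the exact code I run:)
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops on older Spark
>
> val conf = new SparkConf().setAppName("RunningCount").setMaster("local[2]")
> val ssc  = new StreamingContext(conf, Seconds(2))      // batch interval = 2 seconds
> ssc.checkpoint("/tmp/checkpoint")                       // required by updateStateByKey
>
> // stand-in for the HTTP source; each received line is treated as a URL
> val inputStream = ssc.socketTextStream("localhost", 9999)
> val ones   = inputStream.map(url => (url, 1))
> // running count across batches (what I called runningReduce above)
> val counts = ones.updateStateByKey[Int]((values: Seq[Int], state: Option[Int]) =>
>   Some(values.sum + state.getOrElse(0)))
> counts.print()
> ssc.start()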
>
>
> 2. About window method:
> Given this pseudocode block:
>
> inputStream = readStream("http://...", "2s")
> windowStream = inputStream.window(Seconds(8), Seconds(4))
> //start Block_2
> ones = windowStream.map(event => (event.url, 1))
> counts = ones.runningReduce((a, b) => a + b)
> //end Block_2
>
> So, does that mean Block_2 runs every 4 seconds (I think of this case roughly 
> as: while(true) { execute Block_2; sleep(4s); })?
> And would the lifecycle of the DStream windowStream look like the following?
> @t_0 -> @t_7: no window completed yet, so Block_2 does not execute.
> @t_8: execute Block_2 with windowStream = [RDD@t_0->8]
> @t_12: execute Block_2 with windowStream = [RDD@t_0->8, RDD@t_4->12]
> @t_16: execute Block_2 with windowStream = [RDD@t_0->8, RDD@t_4->12, 
> RDD@t_8->16]
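>
> (Again a concrete sketch, under the same assumptions as in question 1, with an 
> 8-second window sliding every 4 seconds; here I use reduceByKey for a 
> per-window count rather than a running total:)
>
> // same ssc (batch interval 2s) and inputStream as in the sketch above
> val windowStream = inputStream.window(Seconds(8), Seconds(4))
> val ones   = windowStream.map(url => (url, 1))
> val counts = ones.reduceByKey(_ + _)   // counts per 8-second window
> counts.print()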
>
>
> 3. About the realtime data store
> I have read a paper about a realtime datastore named Druid (from 
> Metamarkets, if I remember correctly). 
> In short, Druid = streaming engine (Kafka + Storm) + ZooKeeper + realtime 
> nodes + historical nodes (it is said that Druid can query 6 TB of in-memory 
> data in 1.4 seconds).
> Then the question is: if I want to build a realtime data store solution 
> on top of the Spark platform (or BDAS), what is the recommended model?
>
>
Looking forward to your help.

> On Thursday, January 2, 2014 5:55:26 PM UTC+7, TD wrote:
> Hello, 
>
> "Batch interval" is the basic interval at which the system with receive 
> the data in batches. This is the interval set when creating a 
> StreamingContext. For example, if you set the batch interval as 2 second, 
> then any input DStream will generate RDDs of received data at 2 second 
> intervals. 
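>
> (For instance, a rough sketch; the app name and master here are just 
> placeholders:)
>
> import org.apache.spark.SparkConf
> import org.apache.spark.streaming.{Seconds, StreamingContext}
>
> // batch interval of 2 seconds: every input DStream created from this context
> // will generate one RDD of received data every 2 seconds
> val conf = new SparkConf().setAppName("Example").setMaster("local[2]")
> val ssc  = new StreamingContext(conf, Seconds(2))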
>
> A window operation is defined by two parameters: 
> - WindowDuration - the length of the window
> - SlideDuration - the interval at which the window slides or moves 
> forward
> It's a bit hard to explain the sliding of a window in words, so slides may 
> be more useful. Take a look at slides 27 - 29 in the attached slides. 
>
> Both the window duration and the slide duration must be multiples of the 
> batch interval, since received data is divided into batches of duration "batch 
> interval". Let's take an example. Suppose we have a batch interval of 2 
> seconds and we have defined an input stream:
>
> val inputStream = ssc.socketStream(...)
>
> This inputStream will generate RDDs every 2 seconds, each containing the last 2 
> seconds of data. Now say we define a few window operations on it. A window 
> operation is defined as DStream.window(<window duration>, <slide 
> duration>):
>
> val windowStream1 = inputStream.window(Seconds(4))
> val windowStream2 = inputStream.window(Seconds(4), Seconds(2))
> val windowStream3 = inputStream.window(Seconds(10), Seconds(4))
> val windowStream4 = inputStream.window(Seconds(10), Seconds(10))
> val windowStream5 = inputStream.window(Seconds(2), Seconds(2))    // same 
> as inputStream
> val windowStream6 = inputStream.window(Seconds(11), Seconds(2))   // 
> invalid
> val windowStream7 = inputStream.window(Seconds(4), Seconds(1))    // 
> invalid
>
>
> Both windowStream1 and windowStream2 will generate RDDs containing data 
> from the last 4 seconds, and the RDDs will be generated every 2 seconds (if the 
> slide duration is not specified, as in windowStream1, then the slide 
> duration is assumed to be inputStream's batch duration = 2 sec). Note that 
> these windows of data overlap. The window RDD at time 10 will 
> contain data from times 6 to 10 (i.e. slightly after 6 to the end of 10), and 
> the window RDD at time 12 will contain data from 8 to 12. 
>
> Similarly, windowStream3 will generate RDDs every 4 seconds, each 
> containing data from the last 10 seconds. windowStream4 will generate 
> non-overlapping windows, that is, RDDs every 10 seconds, each containing data 
> from the last 10 seconds. windowStream5 is essentially the same as the 
> inputStream.
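>
> (As a sketch of a typical computation on such a window - say, a count over the 
> last 10 seconds recomputed every 4 seconds - one could also use 
> reduceByKeyAndWindow; the word-splitting here is just a placeholder for 
> whatever the records contain:)
>
> val words  = inputStream.flatMap(_.split(" "))
> val counts = words.map(w => (w, 1))
>   .reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(10), Seconds(4))
> counts.print()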
>
> windowStream6 and windowStream7 are invalid because one of the two 
> parameters is not a multiple of the batch interval, that is, 2 seconds. 
> This is how the three durations (batch interval, window duration, slide 
> duration) are related. 
>
> Hope that helped. Note that I did simplify a few details that are 
> important when you want to define window operations over windowed streams. 
> I am ignoring them for now. Feel free to ask more specific questions.
>
> TD
>
> On Thu, Jan 2, 2014 at 1:59 AM, ngoc linh <[email protected]> wrote:
> Dear Matei, Tathagata,
>
> Could you guys provide me an example with an explanation of the sliding window 
> (in Spark Streaming)? I am confused about the relationship between 
> batchInterval, windowDuration, and slideDuration.