It seems StreamingContext has a function that controls how long each DStream remembers the RDDs it has generated:

    def remember(duration: Duration) { graph.remember(duration) }
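A minimal sketch of how that could be applied to Q1's late-data scenario (my own illustration, not from the thread; the socket source, host, port, and the 24-hour figure are placeholders): calling ssc.remember(...) before starting the context asks Spark Streaming to hold on to each batch's RDDs for that long, at the cost of the extra memory/disk that implies.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

    object RememberSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("remember-sketch").setMaster("local[2]")
        // One-minute batches, to match the per-minute aggregates in Q1.
        val ssc = new StreamingContext(conf, Seconds(60))

        // Keep each DStream's generated RDDs for 24 hours instead of letting
        // Spark clean them up as soon as no computation needs them any more.
        // A day of batches can be a lot of state, so this is a trade-off.
        ssc.remember(Minutes(24 * 60))

        // Placeholder source and output; the real pipeline would go here.
        ssc.socketTextStream("localhost", 9999).print()

        ssc.start()
        ssc.awaitTermination()
      }
    }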
In my opinion, incremental reduce means the following (there is a short Spark Streaming sketch of it after the quoted message):

    values:                         1 2 3 4 5 6
    window_size = 5
    sum_of_first_window           = 1+2+3+4+5 = 15
    sum_of_second_window_method_1 = 2+3+4+5+6 = 20
    sum_of_second_window_method_2 = 15+6-1    = 20

On Wed, Feb 19, 2014 at 3:45 PM, Adrian Mocanu <amoc...@verticalscope.com> wrote:

> I have a question on the following paper,
> "Discretized Streams: Fault-Tolerant Streaming Computation at Scale",
> written by Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter,
> Scott Shenker, and Ion Stoica, available at
> http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
>
> Specifically, I'm interested in Section 3.2 on page 5, called "Timing
> Considerations". This section talks about external timestamps. I'm looking
> to use method 2 and correct for late records at the application level.
>
> The paper says: "[application] could output a new count for time interval
> [t, t+1) at time t+5, based on the records for this interval received
> between t and t+5. This computation can be performed with an efficient
> incremental reduce operation that adds the old counts computed at t+1 to
> the counts of new records since then, avoiding wasted work."
>
> Q1:
> If my data comes 24 hours late on the same DStream (i.e. t = t0 + 24 hr) and
> I'm recording per-minute aggregates, wouldn't the RDD with the data that
> came 24 hours ago already have been deleted from disk by Spark? (I'd hope
> so, otherwise it runs out of space.)
>
> Q2:
> The paper talks about an "incremental reduce". I'd like to know what it is.
> I do use reduce to get an aggregate of counts. What is this incremental
> reduce?
>
> Thanks
> -A

--
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A. 43210
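To make the arithmetic above concrete, here is a minimal Spark Streaming sketch of the incremental form (my own illustration, not from the thread; the socket source, host, port, and checkpoint path are placeholders). reduceByKeyAndWindow with both a reduce function and an inverse reduce function is the incremental reduce: each new window value is the previous window value, plus what entered the window, minus what left it, i.e. the 15+6-1 = 20 step above.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._  // pair-DStream ops on older Spark versions

    object IncrementalReduceSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("incremental-reduce-sketch").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(1))
        // The inverse-function variant needs checkpointing; path is illustrative.
        ssc.checkpoint("/tmp/incremental-reduce-checkpoint")

        // Placeholder source: one word per line from a socket.
        val counts = ssc.socketTextStream("localhost", 9999)
          .map(word => (word, 1L))
          .reduceByKeyAndWindow(
            (a: Long, b: Long) => a + b,  // fold in values entering the window
            (a: Long, b: Long) => a - b,  // subtract values leaving the window
            Seconds(5),                   // window length (the window_size = 5 above)
            Seconds(1))                   // slide interval

        counts.print()
        ssc.start()
        ssc.awaitTermination()
      }
    }

Note that the inverse-function variant requires checkpointing to be enabled and only makes sense when the reduction is invertible (sums and counts are; max/min are not).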