It seems StreamingContext has a remember() function that controls how long it keeps generated RDDs:

  def remember(duration: Duration) {
    graph.remember(duration)
  }
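
If I understand it correctly, calling remember() with a long enough duration should keep the generated RDDs around so that late records can still be folded into the old per-minute counts (your Q1). A rough sketch, assuming one-minute batches; the application name and the 25-hour retention window are just placeholders I picked for illustration:

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

  val conf = new SparkConf().setAppName("late-data-example")
  val ssc = new StreamingContext(conf, Seconds(60))  // one-minute batches

  // Keep generated RDDs for 25 hours so records arriving up to 24 hours
  // late can still be combined with the old per-minute aggregates.
  ssc.remember(Minutes(25 * 60))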

As I understand it, incremental reduce means something like this:

values: 1 2 3 4 5 6
window_size = 5
sum_of_first_window = 1+2+3+4+5 = 15
sum_of_second_window (full recompute) = 2+3+4+5+6 = 20
sum_of_second_window (incremental)    = 15 + 6 - 1 = 20
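
Spark Streaming exposes this pattern through the reduceByKeyAndWindow variant that takes an inverse reduce function: values sliding into the window are added and values sliding out are subtracted, instead of re-summing the whole window on every batch. A rough sketch, assuming a DStream of (key, count) pairs named `counts` (checkpointing must be enabled for this variant):

  import org.apache.spark.streaming.Seconds

  // `counts` is assumed to be a DStream[(String, Int)].
  val windowedCounts = counts.reduceByKeyAndWindow(
    (a: Int, b: Int) => a + b,   // add values entering the window
    (a: Int, b: Int) => a - b,   // subtract values leaving the window
    Seconds(5),                  // window length
    Seconds(1)                   // slide interval
  )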


On Wed, Feb 19, 2014 at 3:45 PM, Adrian Mocanu <amoc...@verticalscope.com> wrote:

>  I have a question on the following paper
>
> "Discretized Streams: Fault-Tolerant Streaming Computation at Scale"
>
> written by
>
> Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker,
> Ion Stoica
>
> and available at
>
> http://www.cs.berkeley.edu/~matei/papers/2013/sosp_spark_streaming.pdf
>
>
>
> Specifically I'm interested in Section 3.2 on page 5 called "Timing
> Considerations".
>
> This section talks about external timestamps. I'm looking to use method 2
> and correct for late records at the application level.
>
> The paper says "[application] could output a new count for time interval
> [t, t+1) at time t+5, based on the records for this interval received
> between t and t+5. This computation can be performed with an efficient
> incremental reduce operation that adds the old counts computed at t+1 to
> the counts of new records since then, avoiding wasted work."
>
>
>
> Q1:
>
> If my data comes 24 hours late on the same DStream (i.e. t = t0 + 24h) and
> I'm recording per-minute aggregates, wouldn't the RDD holding the data from
> 24 hours ago already have been deleted from disk by Spark? (I'd hope so,
> otherwise it would run out of space.)
>
>
>
> Q2:
>
> The paper talks about "incremental reduce". I do use reduce to get an
> aggregate of counts, but what is this incremental reduce?
>
>
>
> Thanks
>
> -A
>



-- 
Dachuan Huang
Cellphone: 614-390-7234
2015 Neil Avenue
Ohio State University
Columbus, Ohio
U.S.A.
43210
