Hi Adrian,

On every timestep of execution, we receive new data, then report updated word counts for that new data plus the past 30 seconds. The latency here is about how quickly you get these updated counts once the new batch of data comes in. It’s true that the count also reflects some data from 30 seconds ago, but that doesn’t mean the overall processing latency is 30 seconds.
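To make the distinction concrete, here is a minimal sketch in plain Python (not the actual Spark Streaming API) of an incrementally maintained sliding-window count, assuming one batch per second and a 30-batch window; the class and method names are illustrative only:

```python
from collections import Counter, deque

class SlidingWordCount:
    """Illustrative sliding-window word count over the last N batches."""

    def __init__(self, window=30):
        self.window = window
        self.batches = deque()   # per-batch Counters, oldest first
        self.totals = Counter()  # running counts over the current window

    def on_batch(self, words):
        """Ingest one batch and return updated window counts.

        The work per timestep is proportional to the new batch (plus the
        one batch falling out of the window), so updated results appear
        within one batch interval -- the 30 s window length does not add
        to this latency.
        """
        new = Counter(words)
        self.batches.append(new)
        self.totals += new
        if len(self.batches) > self.window:
            expired = self.batches.popleft()
            self.totals -= expired  # drop data older than the window
        return dict(self.totals)
```

Each call to `on_batch` returns counts incorporating the batch that just arrived, which is the sense in which the results have sub-second latency even though they cover 30 seconds of data:

```python
wc = SlidingWordCount(window=3)
wc.on_batch(["a", "b"])  # {"a": 1, "b": 1}
wc.on_batch(["a"])       # {"a": 2, "b": 1} -- updated immediately
```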
Matei

On Mar 20, 2014, at 1:36 PM, Adrian Mocanu <amoc...@verticalscope.com> wrote:

> I looked over the specs on page 9 from
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf
> The first paragraph mentions the window size is 30 seconds: “Word-Count, which
> performs a sliding window count over 30s; and TopKCount, which finds the k
> most frequent words over the past 30s.”
>
> The second paragraph mentions subsecond latency.
>
> Putting these two together, is the paper saying that in the 30 sec window the
> tuples are delayed at most 1 second?
>
> The paper explains: “By ‘end-to-end latency,’ we mean the time from when
> records are sent to the system to when results incorporating them appear.”
> This leads me to conclude that end-to-end latency for a 30 sec window should
> be at least 30 seconds, because results won’t be incorporated until the entire
> window is completed, i.e. 30 sec. At the same time the paper claims latency is
> sub-second, so clearly I’m misunderstanding something.
>
> -Adrian