Spark streaming updating a large window more frequently

2015-05-08 Thread Ankur Chauhan
Hi,

I am pretty new to Spark/Spark Streaming, so please excuse my naivety. I have a
stream of timestamped events that I would like to aggregate into, let's say,
hourly buckets. The simple answer is to use a window operation with a window
length of 1 hour and a sliding interval of 1 hour, but that doesn't quite work:

1. The time boundaries aren't aligned. The stream aggregation may start in the
middle of an hour, so the first bucket may actually cover less than a full
hour, and subsequent buckets should be aligned to clock-hour boundaries.
2. If I understand correctly, the above method means all of my data is
collected for an hour and only then summarised. That is correct as far as it
goes, but how do I get the aggregations to occur more frequently than that?
Something like: aggregate these events into hourly buckets, updating them
every 5 seconds (see the sketch below).
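
To make that concrete, here is a rough, untested sketch of what I am after. It
assumes events arrive as "timestampMillis,value" lines on a socket (the host,
port, and checkpoint path are placeholders I made up) and uses updateStateByKey
to keep clock-aligned hourly sums that refresh on every 5-second batch:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("HourlyBuckets")
val ssc = new StreamingContext(conf, Seconds(5))   // 5-second batches
ssc.checkpoint("/tmp/hourly-buckets-checkpoint")   // required by updateStateByKey

// Hypothetical source: "timestampMillis,value" lines on a local socket.
val events = ssc.socketTextStream("localhost", 9999).map { line =>
  val Array(ts, v) = line.split(",")
  (ts.toLong, v.toDouble)
}

// Key each event by the clock hour its timestamp falls into, so buckets
// align to wall-clock hour boundaries no matter when the job starts.
val keyedByHour = events.map { case (ts, v) =>
  (ts - (ts % 3600000L), v)                        // truncate to start of hour
}

// Keep a running sum per hourly bucket, updated on every 5-second batch.
val hourlyTotals = keyedByHour.updateStateByKey[Double] {
  (newValues: Seq[Double], state: Option[Double]) =>
    Some(state.getOrElse(0.0) + newValues.sum)
}

hourlyTotals.print()
ssc.start()
ssc.awaitTermination()

Is something along these lines the right approach, or is there a more
idiomatic way?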

I would really appreciate pointers to code samples or blog posts that could
help me identify best practices.

-- Ankur Chauhan




RE: Spark streaming updating a large window more frequently

2015-05-08 Thread Mohammed Guller
If I understand you correctly, you need a window duration of 1 hour and a
sliding interval of 5 seconds.
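
For example (an untested sketch, assuming `ssc` is your StreamingContext and
`pairs` is the DStream[(Long, Double)] keyed by hour bucket from your example):

import org.apache.spark.streaming.dstream.DStream
import org.apache.spark.streaming.{Minutes, Seconds}

ssc.checkpoint("/tmp/window-checkpoint")  // required by the inverse-reduce form

val windowedTotals: DStream[(Long, Double)] = pairs.reduceByKeyAndWindow(
  (a: Double, b: Double) => a + b,  // add values entering the window
  (a: Double, b: Double) => a - b,  // subtract values leaving the window
  Minutes(60),                      // window duration: 1 hour
  Seconds(5)                        // slide interval: 5 seconds
)
windowedTotals.print()

The slide interval must be a multiple of your batch interval, so this assumes
a 5-second (or 1-second) batch interval. Passing the inverse reduce function
keeps the hour-long window incremental instead of recomputing it from scratch
every 5 seconds, which matters for a window this large.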

Mohammed


