Re: Calculating Timeseries Aggregation

Sandip Mehta Thu, 19 Nov 2015 01:50:52 -0800

Thank you Sanket for the feedback.


Regards
SM
> On 19-Nov-2015, at 1:57 PM, Sanket Patil <[email protected]> wrote:
> 
> Hey Sandip:
> 
> TD has already outlined the right approach, but let me add a couple of 
> thoughts as I recently worked on a similar project. I had to compute some 
> real-time metrics on streaming data. Also, these metrics had to be aggregated 
> for hour/day/week/month. My data pipeline was Kafka --> Spark Streaming --> 
> Cassandra.
> 
> I had a Spark Streaming job that did the following: (1) receive a window of 
> raw streaming data and write it to Cassandra, and (2) do only the basic 
> computations that need to be shown on a real-time dashboard, and store the 
> results in Cassandra. (I had to use a sliding window because my computation 
> involved joining data that might occur in different time windows.)
> 
> I had a separate set of Spark jobs that pulled the raw data from Cassandra, 
> computed the aggregations and more complex metrics, and wrote them back to the 
> relevant Cassandra tables. These jobs ran periodically, every few minutes.
> 
> Regards,
> Sanket
> 
> On Thu, Nov 19, 2015 at 8:09 AM, Sandip Mehta <[email protected]> wrote:
> Thank you TD for your time and help.
> 
> SM
>> On 19-Nov-2015, at 6:58 AM, Tathagata Das <[email protected]> wrote:
>> 
>> There are different ways to do the rollups. You can update the rollups from 
>> the streaming application, generate them in a later process (say, a periodic 
>> Spark job every hour), or generate them on demand, when they are queried.
>> Which one to choose depends on your downstream requirements: if you always 
>> need up-to-date rollups to show in a dashboard (even day-level ones), then 
>> the first approach is better. Otherwise, the second and third approaches are 
>> more efficient.
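
A minimal sketch of the second and third approaches, in plain Python rather than Spark: it assumes only minute-level totals are stored and derives coarser rollups from them later. The key format ("YYYY-MM-DDTHH:MM") and the names `roll_up`, `HOUR` and `DAY` are illustrative assumptions, not something from this thread.

```python
from collections import defaultdict

# Prefix lengths of the assumed minute key "YYYY-MM-DDTHH:MM".
HOUR = len("2015-11-19T01")  # 13 characters
DAY = len("2015-11-19")      # 10 characters


def roll_up(minute_totals, key_length):
    """Re-aggregate minute buckets into coarser ones by truncating the key."""
    coarse = defaultdict(int)
    for minute_key, value in minute_totals.items():
        coarse[minute_key[:key_length]] += value
    return dict(coarse)


# Hypothetical minute-level totals as they might be read back from the store.
minute_totals = {
    "2015-11-19T01:50": 3,
    "2015-11-19T01:51": 2,
    "2015-11-19T02:00": 4,
}

hourly = roll_up(minute_totals, HOUR)
daily = roll_up(minute_totals, DAY)
```

This only works directly for additive metrics such as counts and sums. An average, for example, would need its sum and count stored separately and divided after the rollup.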
>> 
>> TD
>> 
>> 
>> On Wed, Nov 18, 2015 at 7:15 AM, Sandip Mehta <[email protected]> wrote:
>> TD thank you for your reply.
>> 
>> I agree on the data store requirement. I am using HBase as the underlying store.
>> 
>> So for every batch interval of, say, 10 seconds:
>> 
>> - Calculate the time dimensions (minute, hour, day, week, month and 
>> quarter) along with the other dimensions and metrics.
>> - Update the relevant base table at each batch interval with the relevant 
>> metrics for a given set of dimensions.
>> 
>> The only caveat I see is that I'll have to update each of the different 
>> rollup tables for every batch window.
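
The two steps above can be sketched in plain Python, without Spark or HBase. This is an illustration under assumptions: the bucket key formats, the event shape `(timestamp, dimension, value)` and the names `time_buckets` and `rollup_batch` are all hypothetical, and weeks are taken to be ISO weeks.

```python
from collections import Counter
from datetime import datetime, timezone


def time_buckets(ts):
    """Return one bucket key per rollup granularity for an event timestamp."""
    iso_year, iso_week, _ = ts.isocalendar()
    quarter = (ts.month - 1) // 3 + 1
    return {
        "minute": ts.strftime("%Y-%m-%dT%H:%M"),
        "hour": ts.strftime("%Y-%m-%dT%H"),
        "day": ts.strftime("%Y-%m-%d"),
        "week": f"{iso_year}-W{iso_week:02d}",
        "month": ts.strftime("%Y-%m"),
        "quarter": f"{ts.year}-Q{quarter}",
    }


def rollup_batch(events):
    """Sum one batch into counters keyed by (granularity, bucket, dimension)."""
    totals = Counter()
    for ts, dimension, value in events:
        for granularity, bucket in time_buckets(ts).items():
            totals[(granularity, bucket, dimension)] += value
    return totals


# One hypothetical 10-second batch with two events for the same dimension.
batch = [
    (datetime(2015, 11, 19, 1, 50, 52, tzinfo=timezone.utc), "page_a", 3),
    (datetime(2015, 11, 19, 1, 51, 5, tzinfo=timezone.utc), "page_a", 2),
]
totals = rollup_batch(batch)
```

Each entry in `totals` corresponds to one increment against one rollup table, which makes the caveat concrete: every batch touches every granularity, so the number of writes grows with the number of rollup levels.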
>> 
>> Is this a valid approach for calculating time series aggregation?
>> 
>> Regards
>> SM
>> 
>> For minute-level aggregates, I have set up a streaming window of, say, 10 
>> seconds and am storing minute-level aggregates across multiple dimensions in 
>> HBase at every window interval.
>> 
>>> On 18-Nov-2015, at 7:45 AM, Tathagata Das <[email protected]> wrote:
>>> 
>>> For this sort of long-term aggregation you should use a dedicated data 
>>> storage system, such as a database or a key-value store. Spark Streaming 
>>> would just aggregate and push the necessary data to the data store.
>>> 
>>> TD
>>> 
>>> On Sat, Nov 14, 2015 at 9:32 PM, Sandip Mehta <[email protected]> wrote:
>>> Hi,
>>> 
>>> I am working on a requirement to calculate real-time metrics and am building 
>>> a prototype on Spark Streaming. I need to build aggregates at the second, 
>>> minute, hour and day level.
>>> 
>>> I am not sure whether I should calculate all these aggregates as different 
>>> windowed functions on the input DStream, or whether I should use the 
>>> updateStateByKey function instead. If I have to use updateStateByKey for 
>>> these time series aggregations, how can I remove keys from the state once 
>>> their respective time periods have elapsed?
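
On the key-removal question: per the Spark Streaming programming guide, a key is dropped from the state when the update function passed to updateStateByKey returns None for it. Below is a minimal sketch of such an update function in plain Python, with the `(new_values, previous_state)` shape that PySpark's `updateStateByKey` expects. The idle-batch counter, the `MAX_IDLE_BATCHES` threshold and the function name are assumptions made for illustration. The function is exercised here by direct calls, not inside a streaming job.

```python
# Assumed expiry rule: drop a key after this many consecutive batches with no
# new data (6 batches of 10 seconds would be roughly one minute).
MAX_IDLE_BATCHES = 6


def update_total(new_values, state):
    """Keep (running_total, idle_batches) per key; return None to expire it."""
    total, idle = state if state is not None else (0, 0)
    if new_values:
        # Fresh data arrived: add it and reset the idle counter.
        return (total + sum(new_values), 0)
    idle += 1
    if idle >= MAX_IDLE_BATCHES:
        # Returning None asks Spark Streaming to remove this key's state.
        return None
    return (total, idle)


# Simulate successive batches for a single key.
state = update_total([1, 2], None)   # first batch with data
state = update_total([], state)      # one idle batch
```

A different expiry rule per granularity (seconds, minutes, hours, days) could use a different threshold per key type. Whether that is preferable to pushing the aggregates to an external store is a separate design choice.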
>>> 
>>> Please suggest.
>>> 
>>> Regards
>>> SM
>>> 
>>> 
>> 
>> 
> 
> 
> 
> 
> -- 
> SuperReceptionist is now available on Android mobiles. Track your business on 
> the go with call analytics, recordings, insights and more: Download the app 
> here <https://play.google.com/store/apps/details?id=com.superreceptionist>
