Hi, any hints? Is my previous question unclear? Let me try to reformulate:

- How should I manage the shared buffers and data required for moving/sliding-window management and nearest-neighbour data aggregation?
- Would I really benefit from moving to Storm versus my current script, considering that in-memory data management greatly speeds up my current process?
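To make the question more concrete, here is roughly what I imagine the aggregation bolt could look like in Java if I ported the per-item buffers to Storm. This is only a sketch of my current understanding: PriceSpout, the field names ("item", "price"), the window size and the fieldsGrouping choice are assumptions on my side, and the nearest-neighbour part is left out.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

public class MovingAverageBolt extends BaseRichBolt {

    // Per-item state kept inside the bolt task rather than in a global shared
    // buffer. With fieldsGrouping on "item", every tuple for a given item is
    // routed to the same task, so only one thread ever touches these maps.
    private Map<String, Deque<Double>> windows;
    private Map<String, Double> sums;
    private OutputCollector collector;

    private static final int WINDOW_SIZE = 60; // placeholder, e.g. 60 ticks

    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
        this.windows = new HashMap<String, Deque<Double>>();
        this.sums = new HashMap<String, Double>();
    }

    @Override
    public void execute(Tuple tuple) {
        String item = tuple.getStringByField("item");
        double price = tuple.getDoubleByField("price");

        Deque<Double> window = windows.get(item);
        if (window == null) {
            window = new ArrayDeque<Double>();
            windows.put(item, window);
            sums.put(item, 0.0);
        }

        // Delta update: add the new price, evict the oldest one, keep a running
        // sum so the average costs O(1) per tuple (same trick as in my script).
        double sum = sums.get(item) + price;
        window.addLast(price);
        if (window.size() > WINDOW_SIZE) {
            sum -= window.removeFirst();
        }
        sums.put(item, sum);

        collector.emit(tuple, new Values(item, sum / window.size()));
        collector.ack(tuple);
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("item", "moving_avg"));
    }
}

// Wiring sketch (PriceSpout is a placeholder for my CSV price reader):
//   TopologyBuilder builder = new TopologyBuilder();
//   builder.setSpout("prices", new PriceSpout(), 1);
//   builder.setBolt("avg", new MovingAverageBolt(), 4)
//          .fieldsGrouping("prices", new Fields("item"));

If I understand correctly, with fieldsGrouping on "item" each bolt task only ever sees its own items, so the per-item windows can stay as plain in-memory fields. What I do not see is where the cross-item nearest-neighbour buffers should live.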
Thank you,
Xavier

On Thu, Apr 24, 2014 at 7:40 PM, Xavier Daull <[email protected]> wrote:

> I have already developed a Python script (not using Storm) that transforms a
> stream of millions of price-history records for different items (provided in
> one common CSV) and outputs dedicated streams for each item with enriched
> data in real time. The script computes and aggregates in real time the latest
> item price together with past data to get the moving average and slope over
> different timeframes (month/week/day/hour), and adds to it the latest data
> from the nearest items (neighbours). The goal is to feed models for price
> prediction. To manage the time-aggregated data and the nearest-neighbour
> data, I use a shared buffer of the recent data needed for aggregation, the
> latest computed data for each item, and some shared timestamp indexes.
>
> I am wondering whether I would really benefit from moving this script to
> Storm, and how.
>
> My first understanding of Storm is that I should:
> - create a dedicated spout class to fetch the price data;
> - create a dedicated bolt class to aggregate the data (moving averages /
>   slopes / cross-aggregated data between items).
>
> Where should I put the shared buffers and data required to efficiently
> compute my time-aggregated and nearest-neighbour data?
>
> Will the topology impact performance compared to in-memory data management?
> My current script, even though it is in Python, benefits greatly from
> efficient buffered computation (no recomputation, delta averages...), little
> data manipulation, and minimal memory access and computation.
>
> Thank you for your advice.
> Xavier
