Hi, I am very new to Storm and trying to evaluate whether it fits my needs. I work on a project where we compute reasonably simple queries (sum, average, mean, percentile, ...) on large amounts of very simply structured data (counters from many sensors, each with a value every 5 minutes). We are currently reaching the limits of our architecture (still MySQL-based) and are moving to Cassandra as our data store. We also want to parallelize the queries across a cluster so we can answer them as fast as possible.
While Cassandra seems to be a good fit for our data storage needs (quick access, good write performance, fault tolerance, ...), we are still looking for a component that could help us distribute our queries over a cluster. I have been looking at Storm/Trident and running some tests/examples for the last few days, and while I do believe we "could make it happen", I would like the opinion of an experienced Storm user/dev on whether it truly makes sense for our problem, since we don't really have a continuous "stream" of data.

First, in the short term, we want to run "simple queries" over the Cassandra store. I envision things this way:

query --> [ QUEUE ] --> [ distribute/process queries ] --> answer/output

Queries are discrete events; we don't want to keep state between them. We have some very simple queries and some more complex ones that require going through a lot of data (tens of millions of 'cells'), so we want to be able to *cut down* big queries into smaller pieces (most probably by dividing them by time range), both to reply faster and to prevent big queries from taking all the resources. (I have appended a rough sketch of what I imagine at the end of this mail.) We would like to send the results of a query straight into another Cassandra CF and to an endpoint in our system.

Finally, because of some non-technical business requirements (i.e., our clients' IT teams' reluctance to give us more servers ;)), we will have to host the 'workers' on the same servers as the Cassandra nodes. I thought it could make sense to use Cassandra's token-aware policy to always try to make workers fetch data locally. This would let us piggyback on Cassandra's load balancing, since we use random partitioning, which normally distributes rows evenly across the cluster, and a single row is small enough to compute on without breaking the task down further. Is it possible with Storm/Trident to direct the way computations are distributed (i.e., which workers the 'tasks' are sent to), or does that go against its design? (See the second sketch below for what I have in mind.)

All in all, how good a fit is Storm for this use case? What about Trident? If the project isn't a good fit, do you know of other open-source projects that address this need? The current alternative I envision is designing a homebrew solution using Akka.

Any opinion is greatly appreciated! Thanks a lot for your help!

Simon Chemouil
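
P.S. To make the "cut big queries into smaller pieces" idea more concrete, here is roughly the kind of Trident DRPC topology I have in mind. This is only a sketch of what I imagine, not working code: SplitByTimeRange and CassandraPartialSum are placeholder names for functions we would have to write ourselves, and I may well be misreading the Trident API.

import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Sum;
import storm.trident.tuple.TridentTuple;

public class SensorQueryTopology {

    // Placeholder: splits a DRPC request like "sensor42 1357000000 1357600000"
    // into fixed-size time chunks so they can be processed in parallel.
    public static class SplitByTimeRange extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String[] args = tuple.getString(0).split(" ");
            String sensor = args[0];
            long start = Long.parseLong(args[1]);
            long end = Long.parseLong(args[2]);
            long chunk = 3600; // 1-hour chunks, just as an example
            for (long t = start; t < end; t += chunk) {
                collector.emit(new Values(sensor, t, Math.min(t + chunk, end)));
            }
        }
    }

    // Placeholder: would read the matching cells from Cassandra (ideally from
    // the local node) and emit a partial sum for its chunk.
    public static class CassandraPartialSum extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            double partial = 0.0; // ... sum of the cells read for this chunk ...
            collector.emit(new Values(partial));
        }
    }

    public static StormTopology build() {
        TridentTopology topology = new TridentTopology();
        topology.newDRPCStream("sensor_sum")
                .each(new Fields("args"), new SplitByTimeRange(),
                      new Fields("sensor", "from", "to"))
                .each(new Fields("sensor", "from", "to"), new CassandraPartialSum(),
                      new Fields("partial"))
                .parallelismHint(8)
                .aggregate(new Fields("partial"), new Sum(), new Fields("total"));
        return topology.build();
    }
}

The idea would be that the DRPC request plays the role of the [ QUEUE ] entry point, the split function cuts the query by time range, and the aggregate step combines the partial results before the answer goes back to the caller (and, I suppose, could also be written to another CF).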

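And regarding my question about directing where computations are sent: from skimming the docs, a custom stream grouping looks like it might be the intended hook, so here is the kind of thing I imagine. Again this is only a sketch; TokenAwareGrouping is hypothetical, and I am not sure I have the CustomStreamGrouping interface right for the current Storm version.

import java.util.Arrays;
import java.util.List;

import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;

// Hypothetical grouping that would send each query chunk to a task running on
// the Cassandra node that owns the chunk's row key, to keep reads local.
public class TokenAwareGrouping implements CustomStreamGrouping {

    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
        // Presumably we could use the context here to find out which host each
        // target task lives on and build a host -> tasks index once.
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        Object rowKey = values.get(0);
        // Placeholder logic: in the real thing we would ask the Cassandra
        // driver which replica owns rowKey and pick a colocated task; plain
        // hashing is only a stand-in here.
        int idx = (rowKey.hashCode() & Integer.MAX_VALUE) % targetTasks.size();
        return Arrays.asList(targetTasks.get(idx));
    }
}

If I understand correctly, it would be attached with something like builder.setBolt("compute", new ComputeBolt(), 8).customGrouping("split", new TokenAwareGrouping()), but please correct me if that is not how task placement is supposed to be influenced, or if tasks cannot really be pinned to particular hosts at all.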