Hi, I am very new to Storm and trying to evaluate whether it fits my needs. I work on a project where we compute reasonably simple queries (sum, average, mean, percentile, ...) on large amounts of very simply structured data (counters from many sensors, each with a value every 5 minutes). We are currently reaching the limits of our architecture (still MySQL-based) and are moving to Cassandra as our data store. We also want to parallelize the queries across a cluster so we can answer them as fast as possible.
While Cassandra seems to be a good fit for our data storage needs (quick access, good write performance, fault tolerance, ...), we are still looking for a component that could help us distribute our queries over a cluster. I have been looking at Storm/Trident and running some tests/examples for the last few days, and while I do believe we "could make it happen", I would like the opinion of an experienced Storm user/dev on whether it truly makes sense for our problem, since we don't really have a continuous "stream" of data.

First, in the short term, we want to run "simple queries" over the Cassandra store. I envision things this way:

query --> [ QUEUE ] --> [ distribute/process queries ] --> answer/output

Queries are discrete events; we don't want to keep state between them. We have some very simple queries and some more complex ones that require going through a lot of data (tens of millions of 'cells'), so we want to be able to *cut down* big queries into smaller pieces (most probably by dividing them by time range), both to reply faster and to prevent big queries from taking all the resources. (I have appended a rough sketch of what I imagine at the end of this mail.) We would like to send the results of a query straight into another Cassandra CF and to an endpoint in our system.

Finally, because of some non-technical business requirements (i.e., our clients' IT teams' reluctance to give us more servers ;)), we will have to host the 'workers' on the same servers as the Cassandra nodes. I thought it could make sense to use Cassandra's token-aware policy to always try to make workers fetch data locally. This would let us piggyback on Cassandra's load balancing, since we use random partitioning, which normally distributes rows evenly across the cluster, and a single row is small enough to compute on without breaking the task down further. Is it possible with Storm/Trident to direct the way computations are distributed (i.e., which workers the 'tasks' are sent to), or does that go against its design? (See the second sketch below for what I have in mind.)

All in all, how good a fit is Storm for this use case? What about Trident? If the project isn't a good fit, do you know of other open-source projects that address this need? The current alternative I envision is designing a homebrew solution using Akka.

Any opinion is greatly appreciated! Thanks a lot for your help!

Simon Chemouil
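
P.S. To make the "cut big queries into smaller pieces" idea more concrete, here is roughly the kind of Trident DRPC topology I have in mind. This is only a sketch of what I imagine, not working code: SplitByTimeRange and CassandraPartialSum are placeholder names for functions we would have to write ourselves, and I may well be misreading the Trident API.

import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFunction;
import storm.trident.operation.TridentCollector;
import storm.trident.operation.builtin.Sum;
import storm.trident.tuple.TridentTuple;

public class SensorQueryTopology {

    // Placeholder: splits a DRPC request like "sensor42 1357000000 1357600000"
    // into fixed-size time chunks so they can be processed in parallel.
    public static class SplitByTimeRange extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            String[] args = tuple.getString(0).split(" ");
            String sensor = args[0];
            long start = Long.parseLong(args[1]);
            long end = Long.parseLong(args[2]);
            long chunk = 3600; // 1-hour chunks, just as an example
            for (long t = start; t < end; t += chunk) {
                collector.emit(new Values(sensor, t, Math.min(t + chunk, end)));
            }
        }
    }

    // Placeholder: would read the matching cells from Cassandra (ideally from
    // the local node) and emit a partial sum for its chunk.
    public static class CassandraPartialSum extends BaseFunction {
        @Override
        public void execute(TridentTuple tuple, TridentCollector collector) {
            double partial = 0.0; // ... sum of the cells read for this chunk ...
            collector.emit(new Values(partial));
        }
    }

    public static StormTopology build() {
        TridentTopology topology = new TridentTopology();
        topology.newDRPCStream("sensor_sum")
                .each(new Fields("args"), new SplitByTimeRange(),
                      new Fields("sensor", "from", "to"))
                .each(new Fields("sensor", "from", "to"), new CassandraPartialSum(),
                      new Fields("partial"))
                .parallelismHint(8)
                .aggregate(new Fields("partial"), new Sum(), new Fields("total"));
        return topology.build();
    }
}

The idea would be that the DRPC request plays the role of the [ QUEUE ] entry point, the split function cuts the query by time range, and the aggregate step combines the partial results before the answer goes back to the caller (and, I suppose, could also be written to another CF).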

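And regarding my question about directing where computations are sent: from skimming the docs, a custom stream grouping looks like it might be the intended hook, so here is the kind of thing I imagine. Again this is only a sketch; TokenAwareGrouping is hypothetical, and I am not sure I have the CustomStreamGrouping interface right for the current Storm version.

import java.util.Arrays;
import java.util.List;

import backtype.storm.generated.GlobalStreamId;
import backtype.storm.grouping.CustomStreamGrouping;
import backtype.storm.task.WorkerTopologyContext;

// Hypothetical grouping that would send each query chunk to a task running on
// the Cassandra node that owns the chunk's row key, to keep reads local.
public class TokenAwareGrouping implements CustomStreamGrouping {

    private List<Integer> targetTasks;

    @Override
    public void prepare(WorkerTopologyContext context, GlobalStreamId stream,
                        List<Integer> targetTasks) {
        this.targetTasks = targetTasks;
        // Presumably we could use the context here to find out which host each
        // target task lives on and build a host -> tasks index once.
    }

    @Override
    public List<Integer> chooseTasks(int taskId, List<Object> values) {
        Object rowKey = values.get(0);
        // Placeholder logic: in the real thing we would ask the Cassandra
        // driver which replica owns rowKey and pick a colocated task; plain
        // hashing is only a stand-in here.
        int idx = (rowKey.hashCode() & Integer.MAX_VALUE) % targetTasks.size();
        return Arrays.asList(targetTasks.get(idx));
    }
}

If I understand correctly, it would be attached with something like builder.setBolt("compute", new ComputeBolt(), 8).customGrouping("split", new TokenAwareGrouping()), but please correct me if that is not how task placement is supposed to be influenced, or if tasks cannot really be pinned to particular hosts at all.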