Re: Storm/Trident as a distributed query engine?

Simon Chemouil Fri, 21 Mar 2014 10:06:22 -0700

Hi Adam,

Thanks a lot for your answer. I've checked both your pointers. I had
seen Trident's DRPC system before, just not sure how it fits in the
larger picture.


Druid surely seems to fit my use case (very) closely, so I'll try to run
it asap with my data to see if truly does.

I'm still open for feedback if someone else has an educated opinion
regarding Storm/Trident as a C* query engine.

Thanks!

Simon

Le 21/03/2014 15:39, Adam Lewis a écrit :
> When evaluating storm, definitely take a closer look at the DRPC
> mechanism in Trident for your use case.  To my knowledge there is no
> current support for data locality like you describe with Cassandra,
> although there was a discussion on the mailing list a couple of months
> ago around someone looking to do a school project and one of the popular
> suggestions was to implement a state-location-aware partitioning.
> 
> As far as other projects, have you taken a look at druid
> (http://druid.io/)?  It would represent an alternative to Cassandra in
> your current setup but is more suited to the types of multi-dimensional
> querying and aggregates you describe and can ingest sensor data in batch
> or realtime.
> 
> 
> On Fri, Mar 21, 2014 at 10:21 AM, Simon Chemouil <[email protected]
> <mailto:[email protected]>> wrote:
> 
>     Hi,
> 
>     I am very new to Storm and trying to evaluate whether it fits my needs
>     or not. I work on a project where we compute reasonably simple queries
>     (sum, average, mean, percentile...) on large amount of very simple
>     structured data (counters of many sensors with a value every 5 minutes).
>     We are currently reaching the limit of our architecture (still MySQL
>     based) and moving to Cassandra for our data store. We want to also
>     parallelize the queries to run on a cluster to be able to answer the
>     queries as fast as possible.
> 
>     While Cassandra seems to be a good fit for our needs of data storage
>     (quick access, good write performance, fault-tolerant, ...), we're still
>     looking for a component which could help us distribute our queries over
>     a cluster. I've been looking at Storm/Trident and running some
>     tests/examples for the last few days, and while I do believe we "could
>     make it happen", I would like to have the opinion of an experienced
>     Storm user/dev to know if it truly makes sense for our problem, since we
>     don't really have a continuous "stream" of data.
> 
>     First, in the short-term, we want to run "simple queries" over the
>     Cassandra store. I envision things this way:
>     query --> [ QUEUE ] --> [ distribute/process queries ] --> answer/output
> 
>     Queries are a discrete events, we don't want to keep state between them.
> 
>     We have some very simple queries and some more complex that require
>     going through a lot of data (tens of millions of 'cells'), so we want to
>     be able to *cut down* big queries in smaller pieces (most probably
>     divide them by time range) both to reply faster and to prevent big
>     queries from taking all resources.
> 
>     We would like to send the results of the query straight into another
>     Cassandra CF and to an endpoint in our system.
> 
>     Finally, because of some non-technical business requirements (i.e, our
>     clients' IT team reluctance to give us more servers ;))  we will have to
>     host the 'workers' on the same servers as Cassandra nodes. I thought it
>     could make sense to use Cassandra's token aware policy to always try to
>     make workers fetch data locally. This would allow us to piggyback on
>     Cassandra's load balancing since we use random partitioning that
>     normally evenly distributes the rows across our cluster, and a row is
>     small enough to compute on without breaking the task down further. Is it
>     possible with Storm/Trident to direct the way computations are
>     distributed (i.e, to which worker 'tasks' are sent) or is it going
>     against its design?
> 
>     All-in-all, how good a fit is Storm for this use case? What about
>     Trident? If the project isn't a good fit, do you know of other
>     open-source projects that address this need? The current alternative I
>     envision is designing a homebrew solution using Akka. Any opinion is
>     greatly appreciated!
> 
>     Thanks a lot for your help!
> 
>     Simon Chemouil
> 
>

Re: Storm/Trident as a distributed query engine?

Reply via email to