Hi Bobby,

Thanks for your answer. Are you using Storm for this purpose or
building something new?
Thanks,
Simon

On 21/03/2014 19:07, Bobby Chowdary wrote:
> Hi Simon,
>
> We are working on a very similar application. Currently we are using
> Redis as the endpoint, but ideally we want a RESTful application that
> can submit queries and get results back (something similar to
> Ooyala's job server for Spark,
> https://github.com/ooyala/spark-jobserver). So far the results have
> been encouraging, but we are still at an early stage of evaluating
> this approach.
>
> Thanks
>
> On Mar 21, 2014, at 10:05 AM, Simon Chemouil <[email protected]> wrote:
>
>> Hi Adam,
>>
>> Thanks a lot for your answer. I've checked both your pointers. I had
>> seen Trident's DRPC system before; I'm just not sure how it fits
>> into the larger picture.
>>
>> Druid surely seems to fit my use case (very) closely, so I'll try to
>> run it asap with my data to see if it truly does.
>>
>> I'm still open to feedback if someone else has an educated opinion
>> regarding Storm/Trident as a C* query engine.
>>
>> Thanks!
>>
>> Simon
>>
>> On 21/03/2014 15:39, Adam Lewis wrote:
>>> When evaluating Storm, definitely take a closer look at the DRPC
>>> mechanism in Trident for your use case. To my knowledge there is no
>>> current support for data locality like you describe with Cassandra,
>>> although there was a discussion on the mailing list a couple of
>>> months ago around someone looking to do a school project, and one
>>> of the popular suggestions was to implement state-location-aware
>>> partitioning.
>>>
>>> As far as other projects, have you taken a look at Druid
>>> (http://druid.io/)? It would represent an alternative to Cassandra
>>> in your current setup, but it is better suited to the types of
>>> multi-dimensional querying and aggregation you describe, and it can
>>> ingest sensor data in batch or in realtime.
>>>
>>> On Fri, Mar 21, 2014 at 10:21 AM, Simon Chemouil
>>> <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am very new to Storm and am trying to evaluate whether it fits
>>>> my needs or not. I work on a project where we compute reasonably
>>>> simple queries (sum, average, mean, percentile...) over large
>>>> amounts of very simple structured data (counters from many
>>>> sensors, with a value every 5 minutes). We are currently reaching
>>>> the limits of our architecture (still MySQL based) and are moving
>>>> to Cassandra for our data store. We also want to parallelize the
>>>> queries across a cluster so we can answer them as fast as
>>>> possible.
>>>>
>>>> While Cassandra seems to be a good fit for our storage needs
>>>> (quick access, good write performance, fault tolerance, ...), we
>>>> are still looking for a component that could help us distribute
>>>> our queries over a cluster. I've been looking at Storm/Trident and
>>>> running some tests/examples for the last few days, and while I do
>>>> believe we "could make it happen", I would like the opinion of an
>>>> experienced Storm user/dev on whether it truly makes sense for our
>>>> problem, since we don't really have a continuous "stream" of data.
>>>>
>>>> First, in the short term, we want to run "simple queries" over the
>>>> Cassandra store. I envision things this way:
>>>>
>>>>     query --> [ QUEUE ] --> [ distribute/process queries ] --> answer/output
>>>>
>>>> Queries are discrete events; we don't want to keep state between
>>>> them.
>>>>
>>>> We have some very simple queries and some more complex ones that
>>>> require going through a lot of data (tens of millions of 'cells'),
>>>> so we want to be able to *cut down* big queries into smaller
>>>> pieces (most probably dividing them by time range), both to reply
>>>> faster and to prevent big queries from taking all the resources.
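>>>>
>>>> Concretely, here is a minimal sketch of what I picture with
>>>> Trident's DRPC, where every call is one discrete query and a large
>>>> time range is fanned out into day-sized slices before touching
>>>> Cassandra. This is against the Storm 0.9.x API; ParseQuery,
>>>> SplitByDay and CassandraFetch are hypothetical stand-ins for code
>>>> we would write against our own schema (the fetch is stubbed here):
>>>>
>>>>     import backtype.storm.LocalDRPC;
>>>>     import backtype.storm.tuple.Fields;
>>>>     import backtype.storm.tuple.Values;
>>>>     import storm.trident.TridentTopology;
>>>>     import storm.trident.operation.BaseFunction;
>>>>     import storm.trident.operation.TridentCollector;
>>>>     import storm.trident.operation.builtin.Sum;
>>>>     import storm.trident.tuple.TridentTuple;
>>>>
>>>>     public class CounterQueryTopology {
>>>>
>>>>         /** Parses "sensorId fromMillis toMillis" (hypothetical query format). */
>>>>         public static class ParseQuery extends BaseFunction {
>>>>             @Override
>>>>             public void execute(TridentTuple t, TridentCollector out) {
>>>>                 String[] p = t.getString(0).split(" ");
>>>>                 out.emit(new Values(p[0], Long.parseLong(p[1]), Long.parseLong(p[2])));
>>>>             }
>>>>         }
>>>>
>>>>         /** Cuts [from, to) into day-sized sub-queries. */
>>>>         public static class SplitByDay extends BaseFunction {
>>>>             private static final long DAY_MS = 24L * 60 * 60 * 1000;
>>>>             @Override
>>>>             public void execute(TridentTuple t, TridentCollector out) {
>>>>                 long from = t.getLong(1), to = t.getLong(2);
>>>>                 for (long s = from; s < to; s += DAY_MS) {
>>>>                     out.emit(new Values(t.getString(0), s, Math.min(s + DAY_MS, to)));
>>>>                 }
>>>>             }
>>>>         }
>>>>
>>>>         /** Stub: would read the counter cells for (sensor, [f, t)) from Cassandra. */
>>>>         public static class CassandraFetch extends BaseFunction {
>>>>             @Override
>>>>             public void execute(TridentTuple t, TridentCollector out) {
>>>>                 out.emit(new Values(0L)); // one tuple per cell value in the real thing
>>>>             }
>>>>         }
>>>>
>>>>         public static TridentTopology build(LocalDRPC drpc) {
>>>>             TridentTopology topology = new TridentTopology();
>>>>             // each DRPC call is one discrete query; no state kept between calls
>>>>             topology.newDRPCStream("sum-counters", drpc)
>>>>                 .each(new Fields("args"), new ParseQuery(),
>>>>                       new Fields("sensor", "from", "to"))
>>>>                 // fan a big query out into day-sized sub-queries
>>>>                 .each(new Fields("sensor", "from", "to"), new SplitByDay(),
>>>>                       new Fields("s", "f", "t"))
>>>>                 // fetch the matching cells, one tuple per value
>>>>                 .each(new Fields("s", "f", "t"), new CassandraFetch(),
>>>>                       new Fields("value"))
>>>>                 // combine the partial results and answer the caller
>>>>                 .aggregate(new Fields("value"), new Sum(), new Fields("sum"));
>>>>             return topology;
>>>>         }
>>>>     }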
>>>>
>>>> We would like to send the results of a query straight into another
>>>> Cassandra CF and to an endpoint in our system.
>>>>
>>>> Finally, because of some non-technical business requirements
>>>> (i.e., our clients' IT team's reluctance to give us more servers
>>>> ;)), we will have to host the 'workers' on the same servers as the
>>>> Cassandra nodes. I thought it could make sense to use Cassandra's
>>>> token-aware policy to always try to make workers fetch data
>>>> locally. This would let us piggyback on Cassandra's load
>>>> balancing, since we use random partitioning, which normally
>>>> distributes the rows evenly across our cluster, and a row is small
>>>> enough to compute on without breaking the task down further. Is it
>>>> possible with Storm/Trident to direct the way computations are
>>>> distributed (i.e., to which worker 'tasks' tuples are sent), or
>>>> does that go against its design?
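>>>>
>>>> On the driver side, I believe the DataStax Java driver's
>>>> TokenAwarePolicy already does the replica-aware routing. A minimal
>>>> sketch of the client configuration I have in mind (the contact
>>>> point is just an example):
>>>>
>>>>     import com.datastax.driver.core.Cluster;
>>>>     import com.datastax.driver.core.policies.RoundRobinPolicy;
>>>>     import com.datastax.driver.core.policies.TokenAwarePolicy;
>>>>
>>>>     public class LocalReads {
>>>>         public static void main(String[] args) {
>>>>             // route each statement to a replica that owns its
>>>>             // partition key, round-robin among the candidates
>>>>             Cluster cluster = Cluster.builder()
>>>>                 .addContactPoint("127.0.0.1") // example address
>>>>                 .withLoadBalancingPolicy(
>>>>                     new TokenAwarePolicy(new RoundRobinPolicy()))
>>>>                 .build();
>>>>             // sessions created from this cluster would then prefer
>>>>             // replica-local reads when the workers are co-located
>>>>         }
>>>>     }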
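>>>>
>>>> On the Storm side, a custom stream grouping looks like the hook
>>>> for directing tuples to particular tasks. Here is a skeleton of
>>>> what I have in mind; the token-range-to-task mapping is the part
>>>> we would still have to build, so it is only hinted at:
>>>>
>>>>     import java.util.Arrays;
>>>>     import java.util.List;
>>>>
>>>>     import backtype.storm.generated.GlobalStreamId;
>>>>     import backtype.storm.grouping.CustomStreamGrouping;
>>>>     import backtype.storm.task.WorkerTopologyContext;
>>>>
>>>>     public class TokenAwareGrouping implements CustomStreamGrouping {
>>>>         private List<Integer> targetTasks;
>>>>
>>>>         @Override
>>>>         public void prepare(WorkerTopologyContext context,
>>>>                             GlobalStreamId stream,
>>>>                             List<Integer> targetTasks) {
>>>>             this.targetTasks = targetTasks;
>>>>             // here we would build the map from Cassandra token
>>>>             // ranges to the tasks running on the matching hosts
>>>>         }
>>>>
>>>>         @Override
>>>>         public List<Integer> chooseTasks(int taskId, List<Object> values) {
>>>>             // placeholder routing: hash the partition key (first
>>>>             // field) across the tasks; the real version would look
>>>>             // it up in the map built in prepare()
>>>>             int i = Math.abs(values.get(0).hashCode() % targetTasks.size());
>>>>             return Arrays.asList(targetTasks.get(i));
>>>>         }
>>>>     }
>>>>
>>>> It would then be wired in with something like
>>>> builder.setBolt(...).customGrouping("query-spout", new
>>>> TokenAwareGrouping()), if I read the API right.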
>>>>
>>>> All in all, how good a fit is Storm for this use case? What about
>>>> Trident? If the project isn't a good fit, do you know of other
>>>> open-source projects that address this need? The current
>>>> alternative I envision is designing a homebrew solution using
>>>> Akka. Any opinion is greatly appreciated!
>>>>
>>>> Thanks a lot for your help!
>>>>
>>>> Simon Chemouil