Hi Bobby,

Thanks for your answer. Are you using Storm for this purpose or building
something new?

Thanks,

Simon

Bobby Chowdary wrote on 21/03/2014 19:07:
> Hi Simon,
> We are working on a very similar application. Currently we
> are using Redis as the endpoint, but ideally we want a RESTful
> application that can submit queries and get results back (something
> similar to the Ooyala job server for Spark,
> https://github.com/ooyala/spark-jobserver). So far the results have been
> encouraging, but we are still at an early stage of evaluating this approach.
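[The submit-a-query / fetch-the-result shape described above can be sketched in a few lines of standard-library Python. This is a hypothetical toy, not spark-jobserver's actual API; the `/jobs` endpoints and the sum-the-posted-numbers "job" are made up for illustration.]

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

JOBS = {}  # job id -> result; a stand-in for the real result store (Redis, above)

class JobHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Submit a query: here the "job" just sums the posted numbers.
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        job_id = str(len(JOBS) + 1)
        JOBS[job_id] = sum(body["values"])
        self._reply({"job_id": job_id})

    def do_GET(self):
        # Fetch a result back by job id, e.g. GET /jobs/1
        job_id = self.path.rsplit("/", 1)[-1]
        self._reply({"result": JOBS.get(job_id)})

    def _reply(self, payload):
        data = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), JobHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

# Submit a query, then fetch its result by job id.
req = urllib.request.Request(base + "/jobs",
                             json.dumps({"values": [1, 2, 3]}).encode(),
                             {"Content-Type": "application/json"})
job = json.loads(urllib.request.urlopen(req).read())
result = json.loads(urllib.request.urlopen(base + "/jobs/" + job["job_id"]).read())
print(result)  # {'result': 6}
server.shutdown()
```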
> 
> Thanks
> 
> 
> On Mar 21, 2014, at 10:05 AM, Simon Chemouil <[email protected]> wrote:
> 
>> Hi Adam,
>>
>> Thanks a lot for your answer. I've checked both of your pointers. I had
>> seen Trident's DRPC system before; I'm just not sure how it fits into the
>> larger picture.
>>
>> Druid certainly seems to fit my use case (very) closely, so I'll try to
>> run it ASAP with my data to see if it truly does.
>>
>> I'm still open to feedback if someone else has an educated opinion
>> regarding Storm/Trident as a C* query engine.
>>
>> Thanks!
>>
>> Simon
>>
>> On 21/03/2014 15:39, Adam Lewis wrote:
>>> When evaluating Storm, definitely take a closer look at the DRPC
>>> mechanism in Trident for your use case. To my knowledge there is no
>>> current support for data locality like you describe with Cassandra,
>>> although there was a discussion on the mailing list a couple of months
>>> ago around someone looking to do a school project, and one of the
>>> popular suggestions was to implement a state-location-aware partitioning.
>>>
>>> As far as other projects, have you taken a look at Druid
>>> (http://druid.io/)? It would represent an alternative to Cassandra in
>>> your current setup but is more suited to the types of multi-dimensional
>>> querying and aggregates you describe, and can ingest sensor data in
>>> batch or realtime.
>>>
>>>
>>> On Fri, Mar 21, 2014 at 10:21 AM, Simon Chemouil
>>> <[email protected]> wrote:
>>        >
>>> Hi,
>>>
>>> I am very new to Storm and trying to evaluate whether or not it fits
>>> my needs. I work on a project where we compute reasonably simple
>>> queries (sum, average, mean, percentile...) over large amounts of very
>>> simply structured data (counters from many sensors, with a value every
>>> 5 minutes). We are currently reaching the limits of our (still
>>> MySQL-based) architecture and are moving to Cassandra as our data
>>> store. We also want to parallelize the queries across a cluster so we
>>> can answer them as fast as possible.
>>>
>>> While Cassandra seems to be a good fit for our storage needs (quick
>>> access, good write performance, fault tolerance, ...), we're still
>>> looking for a component that could help us distribute our queries over
>>> a cluster. I've been looking at Storm/Trident and running some
>>> tests/examples for the last few days, and while I do believe we "could
>>> make it happen", I would like the opinion of an experienced Storm
>>> user/dev on whether it truly makes sense for our problem, since we
>>> don't really have a continuous "stream" of data.
>>        >
>>> First, in the short term, we want to run "simple queries" over the
>>> Cassandra store. I envision things this way:
>>>
>>> query --> [ QUEUE ] --> [ distribute/process queries ] --> answer/output
>>>
>>> Queries are discrete events; we don't want to keep state between them.
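[The pipeline sketched above can be illustrated with a plain work queue and a pool of worker threads. This is a minimal, Storm-free Python sketch under the assumptions of the email: queries are discrete, stateless events; the `sum`/`average` operations stand in for the real aggregates.]

```python
import queue
import threading

def worker(tasks, results):
    """Pull discrete queries off the queue, compute them, report answers."""
    while True:
        query = tasks.get()
        if query is None:          # sentinel: shut this worker down
            tasks.task_done()
            break
        op, values = query
        if op == "sum":
            answer = sum(values)
        elif op == "average":
            answer = sum(values) / len(values)
        results.put((op, answer))
        tasks.task_done()

tasks, results = queue.Queue(), queue.Queue()
workers = [threading.Thread(target=worker, args=(tasks, results))
           for _ in range(4)]
for w in workers:
    w.start()

# query --> [ QUEUE ] --> [ workers ] --> answer/output
tasks.put(("sum", [1, 2, 3]))
tasks.put(("average", [2.0, 4.0]))
tasks.join()                       # wait until both queries are answered
for w in workers:
    tasks.put(None)                # one sentinel per worker
for w in workers:
    w.join()

answers = dict(results.get() for _ in range(2))
print(answers)  # both answers present; ordering depends on worker timing
```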
>>        >
>>> We have some very simple queries and some more complex ones that
>>> require going through a lot of data (tens of millions of 'cells'), so
>>> we want to be able to *cut down* big queries into smaller pieces (most
>>> probably dividing them by time range), both to reply faster and to
>>> prevent big queries from taking all the resources.
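[Cutting a big query down by time range, as described above, can be as simple as the following sketch. The function and the 6-hour slice width are hypothetical; each slice would become an independent sub-query whose partial answers are merged at the end.]

```python
from datetime import datetime, timedelta

def split_by_time_range(start, end, slice_width):
    """Cut [start, end) into consecutive slices of at most slice_width,
    so each piece can be dispatched as an independent sub-query."""
    slices = []
    cursor = start
    while cursor < end:
        upper = min(cursor + slice_width, end)
        slices.append((cursor, upper))
        cursor = upper
    return slices

# One day of 5-minute counters, cut into 6-hour sub-queries:
day = split_by_time_range(datetime(2014, 3, 1), datetime(2014, 3, 2),
                          timedelta(hours=6))
print(len(day))  # 4 sub-queries; partial results can be merged as they return
```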
>>        >
>>> We would like to send the results of a query straight into another
>>> Cassandra CF and to an endpoint in our system.
>>        >
>>> Finally, because of some non-technical business requirements (i.e.,
>>> our clients' IT teams' reluctance to give us more servers ;)) we will
>>> have to host the 'workers' on the same servers as the Cassandra nodes.
>>> I thought it could make sense to use Cassandra's token-aware policy to
>>> always try to make workers fetch data locally. This would let us
>>> piggyback on Cassandra's load balancing, since we use random
>>> partitioning, which normally distributes the rows evenly across our
>>> cluster, and a row is small enough to compute on without breaking the
>>> task down further. Is it possible with Storm/Trident to direct how
>>> computations are distributed (i.e., to which workers 'tasks' are
>>> sent), or does that go against its design?
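[A sketch of what "send the task to the node that owns the row" could look like, mimicking a random partitioner by hashing the row key. Everything here is hypothetical: the node names are invented, and a real token-aware policy would consult Cassandra's ring metadata rather than a bare modulo; Storm's fields grouping routes tuples by hashing a field in a broadly similar way.]

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical: one worker per Cassandra node

def owner(row_key, nodes=NODES):
    """Map a row key to the node that should process it, random-partitioner
    style: hash the key, then take it modulo the node count."""
    digest = hashlib.md5(row_key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]

# The same key always routes to the same node, so a worker co-located
# with that node can read the row locally.
assert owner("sensor-42/2014-03-21") == owner("sensor-42/2014-03-21")

buckets = {n: 0 for n in NODES}
for i in range(1000):
    buckets[owner("sensor-%d" % i)] += 1
print(buckets)  # roughly even spread across the three nodes
```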
>>        >
>>> All in all, how good a fit is Storm for this use case? What about
>>> Trident? If the project isn't a good fit, do you know of other
>>> open-source projects that address this need? The current alternative I
>>> envision is designing a homebrew solution using Akka. Any opinion is
>>> greatly appreciated!
>>>
>>> Thanks a lot for your help!
>>>
>>> Simon Chemouil
