Hi,

True, Hadoop is slow, often due to the size of the data set (most queries
traverse the whole data set every time, although various partitioning and
compression optimisations are possible). If the data set is small enough to
sit on one machine, then most of the time a good old RDBMS approach will be
faster and more robust than anything Hadoop (just not scalable).

Hadoop can be hard to set up (namenode, datanode, mapred server, Hive,
metastore, ZooKeeper...), although the Cloudera and Hortonworks packaging
makes things easier than they look.

It's true that writing low-level Hadoop map-reduce jobs can quickly become
hairy for non-trivial use cases, but high-level engines like
Hive/Impala/Presto and high-level development APIs like
Cascading/Crunch/JCascalog are actually very similar, from the developer's
point of view, to the Storm Trident API: it's map/filter/group-by thingies
all the way :)
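
To make that concrete, here is roughly what your sensor roll-up could look
like as a Trident topology. Only a sketch: the spout, field names and the
filter are invented for illustration, and the state is Trident's in-memory
test map rather than a real Cassandra-backed state.

import backtype.storm.generated.StormTopology;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Values;
import storm.trident.TridentTopology;
import storm.trident.operation.BaseFilter;
import storm.trident.operation.builtin.Sum;
import storm.trident.testing.FixedBatchSpout;
import storm.trident.testing.MemoryMapState;
import storm.trident.tuple.TridentTuple;

public class SensorRollupTopology {

    // Hypothetical filter: keep only the sensors we care about.
    public static class InterestingSensors extends BaseFilter {
        @Override
        public boolean isKeep(TridentTuple tuple) {
            return tuple.getString(0).startsWith("sensor-");
        }
    }

    public static StormTopology build() {
        // Toy in-memory spout emitting [sensorId, value]; in real life this
        // would be a queue-backed spout (Kafka, Kestrel, ...).
        FixedBatchSpout spout = new FixedBatchSpout(
                new Fields("sensorId", "value"), 3,
                new Values("sensor-1", 10L),
                new Values("sensor-2", 7L),
                new Values("sensor-1", 4L));

        TridentTopology topology = new TridentTopology();
        topology.newStream("measurements", spout)
                .each(new Fields("sensorId"), new InterestingSensors())  // filter
                .groupBy(new Fields("sensorId"))                         // group by
                .persistentAggregate(new MemoryMapState.Factory(),       // aggregate + persist
                        new Fields("value"), new Sum(), new Fields("total"));
        return topology.build();
    }
}

Squint a bit and it is the same filter/group-by/aggregate shape you would
write in Hive or Cascading; the difference is just that this version runs
continuously on the tuples as they arrive rather than on demand over stored
data.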

Thanks for the pointer to Druid, it sounds interesting indeed. I feel that
either Druid, Impala or PrestoDB might fulfil your requirement. Please keep
me posted on your conclusions if you investigate them.

Cheers,

S
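
PS: since I mentioned Cascading, here is the same pipeline sketched on the
batch side (again purely illustrative and untested; the taps, paths and
regex filter are made up). The primitives line up one for one with the
Trident version above: Each/GroupBy/Every there versus
each/groupBy/persistentAggregate here.

import java.util.Properties;

import cascading.flow.Flow;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.operation.aggregator.Sum;
import cascading.operation.regex.RegexFilter;
import cascading.pipe.Each;
import cascading.pipe.Every;
import cascading.pipe.GroupBy;
import cascading.pipe.Pipe;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

public class SensorRollupFlow {

    public static Flow buildFlow(String inPath, String outPath) {
        // Tab-separated [sensorId, value] files on HDFS; paths are illustrative.
        Tap source = new Hfs(new TextDelimited(new Fields("sensorId", "value"), "\t"), inPath);
        Tap sink   = new Hfs(new TextDelimited(new Fields("sensorId", "total"), "\t"), outPath);

        Pipe pipe = new Pipe("measurements");
        // filter: keep only the sensors we care about (same rule as the Trident sketch)
        pipe = new Each(pipe, new Fields("sensorId"), new RegexFilter("sensor-.*"));
        // group by sensor, then aggregate
        pipe = new GroupBy(pipe, new Fields("sensorId"));
        pipe = new Every(pipe, new Fields("value"), new Sum(new Fields("total")), Fields.ALL);

        return new HadoopFlowConnector(new Properties()).connect(source, sink, pipe);
    }
}

Calling flow.complete() runs it as plain map-reduce jobs over whatever
already sits in HDFS, whereas the Trident topology keeps its aggregates up
to date as the measurements come in.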

On Fri, Mar 21, 2014 at 4:42 PM, Simon Chemouil <[email protected]> wrote:

> Hi Svend,
>
> Thanks for your answer. I read here and there that Hadoop M/R frameworks
> were too slow for "real-time" queries, even though Shark and Spark are
> said to be fast. I hadn't seen PrestoDB yet, it seems interesting.
>
> Your explanation of Storm cleared things up, thank you. Now you made me
> rethink whether we should use Hadoop for the job or not! ;) I really
> would like to run those queries as fast as possible (< 15s), and my
> understanding was that the Hadoop stack was appropriate for more general
> batch data analysis that often takes time.
>
> An earlier reply showed me Druid.io, which seems to do the kind of thing
> I'm trying to do, and in a blog post they explained that while Hadoop
> suited their needs, it was too slow for running frequent user queries.
> They ended up writing their own solution (Druid). So I'm a bit scared of
> diving into Hadoop (which doesn't seem that straightforward) only to find
> in the end that it doesn't satisfy my requirements.
>
> Thanks a lot for sharing your experience, I greatly appreciate it.
>
> Simon
>
> Svend Vanderveken wrote on 21/03/2014 20:53:
> > Simon,
> >
> > The use case you describe seems to fit better with query technologies
> > running on top of map-reduce-ish architectures like Hive, Impala, Shark,
> > PrestoDB and Pig, or their developer-friendly counterparts like
> > Cascading and Crunch (hereafter named "query technos").
> >
> > Actually, all those query technos and Storm have exactly the same kinds
> > of primitives (filtering, mapping, aggregating, grouping,
> > persistence, ...), with different levels of flexibility and ease of
> > development.
> >
> > The main difference is that Storm runs pre-existing queries, coded as
> > "topologies", on the new data when it comes in, whereas query technos
> > run on-demand queries on pre-existing data.
> >
> > This implies that Storm is able to produce a result much sooner (almost
> > as soon as the new data comes in), but has less power than full-blown
> > query technos (mostly, we cannot run most multi-step map-reduce stuff
> > like large joins).
> >
> > In conclusion, anything that can be designed as a one-step (no large
> > join) map-reduce and has to be computed systematically on all data that
> > comes in can take advantage of a real-time streaming map-reduce approach
> > like Storm. Otherwise we need good old query technos. Both approaches
> > can be combined.
> >
> > A pattern like this in Storm would have several issues:
> >
> > "
> > query --> [ QUEUE ] --> [ distribute/process queries in Storm ] -->
> > answer/output
> > "
> >
> > namely:
> >
> > * error handling can be very tricky; all the Hadoop goodies like
> > automatic re-runs of failed tasks and progress tracking through the job
> > tracker would not be present
> > * the "distribute/process queries" part would eventually send a lot of
> > queries similar to "select Bla where id in [blabla]" to Cassandra.
> > Unless the resulting data set is substantially smaller than the full
> > data set, an HDFS+Avro+Parquet approach would often be much faster.
> >
> > Actually, if I'm talking so much it's because I've been exploring down
> > that path before coming to the realisation above, so now that I'm back,
> > consider this message as my "beware the leopard" sign contribution :)
> >
> > Cheers,
> >
> > S
> >
> > On Fri, Mar 21, 2014 at 10:21 AM, Simon Chemouil <[email protected]> wrote:
> >
> >     Hi,
> >
> >     I am very new to Storm and trying to evaluate whether it fits my
> >     needs or not. I work on a project where we compute reasonably simple
> >     queries (sum, average, mean, percentile...) on large amounts of very
> >     simple structured data (counters from many sensors, with a value
> >     every 5 minutes). We are currently reaching the limits of our
> >     architecture (still MySQL based) and moving to Cassandra for our
> >     data store. We also want to parallelize the queries to run on a
> >     cluster, to be able to answer the queries as fast as possible.
> >
> >     While Cassandra seems to be a good fit for our data storage needs
> >     (quick access, good write performance, fault tolerance, ...), we're
> >     still looking for a component which could help us distribute our
> >     queries over a cluster. I've been looking at Storm/Trident and
> >     running some tests/examples for the last few days, and while I do
> >     believe we "could make it happen", I would like to have the opinion
> >     of an experienced Storm user/dev to know if it truly makes sense for
> >     our problem, since we don't really have a continuous "stream" of
> >     data.
> >
> >     First, in the short term, we want to run "simple queries" over the
> >     Cassandra store. I envision things this way:
> >     query --> [ QUEUE ] --> [ distribute/process queries ] --> answer/output
> >
> >     Queries are discrete events, we don't want to keep state between
> >     them.
> >
> >     We have some very simple queries and some more complex ones that
> >     require going through a lot of data (tens of millions of 'cells'),
> >     so we want to be able to *cut down* big queries into smaller pieces
> >     (most probably divide them by time range), both to reply faster and
> >     to prevent big queries from taking all the resources.
> >
> >     We would like to send the results of the query straight into another
> >     Cassandra CF and to an endpoint in our system.
> >
> >     Finally, because of some non-technical business requirements (i.e.,
> >     our clients' IT team's reluctance to give us more servers ;)) we
> >     will have to host the 'workers' on the same servers as the Cassandra
> >     nodes. I thought it could make sense to use Cassandra's token-aware
> >     policy to always try to make workers fetch data locally. This would
> >     allow us to piggyback on Cassandra's load balancing, since we use
> >     random partitioning that normally distributes the rows evenly across
> >     our cluster, and a row is small enough to compute on without
> >     breaking the task down further. Is it possible with Storm/Trident to
> >     direct the way computations are distributed (i.e., to which worker
> >     'tasks' are sent), or is it going against its design?
> >
> >     All in all, how good a fit is Storm for this use case? What about
> >     Trident? If the project isn't a good fit, do you know of other
> >     open-source projects that address this need? The current alternative
> >     I envision is designing a homebrew solution using Akka. Any opinion
> >     is greatly appreciated!
> >
> >     Thanks a lot for your help!
> >
> >     Simon Chemouil
