Forgot to add the link to the video: http://vimeo.com/8689411
Hi Benjamin,

Wojciech raised some good points, but I believe that Hive/Hadoop can still be useful in your case. The MySQL solution that you presently have is not scalable. Hive is not a substitute for MySQL; it runs on Hadoop, which is a distributed batch-processing system. It will allow you to crunch *a lot* of data, amounts too large for a stand-alone MySQL server to deal with.

Many people (including myself) use Hive/Hadoop in conjunction with a relational DB. They do much of the number crunching via Hive/Hadoop and then write the aggregates to a (fast-access) relational DB to provide quick access to those results.

However, as Wojciech pointed out, ad-hoc queries on Hive would, in general, take longer than similar queries in MySQL. Hive was designed to deal with large amounts of data, so that's just an overhead we have to live with.

I'd suggest doing some background research on how much data you have and whether Hive/Hadoop really makes sense. Here is a good video from Alex Loddengaard to get you started. A good slide (at 15:00) compares Hadoop with an RDBMS. Later on in the same video (at 37:30), there is an example of a typical workflow with Hive and a relational DB.

Check it out and good luck!

Mark

----- Original Message -----
From: "Wojciech Langiewicz" <wlangiew...@gmail.com>
To: user@hive.apache.org
Sent: Tuesday, September 27, 2011 9:33:53 AM
Subject: Re: Hive for large statistics tables?

Hello,

I'm using Hive to query data like yours. In my case I have about 300-500 GB of data per day, so it is much larger. We use Flume to load data into Hive; data is rolled every day (this can be changed).

Hive queries, whether ad-hoc or scheduled, usually take at least 10-20 s (possibly hours), so it won't speed up your processing. Hive shows its power once you have more data than several GB per month.

I think that in your case Hive is not a good solution; you'll be better off using more powerful MySQL servers.

On 27.09.2011 11:14, Benjamin Fonze wrote:
> Dear All,
>
> I'm new to this list, and I hope I'm sending this to the right place.
>
> I'm currently using MySQL to store a large amount of visitor statistics.
> (Visits, clicks, etc.)
>
> Basically, each visit is logged in a text file, and every 15 minutes, a job
> consolidates it into MySQL, into tables that look like this:
>
> COUNTRY | DATE | USER_AGENT | REFERRER | SEARCH | ... | NUM_HITS
>
> This generates millions of rows a month, and several GB of data. Then, when
> querying these tables, it would typically take a few seconds. (Yes, there
> are indexes, etc.)
>
> I was thinking of moving all that data to a NoSQL DB like Hive, but I want to
> make sure it is suited to my purpose. Can you confirm that Hive is a good
> fit for such statistical data? More importantly, can you confirm that ad-hoc
> queries on that data will be much faster than MySQL?
>
> Thanks in advance!
>
> Benjamin.
>
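
For readers of this thread, here is a minimal sketch of the workflow Mark describes: do the heavy aggregation in a batch step, then write only the aggregates to a relational DB for quick lookups. It is an illustration under stated assumptions, not anyone's actual setup: plain Python stands in for the Hive/Hadoop batch job, SQLite (via the standard sqlite3 module) stands in for MySQL, and the log format, the table name daily_hits, and all function names are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical raw visit log lines: country, day, referrer (tab-separated).
RAW_VISITS = [
    "US\t2011-09-27\tgoogle.com",
    "US\t2011-09-27\tgoogle.com",
    "PL\t2011-09-27\tbing.com",
    "US\t2011-09-28\tgoogle.com",
]


def aggregate_hits(lines):
    """Batch step (stand-in for a Hive/Hadoop job): count hits per key."""
    counts = Counter()
    for line in lines:
        country, day, referrer = line.split("\t")
        counts[(country, day, referrer)] += 1
    return counts


def store_aggregates(conn, counts):
    """Write the small aggregate table to the relational DB (stand-in for MySQL)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS daily_hits ("
        "country TEXT, day TEXT, referrer TEXT, num_hits INTEGER, "
        "PRIMARY KEY (country, day, referrer))"
    )
    rows = [(c, d, r, n) for (c, d, r), n in counts.items()]
    # INSERT OR REPLACE makes re-running the batch for the same keys idempotent.
    conn.executemany("INSERT OR REPLACE INTO daily_hits VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def hits_for_country(conn, country):
    """Fast-access step: an ad-hoc query against the pre-aggregated table."""
    row = conn.execute(
        "SELECT COALESCE(SUM(num_hits), 0) FROM daily_hits WHERE country = ?",
        (country,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_aggregates(conn, aggregate_hits(RAW_VISITS))
    print(hits_for_country(conn, "US"))
```

The point of the split is that the expensive scan over raw visits happens once per batch, while interactive queries only touch the much smaller aggregate table.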