Re: Hive for large statistics tables?

Mark Grover Tue, 27 Sep 2011 07:50:14 -0700

Forgot to add the link to the video:
http://vimeo.com/8689411

Hi Benjamin,
Wojciech raised some good points but I believe that Hive/Hadoop can still be 
useful in your case.

The MySQL solution that you presently have is not scalable. Hive is not a
substitute for MySQL; it runs on Hadoop, which is a distributed batch
processing system. It will allow you to crunch *a lot* of data, more than a
stand-alone MySQL server would be able to deal with.

Many people (including myself) use Hive/Hadoop in conjunction with a relational
DB. They do much of the number crunching via Hive/Hadoop and then write the
aggregates to a (fast-access) relational DB to provide quick access to those
results. However, as Wojciech pointed out, ad-hoc queries on Hive would, in
general, take longer than similar queries in MySQL. Hive was designed to deal
with large amounts of data, so that's just an overhead we have to live with.
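
To make that split concrete, here is a minimal sketch in Python. It is only
an illustration under stated assumptions: the in-memory Counter stands in for
the batch aggregation that Hive/Hadoop would do, sqlite3 stands in for the
fast-access relational DB (MySQL in your case), and the table name
visit_stats, the sample records, and all function names are hypothetical.

```python
import sqlite3
from collections import Counter

# Hypothetical raw visit records: (country, date, user_agent, referrer).
RAW_VISITS = [
    ("US", "2011-09-27", "Firefox", "google.com"),
    ("US", "2011-09-27", "Firefox", "google.com"),
    ("BE", "2011-09-27", "Chrome", "bing.com"),
]


def aggregate(visits):
    """Batch step (stand-in for a Hive GROUP BY): count hits per dimension tuple."""
    return Counter(visits)


def publish(conn, counts):
    """Serving step: store the small aggregate table in a relational DB."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS visit_stats ("
        "country TEXT, date TEXT, user_agent TEXT, referrer TEXT, "
        "num_hits INTEGER, "
        "PRIMARY KEY (country, date, user_agent, referrer))"
    )
    # INSERT OR REPLACE keeps the load idempotent if the batch is re-run.
    conn.executemany(
        "INSERT OR REPLACE INTO visit_stats VALUES (?, ?, ?, ?, ?)",
        [key + (hits,) for key, hits in counts.items()],
    )
    conn.commit()


def hits_for_country(conn, country):
    """Quick lookup against the pre-computed aggregates."""
    row = conn.execute(
        "SELECT COALESCE(SUM(num_hits), 0) FROM visit_stats WHERE country = ?",
        (country,),
    ).fetchone()
    return row[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    publish(conn, aggregate(RAW_VISITS))
    print(hits_for_country(conn, "US"))
```

The point of the design is that the slow, heavy pass over the raw data happens
once per batch, while user-facing lookups only touch the small aggregate table.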

I'd suggest doing some background research on how much data you have and
whether Hive/Hadoop really makes sense. Here is a good video from Alex
Loddengaard to get you started. A good slide (at 15:00) compares Hadoop with
an RDBMS. Later on (at 37:30), in the same video, there is an example of a
typical workflow with Hive and a relational DB.

Check it out and good luck!

Mark

----- Original Message -----
From: "Wojciech Langiewicz" <wlangiew...@gmail.com>
To: user@hive.apache.org
Sent: Tuesday, September 27, 2011 9:33:53 AM
Subject: Re: Hive for large statistics tables?

Hello,
I'm using Hive to query data like yours. In my case I have about 300-500 GB
of data per day, so it is much larger. We use Flume to load data into
Hive; data is rolled every day (this can be changed).

Hive queries, whether ad-hoc or scheduled, usually take at least 10-20s
(and possibly hours), so it won't speed up your processing. Hive shows its
power when you have more data than several GB per month.

I think that in your case Hive is not a good solution; you'll be better
off using more powerful MySQL servers.

On 27.09.2011 11:14, Benjamin Fonze wrote:
> Dear All,
>
> I'm new to this list, and I hope I'm sending this to the right place.
>
> I'm currently using MySQL to store a large amount of visitor statistics.
> (Visits, clicks, etc....)
>
> Basically, each visit is logged in a text file, and every 15 minutes, a job
> consolidates it into MySQL, into tables that look like this:
>
> COUNTRY | DATE | USER_AGENT | REFERRER | SEARCH | ... | NUM_HITS
>
> This generates millions of rows a month, and several GB of data. Then, when
> querying these tables, it would typically take a few seconds. (Yes, there
> are indexes, etc...)
>
> I was thinking of moving all that data to a NoSQL DB like Hive, but I want to
> make sure it is suited to my purpose. Can you confirm that Hive is a good
> fit for such statistical data? More importantly, can you confirm that ad-hoc
> queries on that data will be much faster than in MySQL?
>
> Thanks in advance!
>
> Benjamin.
>
