Strategy for Loading Apache Logs

bichonfrise74 Tue, 10 May 2011 10:56:40 -0700

Hi,

My end goal is to load the daily Apache logs into Hive, partitioned by date.
The group has already given me some advice, but I am still stuck.


My "daily Apache logs" can contain entries dated 2 days ago, yesterday, and
today. So I created a staging_weblog table and used Hive (with a SerDe) to
load the logs into it without any partitions. Then I ran a select distinct
to get all the unique dates and converted each date into YYYY-MM-DD format.

After this I created a multi-insert statement like this:
from staging_weblog
insert overwrite table real_weblog
partition ( logdate = '<yyyy-mm-dd>')
select * where regexp_extract( logtime, '([^:]*):.*', 1) = '<dd/MMM/yyyy>'
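
For illustration, the date conversion and the statement generation described
above could be scripted roughly as follows. This is only a minimal sketch in
Python: it assumes the distinct dates come back as strings like '10/May/2011',
and the helper names to_partition_date and build_multi_insert are hypothetical
(they are not part of Hive). It reproduces the statement above, including the
'overwrite' behaviour, so it does not solve the append problem.

```python
from datetime import datetime


def to_partition_date(apache_date):
    """Convert an Apache-style date such as '10/May/2011' to '2011-05-10'."""
    # %b matches the abbreviated English month name under the default C locale.
    return datetime.strptime(apache_date, "%d/%b/%Y").strftime("%Y-%m-%d")


def build_multi_insert(apache_dates, staging="staging_weblog", target="real_weblog"):
    """Build one multi-insert statement with an insert clause per distinct date.

    Hypothetical helper: table names default to the ones used in this post.
    """
    lines = ["from {}".format(staging)]
    # De-duplicate the dates and order them chronologically.
    for apache_date in sorted(set(apache_dates), key=to_partition_date):
        lines.append("insert overwrite table {}".format(target))
        lines.append(
            "partition ( logdate = '{}' )".format(to_partition_date(apache_date))
        )
        lines.append(
            "select * where regexp_extract( logtime, '([^:]*):.*', 1) = '{}'".format(
                apache_date
            )
        )
    return "\n".join(lines)


if __name__ == "__main__":
    # Dates as they might come back from the select distinct on the staging table.
    print(build_multi_insert(["10/May/2011", "09/May/2011", "10/May/2011"]))
```

The generated text can then be passed to Hive in whatever way the daily job
already runs its queries.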

The above works fine if the partition does not exist yet. But if the
partition already exists, the statement overwrites it, replacing the old data
with the new. According to the documentation, the 'overwrite' keyword is
mandatory, but I just want to append the data to the existing partition.

Has anyone solved this problem before?

Or, to ask a more general question: how do you load your daily Apache logs
into Hadoop so that you can use Hive to process the data?
