I tried importing apache2 logs into Hive a few weeks ago and took a
look at SERDEPROPERTIES, but it was too complicated, and pasting
others' demos didn't work.

Then I came up with another solution: apache2 log -> Apache Pig (for
ETL) -> Hive external table. But I ran into a problem with Pig (which
was later solved with Cheolsoo Park's help), so I finally picked up
NodeJS.

Node is actually quite great; some advanced features of JavaScript
make ETL a lot easier and more flexible, but it can't benefit from Hadoop.
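To give an idea of what I mean by a Node ETL step, here is a minimal
sketch; the regex and field names are illustrative (assuming the usual
Apache common/combined log format), not the exact script I used:

```javascript
// Minimal sketch of a Node ETL step for Apache access log lines.
// Assumes the common log format; names are illustrative only.
const LINE_RE = /^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\S+) (\S+)/;

function parseLine(line) {
  const m = LINE_RE.exec(line);
  if (m === null) return null; // skip lines that don't match the pattern
  const [, ip, identity, user, datetime, method, path, protocol, status, size] = m;
  return { ip, identity, user, datetime, method, path, protocol, status, size };
}

// Re-emit the cleaned fields space-separated, like the Pig STORE step below.
function toOutputLine(rec) {
  return [rec.ip, rec.datetime, rec.method, rec.path,
          rec.protocol, rec.status, rec.size].join(' ');
}
```

Note that unlike the Pig script below, this captures the whole
bracketed `[date timezone]` chunk as one field, so the brackets never
reach the output.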

So, to sum up, I suggest you do some ETL on your logs before importing
them into Hive.

1. ETL with Apache Pig:
--- Pig script
read = LOAD '/home/test/input/apacheLog'
USING PigStorage(' ')
--- space as separator
AS (
          ip:CHARARRAY
        , identity:CHARARRAY
        , name:CHARARRAY
        , date:CHARARRAY
        , timezone:CHARARRAY
        , method:CHARARRAY
        , path:CHARARRAY
        , protocol:CHARARRAY
        , status:CHARARRAY
        , size:CHARARRAY
);
data = FOREACH read GENERATE
          ip
        , REPLACE(date,'\\[','')
--- Be careful with [ here: it must be escaped, otherwise it triggers
--- a regex warning and Pig throws away the whole field.
--- (Not documented anywhere.) REF: http://goo.gl/g1x1q
        , REPLACE(timezone,']','')
        , REPLACE(method,'"','')
        , path
        , REPLACE(protocol,'"','')
        , status
        , size;
STORE data INTO '/home/test/output/apacheLog' USING PigStorage(' ');
2. Import into Hive as external tables or external partitions.
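
For step 2, the DDL could look roughly like this; it's a sketch
assuming the 8 fields the Pig script above emits, and the column names
and location are illustrative:

```sql
CREATE EXTERNAL TABLE apache_log (
  ip        STRING,
  log_date  STRING,
  timezone  STRING,
  method    STRING,
  path      STRING,
  protocol  STRING,
  status    STRING,
  size      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/home/test/output/apacheLog';
```

Since the Pig job already cleaned the fields, plain
DELIMITED/TEXTFILE is enough here and no SerDe regex is needed.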

On Tue, Sep 4, 2012 at 1:08 PM, Elaine Gan <elaine-...@gmo.jp> wrote:
> Hi,
>
> I would like to seek help on loading logfiles into Hive tables.
>
> I learnt from the "Getting Started" page that we could create Hive
> tables as follows to import an apachelog into them.
> ------------------------------------------------------------------
> CREATE TABLE apachelog (
>   host STRING,
>   identity STRING,
>   user STRING,
>   time STRING,
>   request STRING,
>   status STRING,
>   size STRING,
>   referer STRING,
>   agent STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") 
> (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
>   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> )
> STORED AS TEXTFILE;
> ------------------------------------------------------------------
>
> I was trying to do the same thing, but changing the value of my
> output.format.string;
> let's say I only need host, user, and request.
>
> CREATE TABLE apachelog (
>   host STRING,
>   user STRING,
>   request STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") 
> (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
>   "output.format.string" = "%1$s %3$s %5$s"
> )
> STORED AS TEXTFILE;
>
> My questions are:
> (1) I specified only the %1, %3, and %5 variables to be put into my
> table columns, but it looks like Hive loads the first 3 variables
> instead (%1 %2 %3).
> Is there no way to have Hive load only the columns I want?
>
> (2) How can I skip lines which do not fit the input.regex pattern?
>
> Thank you.
>
> lai
>
>
