Hi Miao Miao,

Thanks for the response and solution idea.
I am not familiar with Pig (as I am still a beginner on hadoop & hive),
will check it out.
The simplest way which comes into my mind now is to awk the logs, and
create a csv file with the input values i want before i load it to my
hive table.
My hive table would look like the following, simple one without using 
SERDEPROPERTIES.
CREATE external TABLE logtable (
host STRING,
user STRING,
 request STRING)
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
STORED AS TEXTFILE;

Regards.

> I tried import apache2 log into hive a few weeks ago, and took a look
> at SERDEPROPERTIES, but it was too complicated and pasting others'
> demo wouldn't work.
> 
> Then I came up with another solution : apache2 log -> Apache Pig (for
> ETL) -> Hive external table. But I ran into a problem of Pig ( which
> was later solved with Cheolsoo Park's help), so I finally picked up
> NodeJS.
> 
> Node is actually quite great, some advanced features of JavaScript
> make ETL a lot easy and flexible, but it can't benefit from Hadoop.
> 
> So, to sum up, I suggest you do some ETL on logs before importing them
> into hive.
> 
> 1. ETL with Apache Pig:
> --- Pig script
> read = LOAD '/home/test/input/apacheLog'
> USING PigStorage(' ')
> --- space as separator
> AS (
>           ip:CHARARRAY
>         , indentity:CHARARRAY
>         , name:CHARARRAY
>         , date:CHARARRAY
>         , timezone:CHARARRAY
>         , method:CHARARRAY
>         , path:CHARARRAY
>         , protocol:CHARARRAY
>         , status:CHARARRAY
>         , size:CHARARRAY
> );
> data = FOREACH read GENERATE
>           ip
>         , REPLACE(date,'\\[','')
> --- Here be careful with [, it should be escape because it will cause
> --- a regex warning where Pig will throw away the whole field.
> --- (Not documented anywhere) REF: http://goo.gl/g1x1q
>         , REPLACE(timezone,']','')
>         , REPLACE(method,'"','')
>         , path
>         , REPLACE(protocol,'"','')
>         , status
>         , size;
> STORE data INTO '/home/test/output/apacheLog' USING PigStorage(' ');
> 2. Import into Hive as external tables or external partitions.
> 
> On Tue, Sep 4, 2012 at 1:08 PM, Elaine Gan <elaine-...@gmo.jp> wrote:
> > Hi,
> >
> > I would like to seek help on  loading logfiles to hive tables.
> >
> > I learnt from the "Getting Started" page that we could create hive
> > tables as follow to import apachelog into it.
> > ------------------------------------------------------------------
> > CREATE TABLE apachelog (
> >   host STRING,
> >   identity STRING,
> >   user STRING,
> >   time STRING,
> >   request STRING,
> >   status STRING,
> >   size STRING,
> >   referer STRING,
> >   agent STRING)
> > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> > WITH SERDEPROPERTIES (
> >   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ 
> > \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ 
> > \"]*|\".*\"))?",
> >   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> > )
> > STORED AS TEXTFILE;
> > ------------------------------------------------------------------
> >
> > I was trying to do the same thing, but changing the value of my 
> > output.form.string,
> > let's say i only need, host, user, request.
> >
> > CREATE TABLE apachelog (
> >   host STRING,
> >   user STRING,
> >   request STRING)
> > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> > WITH SERDEPROPERTIES (
> >   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ 
> > \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ 
> > \"]*|\".*\"))?",
> >   "output.format.string" = "%1$s %3$s %5$s"
> > )
> > STORED AS TEXTFILE;
> >
> > My questions are :
> > (1) I specified only %1, %3 %5 variables to be input into my table
> > column, but looks like hive load the first 3 variables into it (%1 %2
> > %3)
> > Is there no way that hive could only load the columns i want?
> >
> > (2) How can i skip lines which does not fit input.regex pattern match?
> >
> > Thank you.
> >
> > lai
> >
> >

――――――――――――――――――――――――――――――
GMOインターネット株式会社
次世代システム研究室
Elaine Gan <elaine-...@gmo.jp>
Skype: gan.elaine1
――――――――――――――――――――――――――――――
    ■ GMO INTERNET GROUP ■  http://www.gmo.jp/
――――――――――――――――――――――――――――――

Reply via email to