I tried to import an apache2 log into Hive a few weeks ago and took a look at SERDEPROPERTIES, but it was too complicated, and pasting other people's demos didn't work.
Then I came up with another solution: apache2 log -> Apache Pig (for ETL) -> Hive external table. But I ran into a problem with Pig (which was later solved with Cheolsoo Park's help), so I finally picked up NodeJS. Node is actually quite good; some advanced features of JavaScript make the ETL a lot easier and more flexible, but it can't take advantage of Hadoop. So, to sum up, I suggest you do some ETL on the logs before importing them into Hive.

1. ETL with Apache Pig:

--- Pig script
read = LOAD '/home/test/input/apacheLog' USING PigStorage(' ')   -- space as separator
       AS ( ip:CHARARRAY
          , identity:CHARARRAY
          , name:CHARARRAY
          , date:CHARARRAY
          , timezone:CHARARRAY
          , method:CHARARRAY
          , path:CHARARRAY
          , protocol:CHARARRAY
          , status:CHARARRAY
          , size:CHARARRAY
          );

data = FOREACH read GENERATE ip
       , REPLACE(date,'\\[','')
         -- Be careful with [ here: it must be escaped, otherwise it causes
         -- a regex warning and Pig throws away the whole field.
         -- (Not documented anywhere.) REF: http://goo.gl/g1x1q
       , REPLACE(timezone,']','')
       , REPLACE(method,'"','')
       , path
       , REPLACE(protocol,'"','')
       , status
       , size;

STORE data INTO '/home/test/output/apacheLog' USING PigStorage(' ');

2. Import into Hive as external tables or external partitions (a rough sketch follows below the quoted message).

On Tue, Sep 4, 2012 at 1:08 PM, Elaine Gan <elaine-...@gmo.jp> wrote:
> Hi,
>
> I would like to seek help on loading logfiles into Hive tables.
>
> I learnt from the "Getting Started" page that we could create a Hive
> table as follows to import an apachelog into it.
> ------------------------------------------------------------------
> CREATE TABLE apachelog (
>   host STRING,
>   identity STRING,
>   user STRING,
>   time STRING,
>   request STRING,
>   status STRING,
>   size STRING,
>   referer STRING,
>   agent STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
>   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> )
> STORED AS TEXTFILE;
> ------------------------------------------------------------------
>
> I was trying to do the same thing, but changing the value of my
> output.format.string. Let's say I only need host, user, request.
>
> CREATE TABLE apachelog (
>   host STRING,
>   user STRING,
>   request STRING)
> ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> WITH SERDEPROPERTIES (
>   "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
>   "output.format.string" = "%1$s %3$s %5$s"
> )
> STORED AS TEXTFILE;
>
> My questions are:
> (1) I specified only the %1, %3, %5 variables to be loaded into my table
>     columns, but it looks like Hive loads the first three variables
>     (%1 %2 %3). Is there no way that Hive could load only the columns
>     I want?
>
> (2) How can I skip lines which do not fit the input.regex pattern match?
>
> Thank you.
>
> lai
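
For step 2, here is a rough sketch of what the external table over the Pig output could look like. The table name and column names are just placeholders matching the script above, and the location is the STORE path; adjust them to your own layout:

CREATE EXTERNAL TABLE apachelog_etl (
  ip       STRING,
  log_date STRING,
  timezone STRING,
  method   STRING,
  path     STRING,
  protocol STRING,
  status   STRING,
  size     STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ' '   -- matches PigStorage(' ') in the STORE above
STORED AS TEXTFILE
LOCATION '/home/test/output/apacheLog';

If you write the Pig output into one directory per day, you could instead declare the table with PARTITIONED BY (dt STRING) and register each directory with ALTER TABLE apachelog_etl ADD PARTITION (dt='...') LOCATION '...';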