Hi Miao Miao,

Thanks for the response and the solution idea. I am not familiar with Pig
(I am still a beginner with Hadoop and Hive), so I will check it out.

The simplest way that comes to mind now is to awk the logs and create a
CSV file with only the values I want before loading it into my Hive table.
My Hive table would then be a simple one, without SERDEPROPERTIES:

CREATE EXTERNAL TABLE logtable (
  host STRING,
  user STRING,
  request STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
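For the awk step, something like this is what I have in mind (untested;
it assumes the usual Apache log format with space-separated fields, and
that the fields I keep contain no commas or embedded quotes; the file
names are just examples):

------------------------------------------------------------------
# host is $1, user is $3, and the quoted request spans $6-$8
awk '{
    gsub(/"/, "")    # drop all double quotes so the request fields come out clean
    print $1 "," $3 "," $6 " " $7 " " $8
}' access.log > logtable.csv
------------------------------------------------------------------

Then a plain LOAD DATA LOCAL INPATH 'logtable.csv' INTO TABLE logtable;
should pull it in, or I can point the table's LOCATION at the directory
holding the CSV.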
Regards.

> I tried to import apache2 logs into Hive a few weeks ago and took a look
> at SERDEPROPERTIES, but it was too complicated, and pasting others'
> demos wouldn't work.
>
> Then I came up with another solution: apache2 log -> Apache Pig (for
> ETL) -> Hive external table. But I ran into a problem with Pig (which
> was later solved with Cheolsoo Park's help), so I finally picked up
> NodeJS.
>
> Node is actually quite great; some advanced features of JavaScript
> make ETL a lot easier and more flexible, but it can't benefit from
> Hadoop.
>
> So, to sum up, I suggest you do some ETL on the logs before importing
> them into Hive.
>
> 1. ETL with Apache Pig:
>
> --- Pig script
> read = LOAD '/home/test/input/apacheLog'
>     USING PigStorage(' ')    --- space as separator
>     AS (
>         ip:CHARARRAY
>         , identity:CHARARRAY
>         , name:CHARARRAY
>         , date:CHARARRAY
>         , timezone:CHARARRAY
>         , method:CHARARRAY
>         , path:CHARARRAY
>         , protocol:CHARARRAY
>         , status:CHARARRAY
>         , size:CHARARRAY
>     );
> data = FOREACH read GENERATE
>     ip
>     , REPLACE(date, '\\[', '')
>     --- Be careful with [ here: it must be escaped, otherwise it causes
>     --- a regex warning and Pig throws away the whole field.
>     --- (Not documented anywhere.) REF: http://goo.gl/g1x1q
>     , REPLACE(timezone, ']', '')
>     , REPLACE(method, '"', '')
>     , path
>     , REPLACE(protocol, '"', '')
>     , status
>     , size;
> STORE data INTO '/home/test/output/apacheLog' USING PigStorage(' ');
>
> 2. Import into Hive as external tables or external partitions.
>
> On Tue, Sep 4, 2012 at 1:08 PM, Elaine Gan <elaine-...@gmo.jp> wrote:
> > Hi,
> >
> > I would like to seek help on loading log files into Hive tables.
> >
> > I learnt from the "Getting Started" page that we can create a Hive
> > table as follows to import an Apache log into it:
> > ------------------------------------------------------------------
> > CREATE TABLE apachelog (
> >   host STRING,
> >   identity STRING,
> >   user STRING,
> >   time STRING,
> >   request STRING,
> >   status STRING,
> >   size STRING,
> >   referer STRING,
> >   agent STRING)
> > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> > WITH SERDEPROPERTIES (
> >   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
> >   "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> > )
> > STORED AS TEXTFILE;
> > ------------------------------------------------------------------
> >
> > I was trying to do the same thing, but changing the value of my
> > output.format.string, say when I only need host, user and request:
> >
> > CREATE TABLE apachelog (
> >   host STRING,
> >   user STRING,
> >   request STRING)
> > ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> > WITH SERDEPROPERTIES (
> >   "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?",
> >   "output.format.string" = "%1$s %3$s %5$s"
> > )
> > STORED AS TEXTFILE;
> >
> > My questions are:
> > (1) I specified only the %1, %3 and %5 variables to be loaded into my
> > table columns, but it looks like Hive loads the first three capture
> > groups instead (%1, %2, %3). Is there no way to make Hive load only
> > the columns I want?
> >
> > (2) How can I skip lines which do not match the input.regex pattern?
> >
> > Thank you.
> >
> > lai
> >
> > ――――――――――――――――――――――――――――――
> > GMO Internet, Inc.
> > Next Generation System Research Lab
> > Elaine Gan <elaine-...@gmo.jp>
> > Skype: gan.elaine1
> > ――――――――――――――――――――――――――――――
> > ■ GMO INTERNET GROUP ■  http://www.gmo.jp/
> > ――――――――――――――――――――――――――――――
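
P.S. If I understand your step 2 correctly, the Hive side over the Pig
output would be something like this (just my rough sketch, not run yet:
the column names follow your GENERATE list, the LOCATION is your STORE
path, and the space delimiter matches PigStorage(' ')):

------------------------------------------------------------------
CREATE EXTERNAL TABLE apachelog_etl (
  ip STRING,
  `date` STRING,
  timezone STRING,
  method STRING,
  path STRING,
  protocol STRING,
  status STRING,
  size STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
STORED AS TEXTFILE
LOCATION '/home/test/output/apacheLog';
------------------------------------------------------------------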