Hi lai, Interesting, and ideally we must have a feature like below , but I don't think we have either partition-based UPDATE on metastore(MERGE/APPEND if partition already exists) or named/positional direct hdfs column filtering on SQL. CREATE TABLE c Š. UPDATE TABLE c PARTITION(dt=<some-date>) SELECT $0, $3 from '/user/app/test.txt';
But to your point, "output.format.string" as the name suggests is purely to provide format specifications for respective columns. And I think the column filtering must always happen at LOAD/INSERT phase not at CREATE phase. However in some use-cases the table gets loaded with all columns as multiple processes need different columns, and your process does column pruning.If columns were not used at all in any process, there is a space cost. Most commonly used for filtering columns, is to create an external table as staging table and populating final table with selected columns, partitions - FROM <all-columns-staging-external-table> weblog INSERT OVERWRITE TABLE <your-table> PARTITION(dt=<some-date>) SELECT weblog.logTime, weblog.referrerUrl WHERE weblog.campaignId = <some-campgn-If> Some even use intermediate script, TRANSFORM and INSERT into final hive table. input.regex must work for expected regular expression of record.Does it not? Try specifying a simpler regex and test with ..column_name=<regex>.. ? Thanks Rekha On 04/09/12 10:38 AM, "Elaine Gan" <elaine-...@gmo.jp> wrote: >Hi, > >I would like to seek help on loading logfiles to hive tables. > >I learnt from the "Getting Started" page that we could create hive >tables as follow to import apachelog into it. >------------------------------------------------------------------ >CREATE TABLE apachelog ( > host STRING, > identity STRING, > user STRING, > time STRING, > request STRING, > status STRING, > size STRING, > referer STRING, > agent STRING) >ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' >WITH SERDEPROPERTIES ( > "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ >\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ >\"]*|\".*\"))?", > "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" >) >STORED AS TEXTFILE; >------------------------------------------------------------------ > >I was trying to do the same thing, but changing the value of my >output.form.string, >let's say i only need, host, user, request. > >CREATE TABLE apachelog ( > host STRING, > user STRING, > request STRING) >ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' >WITH SERDEPROPERTIES ( > "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ >\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ >\"]*|\".*\"))?", > "output.format.string" = "%1$s %3$s %5$s" >) >STORED AS TEXTFILE; > >My questions are : >(1) I specified only %1, %3 %5 variables to be input into my table >column, but looks like hive load the first 3 variables into it (%1 %2 >%3) >Is there no way that hive could only load the columns i want? > >(2) How can i skip lines which does not fit input.regex pattern match? > >Thank you. > >lai > >