Hi Rekha Thank you for your response. Now i can be sure that output.format.string doesnt support what i am trying to do.
Thanks for providing solution of creating an external table as staging table in order to do this. Will try it out. Thank you. > Hi lai, > > Interesting, and ideally we must have a feature like below , but I don't > think we have either partition-based UPDATE on metastore(MERGE/APPEND if > partition already exists) or named/positional direct hdfs column filtering > on SQL. > CREATE TABLE c \xB3 > UPDATE TABLE c PARTITION(dt=<some-date>) SELECT $0, $3 from > '/user/app/test.txt'; > > > But to your point, "output.format.string" as the name suggests is purely > to provide format specifications for respective columns. > And I think the column filtering must always happen at LOAD/INSERT phase > not at CREATE phase. > > However in some use-cases the table gets loaded with all columns as > multiple processes need different columns, and your process does column > pruning.If columns were not used at all in any process, there is a space > cost. > Most commonly used for filtering columns, is to create an external table > as staging table and populating final table with selected columns, > partitions - > FROM <all-columns-staging-external-table> weblog > INSERT OVERWRITE TABLE <your-table> PARTITION(dt=<some-date>) > SELECT weblog.logTime, weblog.referrerUrl > WHERE weblog.campaignId = <some-campgn-If> > > Some even use intermediate script, TRANSFORM and INSERT into final hive > table. > > input.regex must work for expected regular expression of record.Does it > not? Try specifying a simpler regex and test with ..column_name=<regex>.. ? > > > Thanks > Rekha > > On 04/09/12 10:38 AM, "Elaine Gan" <elaine-...@gmo.jp> wrote: > > >Hi, > > > >I would like to seek help on loading logfiles to hive tables. > > > >I learnt from the "Getting Started" page that we could create hive > >tables as follow to import apachelog into it. > >------------------------------------------------------------------ > >CREATE TABLE apachelog ( > > host STRING, > > identity STRING, > > user STRING, > > time STRING, > > request STRING, > > status STRING, > > size STRING, > > referer STRING, > > agent STRING) > >ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' > >WITH SERDEPROPERTIES ( > > "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ > >\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ > >\"]*|\".*\"))?", > > "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s" > >) > >STORED AS TEXTFILE; > >------------------------------------------------------------------ > > > >I was trying to do the same thing, but changing the value of my > >output.form.string, > >let's say i only need, host, user, request. > > > >CREATE TABLE apachelog ( > > host STRING, > > user STRING, > > request STRING) > >ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' > >WITH SERDEPROPERTIES ( > > "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^ > >\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ > >\"]*|\".*\"))?", > > "output.format.string" = "%1$s %3$s %5$s" > >) > >STORED AS TEXTFILE; > > > >My questions are : > >(1) I specified only %1, %3 %5 variables to be input into my table > >column, but looks like hive load the first 3 variables into it (%1 %2 > >%3) > >Is there no way that hive could only load the columns i want? > > > >(2) How can i skip lines which does not fit input.regex pattern match? > > > >Thank you. > > > >lai > > > >