Re: loading logfile into hive tables using certain format

Elaine Gan Tue, 04 Sep 2012 18:30:04 -0700

Hi Rekha

Thank you for your response.
Now i can be sure that output.format.string doesnt support what i am
trying to do.


Thanks for providing solution of  creating an external table as staging
table in order to do this.
Will try it out.

Thank you.

> Hi lai,
> 
> Interesting, and ideally we must have a feature like below , but I don't
> think we have either partition-based UPDATE on metastore(MERGE/APPEND if
> partition already exists) or named/positional direct hdfs column filtering
> on SQL.
> CREATE TABLE c \xB3
> UPDATE TABLE c PARTITION(dt=<some-date>) SELECT $0, $3 from
> '/user/app/test.txt';
> 
> 
> But to your point, "output.format.string" as the name suggests is purely
> to provide format specifications for respective columns.
> And I think the column filtering must always happen at LOAD/INSERT phase
> not at CREATE phase.
> 
> However in some use-cases the table gets loaded with all columns as
> multiple processes need different columns, and your process does column
> pruning.If columns were not used at all in any process, there is a space
> cost.
> Most commonly used for filtering columns, is to create an external table
> as staging table and populating final table with selected columns,
> partitions - 
>  FROM <all-columns-staging-external-table> weblog
>     INSERT OVERWRITE TABLE <your-table> PARTITION(dt=<some-date>)
>     SELECT weblog.logTime, weblog.referrerUrl
>     WHERE weblog.campaignId = <some-campgn-If>
> 
> Some even use intermediate script, TRANSFORM and INSERT into final hive
> table.
> 
> input.regex must work for expected regular expression of record.Does it
> not? Try specifying a simpler regex and test with ..column_name=<regex>.. ?
> 
> 
> Thanks
> Rekha
> 
> On 04/09/12 10:38 AM, "Elaine Gan" <elaine-...@gmo.jp> wrote:
> 
> >Hi,
> >
> >I would like to seek help on  loading logfiles to hive tables.
> >
> >I learnt from the "Getting Started" page that we could create hive
> >tables as follow to import apachelog into it.
> >------------------------------------------------------------------
> >CREATE TABLE apachelog (
> >  host STRING,
> >  identity STRING,
> >  user STRING,
> >  time STRING,
> >  request STRING,
> >  status STRING,
> >  size STRING,
> >  referer STRING,
> >  agent STRING)
> >ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> >WITH SERDEPROPERTIES (
> >  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^
> >\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^
> >\"]*|\".*\"))?",
> >  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
> >)
> >STORED AS TEXTFILE;
> >------------------------------------------------------------------
> >
> >I was trying to do the same thing, but changing the value of my
> >output.form.string,
> >let's say i only need, host, user, request.
> >
> >CREATE TABLE apachelog (
> >  host STRING,
> >  user STRING,
> >  request STRING)
> >ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
> >WITH SERDEPROPERTIES (
> >  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^
> >\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^
> >\"]*|\".*\"))?",
> >  "output.format.string" = "%1$s %3$s %5$s"
> >)
> >STORED AS TEXTFILE;
> >
> >My questions are :
> >(1) I specified only %1, %3 %5 variables to be input into my table
> >column, but looks like hive load the first 3 variables into it (%1 %2
> >%3)
> >Is there no way that hive could only load the columns i want?
> >
> >(2) How can i skip lines which does not fit input.regex pattern match?
> >
> >Thank you.
> >
> >lai
> >
> >

Re: loading logfile into hive tables using certain format

Reply via email to