Re: loading logfile into hive tables using certain format

Joshi, Rekha Tue, 04 Sep 2012 02:36:45 -0700

Hi lai,

Interesting, and ideally we must have a feature like below , but I don't
think we have either partition-based UPDATE on metastore(MERGE/APPEND if
partition already exists) or named/positional direct hdfs column filtering
on SQL.
CREATE TABLE c Š.
UPDATE TABLE c PARTITION(dt=<some-date>) SELECT $0, $3 from
'/user/app/test.txt';



But to your point, "output.format.string" as the name suggests is purely
to provide format specifications for respective columns.
And I think the column filtering must always happen at LOAD/INSERT phase
not at CREATE phase.

However in some use-cases the table gets loaded with all columns as
multiple processes need different columns, and your process does column
pruning.If columns were not used at all in any process, there is a space
cost.
Most commonly used for filtering columns, is to create an external table
as staging table and populating final table with selected columns,
partitions - 
 FROM <all-columns-staging-external-table> weblog
    INSERT OVERWRITE TABLE <your-table> PARTITION(dt=<some-date>)
    SELECT weblog.logTime, weblog.referrerUrl
    WHERE weblog.campaignId = <some-campgn-If>

Some even use intermediate script, TRANSFORM and INSERT into final hive
table.

input.regex must work for expected regular expression of record.Does it
not? Try specifying a simpler regex and test with ..column_name=<regex>.. ?


Thanks
Rekha

On 04/09/12 10:38 AM, "Elaine Gan" <elaine-...@gmo.jp> wrote:

>Hi,
>
>I would like to seek help on  loading logfiles to hive tables.
>
>I learnt from the "Getting Started" page that we could create hive
>tables as follow to import apachelog into it.
>------------------------------------------------------------------
>CREATE TABLE apachelog (
>  host STRING,
>  identity STRING,
>  user STRING,
>  time STRING,
>  request STRING,
>  status STRING,
>  size STRING,
>  referer STRING,
>  agent STRING)
>ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>WITH SERDEPROPERTIES (
>  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^
>\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^
>\"]*|\".*\"))?",
>  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"
>)
>STORED AS TEXTFILE;
>------------------------------------------------------------------
>
>I was trying to do the same thing, but changing the value of my
>output.form.string,
>let's say i only need, host, user, request.
>
>CREATE TABLE apachelog (
>  host STRING,
>  user STRING,
>  request STRING)
>ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
>WITH SERDEPROPERTIES (
>  "input.regex" = "([^]*) ([^]*) ([^]*) (-|\\[^\\]*\\]) ([^
>\"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^
>\"]*|\".*\"))?",
>  "output.format.string" = "%1$s %3$s %5$s"
>)
>STORED AS TEXTFILE;
>
>My questions are :
>(1) I specified only %1, %3 %5 variables to be input into my table
>column, but looks like hive load the first 3 variables into it (%1 %2
>%3)
>Is there no way that hive could only load the columns i want?
>
>(2) How can i skip lines which does not fit input.regex pattern match?
>
>Thank you.
>
>lai
>
>

Re: loading logfile into hive tables using certain format

Reply via email to