RE: Regex and serde with hive

Raghunath, Ranjith Fri, 23 Dec 2011 12:03:34 -0800

Thanks Mark, Loren, and Vijay for helping out with this. I did try the serde 
configuration and it worked well. I did come across the performance 
implications as I increased the volume so I went ahead and cleansed the data 
prior to inserting it. Thanks again.

Thank you, 
Ranjith 

-----Original Message-----
From: Mark Grover [mailto:mgro...@oanda.com] 
Sent: Friday, December 23, 2011 9:29 AM
To: user@hive.apache.org
Subject: Re: Regex and serde with hive

Hi Ranjith,
Like Loren, I don't think the regex you are using is correct.

If you use a create table command like the following, it should work:
create external table
my_table(a STRING, b STRING, c STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES  (
"input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"",
"output.format.string" = "a:%1$s,b:%2$s,c:%3$s"
)
STORED AS TEXTFILE LOCATION 'my_location';

In general, it's good practice to test your regex before using it in the create 
table statement. Write a small test or use a website like this 
(http://www.regexplanet.com/simple/index.html) to test your regex on the data.

Keep in mind that Regex SerDe is not the most optimal SerDe (in terms of 
performance) and you might at a later stage want to re-think of what other 
SerDe's you can use.

As far as using "FIELDS TERMINATED BY" goes, you could use something like 
FIELDS TERMINATED BY ~ (or "~"), but you will still have to take care of the 
double quote(s).

Mark

Mark Grover, Business Intelligence Analyst
OANDA Corporation

www: oanda.com www: fxtrade.com
e: mgro...@oanda.com

"Best Trading Platform" - World Finance's Forex Awards 2009.
"The One to Watch" - Treasury Today's Adam Smith Awards 2009.

----- Original Message -----
From: "Loren Siebert" <lo...@siebert.org>
To: user@hive.apache.org
Sent: Friday, December 23, 2011 2:27:15 AM
Subject: Re: Regex and serde with hive

The input regexp does not look right to me. You are expecting a space between 
groups, but your example contains no spaces. And where do you handle the 
first/last quotes? Wouldn’t it look more like this:
"input.regex" = “\"([^\"~]*)[\"~]*([^\"~]*)[\"~]*([^\"~]*)\""

Rather than trying to tackle it all at once, I find it easier to start with a 
table of one column and then build up from there until I have all my columns.

On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote:

I have been struggling with this for a while so I would appreciate any advice 
that you any of you may have.

I have a file of the format

“Xyz”~”qsd”~”1234”

I created the following table definition to get the data loaded

CREATE TABLE dummy
(f1 string,
f2 string,
f3 string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "([^\"~]*) ([^\"~]*) ([^\"~]*)?",
"output.format.string" = "%1$s %2$s %3$s");

When I load the data in and try to perform a select get NULL values. Thanks 
again.

Thank you,
Ranjith

RE: Regex and serde with hive

Reply via email to