Thanks Mark, Loren, and Vijay for helping out with this. I did try the serde configuration and it worked well. I did come across the performance implications as I increased the volume so I went ahead and cleansed the data prior to inserting it. Thanks again.
Thank you, Ranjith -----Original Message----- From: Mark Grover [mailto:mgro...@oanda.com] Sent: Friday, December 23, 2011 9:29 AM To: user@hive.apache.org Subject: Re: Regex and serde with hive Hi Ranjith, Like Loren, I don't think the regex you are using is correct. If you use a create table command like the following, it should work: create external table my_table(a STRING, b STRING, c STRING) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "\"([^\"]*)\"~\"([^\"]*)\"~\"([^\"]*)\"", "output.format.string" = "a:%1$s,b:%2$s,c:%3$s" ) STORED AS TEXTFILE LOCATION 'my_location'; In general, it's good practice to test your regex before using it in the create table statement. Write a small test or use a website like this (http://www.regexplanet.com/simple/index.html) to test your regex on the data. Keep in mind that Regex SerDe is not the most optimal SerDe (in terms of performance) and you might at a later stage want to re-think of what other SerDe's you can use. As far as using "FIELDS TERMINATED BY" goes, you could use something like FIELDS TERMINATED BY ~ (or "~"), but you will still have to take care of the double quote(s). Mark Mark Grover, Business Intelligence Analyst OANDA Corporation www: oanda.com www: fxtrade.com e: mgro...@oanda.com "Best Trading Platform" - World Finance's Forex Awards 2009. "The One to Watch" - Treasury Today's Adam Smith Awards 2009. ----- Original Message ----- From: "Loren Siebert" <lo...@siebert.org> To: user@hive.apache.org Sent: Friday, December 23, 2011 2:27:15 AM Subject: Re: Regex and serde with hive The input regexp does not look right to me. You are expecting a space between groups, but your example contains no spaces. And where do you handle the first/last quotes? Wouldn’t it look more like this: "input.regex" = “\"([^\"~]*)[\"~]*([^\"~]*)[\"~]*([^\"~]*)\"" Rather than trying to tackle it all at once, I find it easier to start with a table of one column and then build up from there until I have all my columns. On Dec 22, 2011, at 8:49 PM, Raghunath, Ranjith wrote: I have been struggling with this for a while so I would appreciate any advice that you any of you may have. I have a file of the format “Xyz”~”qsd”~”1234” I created the following table definition to get the data loaded CREATE TABLE dummy (f1 string, f2 string, f3 string) ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' WITH SERDEPROPERTIES ( "input.regex" = "([^\"~]*) ([^\"~]*) ([^\"~]*)?", "output.format.string" = "%1$s %2$s %3$s"); When I load the data in and try to perform a select get NULL values. Thanks again. Thank you, Ranjith