The translate UDF does take care of non-ascii characters. It uses codepoints instead of characters.
Here is the unit test to demonstrate that: https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/udf_translate.q#L36 But you guys are right. It doesn't solve Tom's original problem because it doesn't take ranges. I created HIVE-4100 for improving regex_replace UDF. Mark On Fri, Mar 1, 2013 at 9:31 AM, Dean Wampler <dean.wamp...@thinkbiganalytics.com> wrote: > Anyone know if translate takes ranges, like some implementations? e.g., > > translate ('[a-z]', '[A-Z]') > > Of course, that probably doesn't work for non-ascii characters. > > > On Fri, Mar 1, 2013 at 11:24 AM, Tom Hall <thattommyh...@gmail.com> wrote: >> >> Thanks Dean, >> >> I dont think translate would work as the set of things to remove is >> massive. >> Yeah, it's a one-off cleanup job while exporting to try redshift on our >> datasets. >> My guess is it's something about the way hive handles strings? Tried >> "\\ufffd" as the replacement str but no joy either. >> >> Cheers again, >> Tom >> >> >> >> On 1 March 2013 17:08, Dean Wampler <dean.wamp...@thinkbiganalytics.com> >> wrote: >>> >>> I think this should work, but you might investigate using the translate >>> function instead. I suspect it will provide much better performance than >>> using regexps. Also, Are you planning to do this once to create your final >>> tables? If so, the performance overhead won't matter much. >>> >>> dean >>> >>> >>> On Fri, Mar 1, 2013 at 10:52 AM, Tom Hall <thattommyh...@gmail.com> >>> wrote: >>>> >>>> I would like to remove unicode chars that are outside the Basic >>>> Multilingual Plane [1] >>>> >>>> I thought >>>> select regexp_replace(some_column,"[^\\u0000-\\uffff]","\ufffd") from >>>> my_table >>>> would work but while the regexp does work the replacement str does not >>>> (I can paste in the literal �, which you may or may not be able to see here >>>> but it somehow did not fell right) >>>> >>>> I saw Deans previous post on using octals [2] but I think \ufffd is >>>> outside the allowable range. >>>> >>>> Cheers, >>>> Tom >>>> >>>> >>>> [1] >>>> http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane >>>> [2] >>>> http://grokbase.com/t/hive/dev/131a4n562y/unicode-character-as-delimiter >>> >>> >>> >>> >>> -- >>> Dean Wampler, Ph.D. >>> thinkbiganalytics.com >>> +1-312-339-1330 >>> >> > > > > -- > Dean Wampler, Ph.D. > thinkbiganalytics.com > +1-312-339-1330 >