Re: regexp_replace with unicode chars

Mark Grover Fri, 01 Mar 2013 10:04:06 -0800

The translate UDF does take care of non-ascii characters. It uses
codepoints instead of characters.


Here is the unit test to demonstrate that:
https://github.com/apache/hive/blob/trunk/ql/src/test/queries/clientpositive/udf_translate.q#L36

But you guys are right. It doesn't solve Tom's original problem
because it doesn't take ranges.

I created HIVE-4100 for improving regex_replace UDF.


Mark

On Fri, Mar 1, 2013 at 9:31 AM, Dean Wampler
<dean.wamp...@thinkbiganalytics.com> wrote:
> Anyone know if translate takes ranges, like some implementations? e.g.,
>
> translate ('[a-z]', '[A-Z]')
>
> Of course, that probably doesn't work for non-ascii characters.
>
>
> On Fri, Mar 1, 2013 at 11:24 AM, Tom Hall <thattommyh...@gmail.com> wrote:
>>
>> Thanks Dean,
>>
>> I dont think translate would work as the set of things to remove is
>> massive.
>> Yeah, it's a one-off cleanup job while exporting to try redshift on our
>> datasets.
>> My guess is it's something about the way hive handles strings? Tried
>> "\\ufffd" as the replacement str but no joy either.
>>
>> Cheers again,
>> Tom
>>
>>
>>
>> On 1 March 2013 17:08, Dean Wampler <dean.wamp...@thinkbiganalytics.com>
>> wrote:
>>>
>>> I think this should work, but you might investigate using the translate
>>> function instead. I suspect it will provide much better performance than
>>> using regexps. Also, Are you planning to do this once to create your final
>>> tables? If so, the performance overhead won't matter much.
>>>
>>> dean
>>>
>>>
>>> On Fri, Mar 1, 2013 at 10:52 AM, Tom Hall <thattommyh...@gmail.com>
>>> wrote:
>>>>
>>>> I would like to remove unicode chars that are outside the Basic
>>>> Multilingual Plane [1]
>>>>
>>>> I thought
>>>> select regexp_replace(some_column,"[^\\u0000-\\uffff]","\ufffd") from
>>>> my_table
>>>> would work but while the regexp does work the replacement str does not
>>>> (I can paste in the literal �, which you may or may not be able to see here
>>>> but it somehow did not fell right)
>>>>
>>>> I saw Deans previous post on using octals [2] but I think \ufffd is
>>>> outside the allowable range.
>>>>
>>>> Cheers,
>>>> Tom
>>>>
>>>>
>>>> [1]
>>>> http://en.wikipedia.org/wiki/Plane_%28Unicode%29#Basic_Multilingual_Plane
>>>> [2]
>>>> http://grokbase.com/t/hive/dev/131a4n562y/unicode-character-as-delimiter
>>>
>>>
>>>
>>>
>>> --
>>> Dean Wampler, Ph.D.
>>> thinkbiganalytics.com
>>> +1-312-339-1330
>>>
>>
>
>
>
> --
> Dean Wampler, Ph.D.
> thinkbiganalytics.com
> +1-312-339-1330
>

Re: regexp_replace with unicode chars

Reply via email to