Just an update on this topic.

The regexp_replace function uses a single escape \ followed by the hex code to 
work properly.
The ^ is for inversion, i.e. it matches any character that does not fall within 
the specified range, and those are the characters that get replaced.

E.g.
regexp_replace(<column_name>, '[^\x00-\xad]', '°')


I used this UTF-8 table for the hex codes:
http://www.utf8-chartable.de/unicode-utf8-table.pl
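
For anyone who wants to sanity-check the pattern outside Drill, here is a minimal Python sketch (Python's re engine, not Drill's Java one, so edge behavior may differ; the clean function name and the underscore placeholder are illustrative only):

```python
import re

# Replace every character OUTSIDE the \x00-\xad range with a placeholder.
# The leading ^ inside [] inverts the class, as described above, and
# \x00-\xad is the same single-escape hex range used in the Drill example.
def clean(text):
    return re.sub(r'[^\x00-\xad]', '_', text)

print(clean('plain ascii'))   # unchanged: every character is in range
print(clean('caf\u00e9'))     # U+00E9 is above \xad -> 'caf_'
```
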

—Andries 



On Mar 9, 2015, at 5:38 PM, Andries Engelbrecht <[email protected]> 
wrote:

> I should have clarified more. It turns out the encoding is UTF8, but various 
> characters (since UTF8 supports all of Unicode and then some) within the 
> encoding were causing issues for tools using Drill.
> Using regexp_replace to replace the undesired characters resolved the issue, 
> since those characters are not needed and only caused problems.
> 
> Thanks
> —Andries
> 
> 
> On Mar 9, 2015, at 12:42 PM, Ted Dunning <[email protected]> wrote:
> 
>> This is dangerous if the source file does not contain UTF-8 but you treat
>> it as if it does.  The problem is that characters with high bit set will be
>> treated as leading characters of a multi-byte UTF-8 character encoding.
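
Ted's point is easy to demonstrate: a high-bit byte from a non-UTF-8 file looks like the lead byte of a multi-byte sequence to a UTF-8 decoder. A short Python sketch (the Latin-1 source encoding here is an assumption for illustration):

```python
# 'é' in Latin-1 is the single byte 0xE9, which has the high bit set.
data = 'caf\u00e9'.encode('latin-1')          # b'caf\xe9'

# To a UTF-8 decoder, 0xE9 announces a three-byte sequence, but the
# required continuation bytes are missing, so decoding fails outright.
try:
    data.decode('utf-8')
    print('decoded cleanly')
except UnicodeDecodeError as err:
    print('invalid UTF-8:', err)
```
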
>> 
>> On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <
>> [email protected]> wrote:
>> 
>>> The answer for my problem was to use regexp_replace.
>>> 
>>> In this case just using an underscore to replace any characters outside of
>>> the expected range using the UTF8 character set table.
>>> 
>>> Instead of just selecting the column, I used this:
>>> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>>> 
>>> The ^ in front is to select the inverse (or any characters not
>>> specified in the range inside the square brackets []).
>>> 
>>> (Make a note of this, as I’m sure many will run into this issue in the
>>> future with various JSON data.)
>>> 
>>> —Andries
>>> 
>>> 
>>> 
>>> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <[email protected]>
>>> wrote:
>>> 
>>>> How can I convert JSON data with various characters in a text field to a
>>> more usable UTF8 encoding (even if characters not in UTF8 are dropped)?
>>>> Or, if need be, filter out the records with characters not in UTF8?
>>>> 
>>>> CONVERT_TO is for binary, and CAST AS VARCHAR still passes these
>>> characters through.
>>>> 
>>>> Results in sqlline work, but when using tools through ODBC various
>>> failures occur (especially with tools not capable of handling characters outside
>>> of UTF8).
>>>> 
>>>> 
>>>> Thanks
>>>> —Andries
>>> 
>>> 
> 
