Was having some issues with the hex codes on large data sets. Using the following gave better results for most western languages:
regexp_replace(<column>, '[^ -~¡-ÿ]', '°')

—Andries


On Mar 12, 2015, at 8:24 AM, Andries Engelbrecht <[email protected]> wrote:

> Just an update on this topic.
>
> The regexp_replace function uses a single escape \ and then the hex code to
> function properly.
> The ^ is for inverse, or basically matches any character that does not fall
> within the specified range, to be replaced.
>
> I.e.
> regexp_replace(<column_name>, '[^\x00-\xad]', '°')
>
> I used this UTF-8 table for hex codes.
> http://www.utf8-chartable.de/unicode-utf8-table.pl
>
> —Andries
>
>
> On Mar 9, 2015, at 5:38 PM, Andries Engelbrecht <[email protected]> wrote:
>
>> I should have clarified more. It turns out the encoding is UTF-8, but
>> various characters (since UTF-8 supports all of Unicode and then some)
>> within the encoding were causing issues for tools using Drill.
>> Using regexp_replace to replace the undesired characters resolved the
>> issue, since those characters are not used and only caused issues.
>>
>> Thanks
>> —Andries
>>
>>
>> On Mar 9, 2015, at 12:42 PM, Ted Dunning <[email protected]> wrote:
>>
>>> This is dangerous if the source file does not contain UTF-8 but you treat
>>> it as if it does. The problem is that characters with the high bit set
>>> will be treated as leading characters of a multi-byte UTF-8 character
>>> encoding.
>>>
>>> On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <[email protected]> wrote:
>>>
>>>> The answer for my problem was to use regexp_replace.
>>>>
>>>> In this case, just using an underscore to replace any characters outside
>>>> of the expected range, using the UTF-8 character set table.
>>>>
>>>> Instead of just selecting the column, I used this:
>>>> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>>>>
>>>> The ^ at the front selects the inverse (or any characters not specified
>>>> in the range of the square brackets []).
>>>>
>>>> (Make a note of this, as I'm sure many will run into this issue in the
>>>> future with various JSON data.)
>>>>
>>>> —Andries
>>>>
>>>>
>>>> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <[email protected]> wrote:
>>>>
>>>>> How can I convert JSON data with various characters in a text field to
>>>>> a more usable UTF-8 encoding? (even if characters not in UTF-8 are
>>>>> dropped)
>>>>> Or, if need be, filter out the records with characters not in UTF-8?
>>>>>
>>>>> CONVERT_TO is for binary, and casting as varchar still passes these
>>>>> characters through.
>>>>>
>>>>> Queries in sqlline work fine, but when using tools through ODBC various
>>>>> failures occur (especially tools not capable of handling characters
>>>>> outside of UTF-8).
>>>>>
>>>>> Thanks
>>>>> —Andries
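
For context, here is a minimal sketch of the final expression in a complete Drill query. The file name clicks.json and the field name comment are hypothetical, used for illustration only:

    -- Replace any character outside printable ASCII (space..~) and the
    -- printable Latin-1 range (¡..ÿ) with a placeholder character.
    SELECT regexp_replace(t.`comment`, '[^ -~¡-ÿ]', '°') AS comment_clean
    FROM dfs.`/data/clicks.json` t;

The class [ -~¡-ÿ] spans space (0x20) through tilde (0x7E) and inverted exclamation mark (0xA1) through ÿ (0xFF), which is why it covers most western languages; the leading ^ inverts the class so only characters outside those ranges get replaced.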

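The original question also asked about filtering out offending records entirely. One way to do that using nothing beyond regexp_replace itself (a sketch under the same assumed file and field names) is to strip the unwanted characters and keep only the rows where nothing changed:

    -- Keep only records whose field contains no characters outside
    -- the accepted ranges (strip-and-compare).
    SELECT t.`comment`
    FROM dfs.`/data/clicks.json` t
    WHERE regexp_replace(t.`comment`, '[^ -~¡-ÿ]', '') = t.`comment`;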