The answer for my problem was to use regexp_replace.

In this case just using an underscore to replace any characters outside of the 
expected range using the UTF8 character set table.

Instead of just selecting the column used this
regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')

The ^ in the front is to select the inverse (or any characters not specified in 
the range of the square brackets []

(Make a note of this as I’m sure many will run into this issues in the future 
with various JSON data).

—Andries



On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <[email protected]> 
wrote:

> How can I convert JSON data with various characters in a text field to a more 
> usable UTF8 encoding? (even if characters not in UTF8 is dropped)
> Or if needs be filter out the records with characters not in UTF8?
> 
> CONVERT_TO is for binary and cast as varchar still passes these characters 
> through.
> 
> Results in sqlline works, but when using tools through ODBC various failures 
> occur (especially tools not capable of handling characters outside of UTF8).
> 
> 
> Thanks
> —Andries

Reply via email to