This is dangerous if the source file does not contain UTF-8 but you treat
it as if it does.  The problem is that bytes with the high bit set will be
treated as the leading bytes of multi-byte UTF-8 sequences.
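
For example, here is a minimal sketch in Drill SQL (BINARY_STRING and
CONVERT_FROM are documented Drill functions; the byte values below are the
Latin-1 encoding of 'élm' and are chosen purely for illustration):

    -- 0xE9 is 'é' in Latin-1. Its bit pattern (1110 1001) looks like the
    -- lead byte of a 3-byte UTF-8 sequence, so a UTF-8 decoder tries to
    -- read the following 'l' and 'm' bytes as continuation bytes and
    -- fails (or emits replacement characters) instead of producing 'élm'.
    SELECT CONVERT_FROM(BINARY_STRING('\xE9\x6C\x6D'), 'UTF8');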

On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <[email protected]> wrote:

> The answer to my problem was to use regexp_replace.
>
> In this case I just replaced any characters outside of the expected range
> (per the UTF-8 character set table) with an underscore.
>
> Instead of just selecting the column directly, I used this (a fuller query
> sketch follows the quoted thread below):
> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>
> The ^ at the front selects the inverse (that is, any character not
> specified in the ranges inside the square brackets []).
>
> (Make a note of this, as I’m sure many will run into this issue in the
> future with various JSON data.)
>
> —Andries
>
>
>
> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <[email protected]>
> wrote:
>
> > How can I convert JSON data with various characters in a text field to a
> > more usable UTF-8 encoding? (even if characters not in UTF-8 are dropped)
> > Or, if need be, how can I filter out the records with characters not in
> > UTF-8?
> >
> > CONVERT_TO is for binary output, and CAST AS VARCHAR still passes these
> > characters through.
> >
> > Results in sqlline work, but various failures occur when using tools
> > through ODBC (especially tools not capable of handling characters outside
> > of UTF-8).
> >
> >
> > Thanks
> > —Andries
>
>
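
For reference, here is the fix described above as a complete query. This is
a minimal sketch: the storage path and column name (dfs.`/data/clicks.json`,
`comment`) are hypothetical, and the pattern is copied verbatim from the
thread. Note that the extra '-' between \\x7f and \\x80 sits inside the
negated character class as a literal hyphen, so '-' characters would be
replaced as well.

    -- Replace anything outside the accepted character ranges with '_'.
    SELECT regexp_replace(t.`comment`, '[^\\x00-\\x7f-\\x80-\\xad]', '_') AS comment_clean
    FROM dfs.`/data/clicks.json` t;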
