This is dangerous if the source file does not actually contain UTF-8 but you treat it as if it does. The problem is that any byte with the high bit set will be interpreted as the lead byte of a multi-byte UTF-8 sequence.
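A quick way to see that failure mode in Drill itself (a sketch using the BINARY_STRING and CONVERT_FROM conversion functions; the single byte is purely illustrative):

    -- 0xE9 is 'e' with an acute accent in Latin-1, but in UTF-8 that same
    -- byte is the lead byte of a three-byte sequence, so a lone 0xE9 is
    -- not valid UTF-8. BINARY_STRING turns the hex escape into raw bytes.
    SELECT CONVERT_FROM(BINARY_STRING('\xE9'), 'UTF8') AS decoded
    FROM (VALUES(1));
    -- Expect an error or a mangled result here: the decoder looks for two
    -- continuation bytes that never arrive.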
On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <[email protected]> wrote:

> The answer for my problem was to use regexp_replace.
>
> In this case, just use an underscore to replace any characters outside of
> the expected range in the UTF-8 character set table.
>
> Instead of just selecting the column, use this:
> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>
> The ^ at the front selects the inverse (i.e., any characters not
> specified in the range inside the square brackets []).
>
> (Make a note of this, as I'm sure many will run into this issue in the
> future with various JSON data.)
>
> —Andries
>
>
> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <[email protected]> wrote:
>
> > How can I convert JSON data with various characters in a text field to
> > a more usable UTF-8 encoding (even if characters not in UTF-8 are
> > dropped)? Or, if need be, how can I filter out the records with
> > characters not in UTF-8?
> >
> > CONVERT_TO is for binary, and CAST AS VARCHAR still passes these
> > characters through.
> >
> > Results in sqlline work, but when using tools through ODBC various
> > failures occur (especially with tools not capable of handling
> > characters outside of UTF-8).
> >
> >
> > Thanks
> > —Andries
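To make the quoted fix concrete, here is a minimal sketch of a full query around it; the dfs path and field name are hypothetical stand-ins, and the character class is copied verbatim from the thread:

    -- Minimal sketch; the file path and field name are hypothetical.
    -- Every character outside the listed byte ranges becomes '_'.
    SELECT regexp_replace(t.`comment`, '[^\\x00-\\x7f-\\x80-\\xad]', '_') AS cleaned
    FROM dfs.`/data/records.json` t;

Because the class is negated with ^, characters inside the listed ranges pass through untouched and only the out-of-range bytes are rewritten, which is why valid text survives the cleanup.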
