Just an update on this topic. The regexp_replace function uses a single escape \ followed by the hex code to work properly. The ^ is for inversion, i.e. it matches any character that does not fall within the specified range, and those characters are then replaced.
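As an illustration only (Python's re module rather than Drill's regexp_replace, and a made-up input string), the same inverted-character-class replacement can be sketched like this:

```python
import re

def clean_field(text, replacement='_'):
    # [^\x00-\xad] matches any character OUTSIDE the range U+0000..U+00AD;
    # each match is replaced, leaving only characters in the allowed range.
    return re.sub(r'[^\x00-\xad]', replacement, text)

# U+00E9 (é) and U+2019 (’) fall outside the range and are replaced.
print(clean_field('caf\u00e9 \u2019smart\u2019 quotes'))  # caf_ _smart_ quotes
```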
I.e. regexp_replace(<column_name>, '[^\x00-\xad]', '°')

I used this UTF-8 table for hex codes: http://www.utf8-chartable.de/unicode-utf8-table.pl

—Andries

On Mar 9, 2015, at 5:38 PM, Andries Engelbrecht <[email protected]> wrote:

> I should have clarified more. It turns out the encoding is UTF8, but various
> characters (since UTF8 supports all unicode and then some) within the
> encoding were causing issues for tools using Drill.
> Using regexp_replace to replace the undesired characters resolved the issue,
> since the characters are not used and only caused problems.
>
> Thanks
> —Andries
>
>
> On Mar 9, 2015, at 12:42 PM, Ted Dunning <[email protected]> wrote:
>
>> This is dangerous if the source file does not contain UTF-8 but you treat
>> it as if it does. The problem is that characters with the high bit set will
>> be treated as leading characters of a multi-byte UTF-8 character encoding.
>>
>> On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <
>> [email protected]> wrote:
>>
>>> The answer for my problem was to use regexp_replace.
>>>
>>> In this case, just using an underscore to replace any characters outside
>>> of the expected range using the UTF8 character set table.
>>>
>>> Instead of just selecting the column, I used this:
>>> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>>>
>>> The ^ in the front is to select the inverse (or any characters not
>>> specified in the range in the square brackets []).
>>>
>>> (Make a note of this, as I'm sure many will run into this issue in the
>>> future with various JSON data.)
>>>
>>> —Andries
>>>
>>>
>>>
>>> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <[email protected]>
>>> wrote:
>>>
>>>> How can I convert JSON data with various characters in a text field to a
>>>> more usable UTF8 encoding? (even if characters not in UTF8 are dropped)
>>>> Or, if need be, filter out the records with characters not in UTF8?
>>>>
>>>> CONVERT_TO is for binary, and cast as varchar still passes these
>>>> characters through.
>>>>
>>>> Results in sqlline work, but when using tools through ODBC various
>>>> failures occur (especially tools not capable of handling characters
>>>> outside of UTF8).
>>>>
>>>>
>>>> Thanks
>>>> —Andries
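Ted's warning in the thread above can be demonstrated directly: a byte with the high bit set in non-UTF-8 data (e.g. Latin-1) is interpreted as the lead byte of a multi-byte sequence when decoded as UTF-8, so decoding fails. A minimal Python sketch, using a hypothetical input string not taken from the thread:

```python
# Latin-1 encodes 'é' as the single byte 0xE9, whose high bit is set.
latin1_bytes = 'caf\u00e9'.encode('latin-1')  # b'caf\xe9'

# Decoded as UTF-8, 0xE9 is taken as the lead byte of a 3-byte sequence,
# so decoding raises an error instead of yielding 'é'.
try:
    latin1_bytes.decode('utf-8')
except UnicodeDecodeError as err:
    print('decode failed:', err)
```

This is why treating a non-UTF-8 file as UTF-8 and then regex-replacing characters can silently corrupt or reject data: the damage happens at the decoding step, before any replacement runs.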
