Was having some issues with the hex codes on large data sets. Using the following gave better results for most western languages:
regexp_replace(<column>, '[^ -~¡-ÿ]', '°')

—Andries


On Mar 12, 2015, at 8:24 AM, Andries Engelbrecht <[email protected]> wrote:

> Just an update on this topic.
>
> The regexp_replace function uses a single escape \ and then the hex code to
> function properly.
> The ^ is for inverse, or basically matches any character that does not fall
> within the specified range, to be replaced.
>
> I.e.
> regexp_replace(<column_name>, '[^\x00-\xad]', '°')
>
> I used this UTF-8 table for hex codes.
> http://www.utf8-chartable.de/unicode-utf8-table.pl
>
> —Andries
>
>
> On Mar 9, 2015, at 5:38 PM, Andries Engelbrecht <[email protected]> wrote:
>
>> I should have clarified more. It turns out the encoding is UTF-8, but
>> various characters (since UTF-8 supports all of Unicode and then some)
>> within the encoding were causing issues for tools using Drill.
>> Using regexp_replace to replace the undesired characters resolved the
>> issue, since those characters are not used and only caused issues.
>>
>> Thanks
>> —Andries
>>
>>
>> On Mar 9, 2015, at 12:42 PM, Ted Dunning <[email protected]> wrote:
>>
>>> This is dangerous if the source file does not contain UTF-8 but you treat
>>> it as if it does. The problem is that characters with the high bit set
>>> will be treated as leading characters of a multi-byte UTF-8 character
>>> encoding.
>>>
>>> On Mon, Mar 9, 2015 at 11:51 AM, Andries Engelbrecht <[email protected]> wrote:
>>>
>>>> The answer for my problem was to use regexp_replace.
>>>>
>>>> In this case, just using an underscore to replace any characters outside
>>>> of the expected range, using the UTF-8 character set table.
>>>>
>>>> Instead of just selecting the column, I used this:
>>>> regexp_replace(<column or field name>, '[^\\x00-\\x7f-\\x80-\\xad]', '_')
>>>>
>>>> The ^ at the front selects the inverse (or any characters not specified
>>>> in the range of the square brackets []).
>>>>
>>>> (Make a note of this, as I'm sure many will run into this issue in the
>>>> future with various JSON data.)
>>>>
>>>> —Andries
>>>>
>>>>
>>>> On Mar 2, 2015, at 6:23 PM, Andries Engelbrecht <[email protected]> wrote:
>>>>
>>>>> How can I convert JSON data with various characters in a text field to
>>>>> a more usable UTF-8 encoding? (even if characters not in UTF-8 are
>>>>> dropped)
>>>>> Or, if need be, filter out the records with characters not in UTF-8?
>>>>>
>>>>> CONVERT_TO is for binary, and casting as varchar still passes these
>>>>> characters through.
>>>>>
>>>>> Queries in sqlline work fine, but when using tools through ODBC various
>>>>> failures occur (especially tools not capable of handling characters
>>>>> outside of UTF-8).
>>>>>
>>>>> Thanks
>>>>> —Andries
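
For context, here is a minimal sketch of the final expression in a complete Drill query. The file name clicks.json and the field name comment are hypothetical, used for illustration only:

    -- Replace any character outside printable ASCII (space..~) and the
    -- printable Latin-1 range (¡..ÿ) with a placeholder character.
    SELECT regexp_replace(t.`comment`, '[^ -~¡-ÿ]', '°') AS comment_clean
    FROM dfs.`/data/clicks.json` t;

The class [ -~¡-ÿ] spans space (0x20) through tilde (0x7E) and inverted exclamation mark (0xA1) through ÿ (0xFF), which is why it covers most western languages; the leading ^ inverts the class so only characters outside those ranges get replaced.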

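The original question also asked about filtering out offending records entirely. One way to do that using nothing beyond regexp_replace itself (a sketch under the same assumed file and field names) is to strip the unwanted characters and keep only the rows where nothing changed:

    -- Keep only records whose field contains no characters outside
    -- the accepted ranges (strip-and-compare).
    SELECT t.`comment`
    FROM dfs.`/data/clicks.json` t
    WHERE regexp_replace(t.`comment`, '[^ -~¡-ÿ]', '') = t.`comment`;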