From: "Abdij Bhat" <[EMAIL PROTECTED]>
> If a UNICODE strings is converted to UTF8, will the UTF8 encoded string
> contain and control character or escape sequences? If so, is it possible to
> eliminate the same?
UTF-8 never introduces control bytes or escape sequences of its own. All 7-bit ASCII characters, including any C0 controls (0x00 to 0x1F) already present in the string, are kept unchanged as single bytes, and every non-ASCII character is encoded as a sequence made only of bytes >= 0x80. So the encoding of a non-ASCII character never contains a C0 control byte or any other ASCII byte, and UTF-8 will never create an escape sequence. It will, however, in many cases contain bytes in the C1 control range (0x80 to 0x9F), since these are valid trailing bytes.

But be warned that you should not create escape sequences containing bytes >= 0x80 after the leading escape, as they may conflict with a UTF-8 decoder. If your escape sequences are made only of 7-bit ASCII bytes, this is safe, and you can mix plain-text ASCII, C0 controls, escape sequences and UTF-8 sequences for non-ASCII characters.

Note that the C1 controls of Unicode and ISO-8859-* (U+0080 to U+009F) are converted to a pair of bytes in UTF-8: the first byte is 0xC2 and the second varies between 0x80 and 0x9F. So a C1 control appears in UTF-8 as a 0xC2 "prefix" followed by the same byte that encodes it in ISO-8859-*.
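If you want to check these byte-level properties yourself, here is a minimal Python sketch. The helper name utf8_bytes and the sample characters are my own choices for illustration and are not part of any standard API; it relies only on Python's built-in str.encode.

```python
def utf8_bytes(text: str) -> bytes:
    """Return the UTF-8 encoding of text (hypothetical helper for illustration)."""
    return text.encode("utf-8")

# ASCII, including C0 controls such as ESC (0x1B), stays one unchanged byte each,
# so an all-ASCII escape sequence survives the conversion as is.
assert utf8_bytes("\x1b[0m") == bytes([0x1B, 0x5B, 0x30, 0x6D])

# Every byte in the encoding of a non-ASCII character is >= 0x80,
# so no C0 control or other ASCII byte can appear inside it.
for ch in ["\u00e9", "\u20ac", "\U0001F600"]:
    assert all(b >= 0x80 for b in utf8_bytes(ch))

# Trailing bytes may fall in the C1 range: U+20AC encodes as E2 82 AC.
assert utf8_bytes("\u20ac") == bytes([0xE2, 0x82, 0xAC])

# C1 controls U+0080..U+009F become 0xC2 followed by the same byte value.
for cp in range(0x80, 0xA0):
    assert utf8_bytes(chr(cp)) == bytes([0xC2, cp])
```

All assertions pass silently. If you need to eliminate C1 controls, one option is to drop the code points U+0080 to U+009F from the string before encoding it, rather than filtering bytes afterwards, since the bytes 0x80 to 0x9F also occur as trailing bytes of ordinary characters.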

