Re: UTF8 and COntrol Characters

Philippe Verdy Wed, 05 Nov 2003 05:17:29 -0800

From: "Abdij Bhat" <[EMAIL PROTECTED]>
>  If a UNICODE string is converted to UTF8, will the UTF8 encoded string
> contain any control characters or escape sequences? If so, is it possible
> to eliminate the same?


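UTF-8 does not introduce any C0 control bytes: a byte below 0x20 appears
only if the Unicode string itself contains that control character. It will,
however, very often contain bytes in the C1 control range (0x80 to 0x9F),
because the trail bytes of multi-byte sequences range from 0x80 to 0xBF.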

UTF-8 keeps all 7-bit ASCII characters unchanged and never uses 7-bit ASCII
byte values when encoding non-ASCII characters (every byte of a multi-byte
UTF-8 sequence is >= 0x80). The encoding will therefore never create an
escape sequence by itself.
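
As a rough illustration of the point above, here is a minimal Python sketch.
The helper name check_ascii_transparency and the sample string are my own
assumptions, not part of any library; the only real API used is str.encode.

```python
def check_ascii_transparency(text: str) -> bool:
    """Return True if every ASCII character encodes to itself and every
    non-ASCII character encodes only to bytes >= 0x80."""
    for ch in text:
        encoded = ch.encode("utf-8")
        if ord(ch) < 0x80:
            # ASCII (including C0 controls such as ESC) stays one identical byte.
            if encoded != bytes([ord(ch)]):
                return False
        else:
            # Non-ASCII: no byte of the sequence may fall in the ASCII range.
            if any(b < 0x80 for b in encoded):
                return False
    return True


# Hypothetical sample: ASCII text, an ESC-based sequence, and non-ASCII text.
sample = "A\x1b[0m \u00e9 \u20ac"
print(check_ascii_transparency(sample))

# The ESC byte (0x1B) never shows up in the encoding of non-ASCII characters.
print(0x1B in "\u00e9\u20ac".encode("utf-8"))
```

The first line printed should be True and the second False, since ESC can
only appear in the output when it was already present in the input.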

But be warned that you should not create escape sequences containing bytes
of value 0x80 or above after the leading escape (such bytes may conflict
with a UTF-8 decoder). If your escape sequences are made only of 7-bit ASCII
bytes, then this is safe, and you can mix plain-text ASCII, C0 controls,
escape sequences and UTF-8 sequences for non-ASCII characters.
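
To make the warning concrete, here is a small Python sketch. The helper
decodes_as_utf8 and both byte strings are illustrative assumptions; the
"safe" stream uses an ANSI-style sequence made only of 7-bit bytes, while
the "unsafe" one puts a 0x9B byte right after ESC.

```python
ESC = b"\x1b"


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if a strict UTF-8 decoder accepts the byte stream."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Escape sequences made only of 7-bit ASCII bytes, mixed with UTF-8 text.
safe = ESC + b"[1m" + "caf\u00e9".encode("utf-8") + ESC + b"[0m"

# An escape followed by a byte >= 0x80: the decoder sees a stray trail byte.
unsafe = ESC + b"\x9b" + "caf\u00e9".encode("utf-8")

print(decodes_as_utf8(safe))
print(decodes_as_utf8(unsafe))
```

The first stream should decode cleanly and the second should be rejected,
because 0x9B on its own is not a valid start of a UTF-8 sequence.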

Note that C1 controls of Unicode and ISO-8859-* will be converted to a pair
of bytes in UTF-8, with the first byte being 0xC2, and the second byte
varying between 0x80 and 0x9F (so C1 controls appear in UTF-8 as a 0xC2
"prefix" followed by the same byte that encodes them in ISO-8859-*).
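
A short Python sketch of that mapping follows. The function name encode_c1
is hypothetical; "latin-1" is Python's codec name for ISO-8859-1, used here
as one representative of the ISO-8859-* family.

```python
def encode_c1(code_point: int) -> bytes:
    """Encode a single C1 control (U+0080..U+009F) as UTF-8."""
    if not 0x80 <= code_point <= 0x9F:
        raise ValueError("not a C1 control code point")
    return chr(code_point).encode("utf-8")


for cp in (0x80, 0x85, 0x9F):
    utf8 = encode_c1(cp)
    latin1 = chr(cp).encode("latin-1")  # single byte in ISO-8859-1
    # UTF-8 form is 0xC2 followed by the same byte as in ISO-8859-1.
    print(f"U+{cp:04X}: latin-1={latin1.hex()} utf-8={utf8.hex()}")
```

For U+0085 (NEL), for instance, this gives the single byte 85 in ISO-8859-1
and the pair c2 85 in UTF-8.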
