Re: UTF8 and COntrol Characters

Doug Ewell Tue, 04 Nov 2003 23:57:53 -0800

Abdij Bhat <Abdij dot Bhat at kshema dot com> wrote:

>  If a UNICODE string is converted to UTF8, will the UTF8 encoded
> string contain any control character or escape sequences? If so, is it
> possible to eliminate the same?


By "UNICODE" I assume you mean UTF-16, which is one encoding form of
Unicode (as is UTF-8).

By "control character[s] or escape sequences" I assume you mean
characters below 0x20 in ASCII, or below U+0020 in Unicode (any encoding
form).  (That is, I assume we are not talking about the so-called C1
control characters from U+0080 to U+009F.)

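A UTF-8 string will contain control characters if and only if the
corresponding UTF-16 string contains characters in the control range
(below U+0020).  The conversion itself adds no control characters or
escape sequences: every character above U+007F is encoded using only
bytes of value 0x80 or higher.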
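
As a rough illustration of the "if and only if" above, here is a small
Python sketch.  The helper names has_c0_controls and
utf8_has_control_bytes are made up for this example, and the sample
strings are arbitrary; it only checks the C0 range (below U+0020), as
assumed earlier in this message.

```python
def has_c0_controls(text: str) -> bool:
    """Return True if any code point in text is below U+0020."""
    return any(ord(ch) < 0x20 for ch in text)


def utf8_has_control_bytes(text: str) -> bool:
    """Return True if the UTF-8 encoding of text has any byte below 0x20."""
    return any(byte < 0x20 for byte in text.encode("utf-8"))


# Arbitrary sample strings: pure ASCII, ASCII with a tab, Latin-1 letters,
# CJK characters, and a mix of a line feed with the euro sign.
samples = [
    "plain ASCII",
    "tab\there",
    "na\u00efve caf\u00e9",
    "\u65e5\u672c\u8a9e",
    "line\nbreak \u20ac",
]

for s in samples:
    # The two checks should always agree: control bytes appear in the
    # UTF-8 form exactly when control characters appear in the source.
    assert has_c0_controls(s) == utf8_has_control_bytes(s)
```

If the check reports control characters, they were already present in
the source string, so that is where they would have to be removed.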

For strings that consist entirely of ASCII characters, the UTF-8
representation is identical to the ASCII representation.
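
A quick way to see this, again as a Python sketch with arbitrary sample
text (nothing here comes from the specification itself):

```python
# An ASCII-only string encodes to the same bytes in ASCII and in UTF-8.
text = "Hello, world!"
assert text.encode("utf-8") == text.encode("ascii")

# A non-ASCII character (here U+20AC, the euro sign) becomes a multi-byte
# sequence in which every byte is 0x80 or higher, so it can never be
# mistaken for an ASCII character or a control character.
euro = "\u20ac".encode("utf-8")
assert euro == b"\xe2\x82\xac"
assert all(byte >= 0x80 for byte in euro)
```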

For the specification of UTF-8, including some examples that should help
answer your questions, see
http://www.unicode.org/versions/Unicode4.0.0/ch03.pdf (pp. 24-25).

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
