Frank Yung-Fong Tang <YTang0648 at aol dot com> wrote:

> I think verdy_p's message is not very clear below. I think I know what he mean
> but the message itself need some clearification.

And I, in turn, will try to clarify the clarification. :-)

>>> If a UNICODE strings is converted to UTF8, will the UTF8 encoded
>>> string contain and control character or escape sequences? If so, is
>>> it possible to eliminate the same?
>>
>> UTF-8 sequences will not contain any C0 control bytes,
>
> I think you should say
> "The UTF-8 seqences will not use C0 control code area (0x00-0x1F) to
> represent characters. " instead of "UTF-8 sequences will not contain
> any C0 control bytes, " because it is legal to have C0 control code
> inside UTF-8, for example, TAB, CR, LF are all in c0 area and
> perfectly legal in UTF-8.

Neither is exactly right. UTF-8 does use the C0 control area, to
represent C0 control characters. What I think everyone is trying to say
is that UTF-8 does not use that area for *any other* characters, which
of course was a basic design goal of UTF-8.

>> but it will in many cases use contain C1 control bytes (between 0x80
>> and 0x9F).
>
> I think the rigth way to say is is "UTF-8 may use bytes 0x80 to 0x9F
> as part of multiple byte UTF-8 byte serquence for a single Unicode
> characters. And those bytes is defined as C1 control area. Therefore,
> code code sequence with 0x80 and 0x9f should not be insert into UTF-8
> STREAM, but could be insert into UTF-16 STREAM (by using two bytes
> 0x0080 - 0x009F) .

No code sequence that is not valid UTF-8 should ever be inserted into a
UTF-8 stream anyway. I don't see the point of this wording.

>> UTF-8 keeps all 7-bit ASCII characters unchanged and does not create
>> any sequence of bytes containing them for non 7-bit ASCII characters
>> (all sequences of UTF-8 bytes are made of bytes>=0x80). UTF-8 will
>> then never create any escape sequence.

But it will retain any existing escape sequence consisting entirely of
[7-bit] ASCII characters.

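For anyone who wants to check these byte-range claims, here is a small Python sketch. The sample string and the `byte_classes` helper are made up for illustration and are not taken from anyone's application.

```python
# Illustrative sketch only: the sample text and the helper name are assumptions.

def byte_classes(text: str) -> dict:
    """Count the bytes of text's UTF-8 encoding by value range."""
    counts = {"c0": 0, "ascii_other": 0, "c1_range": 0, "high_other": 0}
    for b in text.encode("utf-8"):
        if b < 0x20:
            counts["c0"] += 1           # C0 control area (0x00-0x1F)
        elif b < 0x80:
            counts["ascii_other"] += 1  # rest of 7-bit ASCII (0x20-0x7F)
        elif b < 0xA0:
            counts["c1_range"] += 1     # byte values 0x80-0x9F
        else:
            counts["high_other"] += 1   # byte values 0xA0-0xFF
    return counts

# ESC [ 1 m is a 7-bit escape sequence; U+00E9 and U+20AC are non-ASCII.
sample = "\x1b[1mcaf\u00e9 \u20ac\t\r\n"
encoded = sample.encode("utf-8")

# Bytes below 0x80 in the output are exactly the ASCII characters of the
# input, in the same order, so the escape sequence and TAB/CR/LF survive.
ascii_in = [ord(c) for c in sample if ord(c) < 0x80]
ascii_out = [b for b in encoded if b < 0x80]

print(encoded.hex(" "))
print(byte_classes(sample))
print(ascii_in == ascii_out)
```

In this sample the only byte in the 0x80-0x9F range is 0x82, the middle byte of the three-byte sequence for U+20AC. It is part of one character and not a C1 control on its own.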
>> But be warned that you should not create escape sequences containing
>> bytes >= 0x80 after the leading escape (in this case, they may
>> conflict with a UTF-8 decoder).. If your escape sequences are made
>> only of 7-bit ASCII bytes, then this is safe, and you can mix
>> plain-text ASCII, C0 controls, escape sequences and UTF-8 sequences
>> for non ASCII characters.
>
> Not only "You should not create escape sequences containing bytes >=
> 0x80 after the leading escape " but also "You should not create escape
> sequences containing bytes >= 0x80 as the leading escape "

We aren't in a position to tell Abdij not to use the C1 control area.
It's his app. But for the record, he has stated that he isn't:

> Yes, the control characters are entirely below 0x20 ASCII.

so there is neither a problem nor a topic for discussion.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/

