Антон Тагунов <[EMAIL PROTECTED]> wrote regarding Definition D5:

> Every time I read the following passage in
> http://www.unicode.org/unicode/uni2book/ch03.pdf
> I get confused:
>
> - A single abstract character may correspond to more than one code
>   value - ...
> - Multiple code values may be required to represent a single abstract
>   character.

I don't see a discrepancy between these two statements, at least not one
that the phrase "more than one code value sequence" would clarify.

>   For example, a byte is the code unit in SJIS:...
>   ideographs require two code values

I do think the text here is unclear about "code values" and "code
units."  It defines them as the same thing and then switches between
the two terms, which is confusing in a standard.
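
As a rough illustration of the SJIS example quoted above, here is a
small Python sketch.  It assumes Python's built-in "shift_jis" codec is
an acceptable stand-in for SJIS, and the helper name sjis_code_units is
made up for this message; it is not from the standard.

```python
def sjis_code_units(ch):
    """Return the bytes (8-bit code units) that encode `ch` in Shift_JIS."""
    return list(ch.encode("shift_jis"))

# An ASCII letter needs one byte; a kanji ideograph needs two.
for ch in ["A", "漢"]:
    units = sjis_code_units(ch)
    print(ch, len(units), [f"0x{u:02X}" for u in units])
```

Whether you call those bytes "code values" or "code units," the count
per character is the same: one for the ASCII letter, two for the
ideograph.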

To me, a more useful distinction is the one in Technical Report #17,
"Character Encoding Model"
<http://www.unicode.org/unicode/reports/tr17/> between "code point" and
"code unit."  A code point is something like U+0410 for CYRILLIC CAPITAL
LETTER A.  The code units are what express that code point in a given
encoding form: the two bytes 0xD0 0x90 in UTF-8, or the single 32-bit
word 0x00000410 in UTF-32.
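
The same distinction can be sketched in a few lines of Python.  This is
only an illustration; the helper names utf8_code_units and
utf32_code_units are hypothetical, and I use the big-endian "utf-32-be"
codec purely so that no byte order mark gets in the way.

```python
ch = "\u0410"  # CYRILLIC CAPITAL LETTER A

def utf8_code_units(text):
    """Return the 8-bit code units of `text` in UTF-8."""
    return list(text.encode("utf-8"))

def utf32_code_units(text):
    """Return the 32-bit code units of `text` in UTF-32."""
    raw = text.encode("utf-32-be")  # big-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 4], "big") for i in range(0, len(raw), 4)]

print(f"code point:        U+{ord(ch):04X}")
print("UTF-8 code units: ", [f"0x{u:02X}" for u in utf8_code_units(ch)])
print("UTF-32 code units:", [f"0x{u:08X}" for u in utf32_code_units(ch)])
```

One code point, two different sequences of code units, depending only
on the encoding form.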

Incorporating the concepts from UTR #17 into the main text is one place
where the "language tightening" project for Unicode 4.0 should really
pay off.

-Doug Ewell
 Fullerton, California


